xml concepts - campus.unibo.itcampus.unibo.it/3996/2/w2-xml.pdf · xml concepts prof. andrea...

76
XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università di Bologna a Cesena

Upload: others

Post on 08-Sep-2019

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML ConceptsProf Andrea Omicini

Distributed Systems Sistemi Distribuiti L-AAY 2007-2008

Alma Mater StudiorumndashUniversitagrave di Bologna a Cesena

Outline

Introducing XMLXML FundamentalsDocument Types Definitions (DTDs)NamespacesInternationalisationXML amp CSSDOM amp SAX

2

Introducing XML

What is XMLA W3C Standardhttpwwww3orgXML

A mark-up language for text documentsderived from SGML (Standard General Markup Language)

ISO 8879 httpwwwisochcated16387htmleXtensible Markup Language

A meta-markup languageto define markup languagessuch as XHTML XSLT XML Schemahellip

A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies

4

What XML is not

XML is nota programming languagea network-transport protocola document presentation languagea database (manager)

It can be used (and it is actually) in all of those contexts but it remains a markup language

5

Why Markup LanguagesMarkup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt

6

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 2: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Outline

Introducing XMLXML FundamentalsDocument Types Definitions (DTDs)NamespacesInternationalisationXML amp CSSDOM amp SAX

2

Introducing XML

What is XMLA W3C Standardhttpwwww3orgXML

A mark-up language for text documentsderived from SGML (Standard General Markup Language)

ISO 8879 httpwwwisochcated16387htmleXtensible Markup Language

A meta-markup languageto define markup languagessuch as XHTML XSLT XML Schemahellip

A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies

4

What XML is not

XML is nota programming languagea network-transport protocola document presentation languagea database (manager)

It can be used (and it is actually) in all of those contexts but it remains a markup language

5

Why Markup LanguagesMarkup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt

6

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 3: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Introducing XML

What is XMLA W3C Standardhttpwwww3orgXML

A mark-up language for text documentsderived from SGML (Standard General Markup Language)

ISO 8879 httpwwwisochcated16387htmleXtensible Markup Language

A meta-markup languageto define markup languagessuch as XHTML XSLT XML Schemahellip

A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies

4

What XML is not

XML is nota programming languagea network-transport protocola document presentation languagea database (manager)

It can be used (and it is actually) in all of those contexts but it remains a markup language

5

Why Markup LanguagesMarkup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt

6

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 4: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

What is XMLA W3C Standardhttpwwww3orgXML

A mark-up language for text documentsderived from SGML (Standard General Markup Language)

ISO 8879 httpwwwisochcated16387htmleXtensible Markup Language

A meta-markup languageto define markup languagessuch as XHTML XSLT XML Schemahellip

A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies

4

What XML is not

XML is nota programming languagea network-transport protocola document presentation languagea database (manager)

It can be used (and it is actually) in all of those contexts but it remains a markup language

5

Why Markup LanguagesMarkup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt

6

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 5: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

What XML is not

XML is nota programming languagea network-transport protocola document presentation languagea database (manager)

It can be used (and it is actually) in all of those contexts but it remains a markup language

5

Why Markup LanguagesMarkup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt

6

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 6: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Why Markup LanguagesMarkup

encoding embodied in the document specifying document properties as well as properties of information contained

for instance formatting instructionsmore generally structural semantic information

knowledge vs dataMarks Markups

tag used to qualify label text chunkseg HTML tags

XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt

6

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 7: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML X for eXtensibilityBasic idea of XML

a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages

ThenXML is quite free in generalit can be ldquoextended

actually specialisedto define more specific ad hoc markup languages

No predefined XML markups as it happens instead in HTMLthey need to be defined

who does define themcan we do this how

7

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 8: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Hey too many Languages already

Application domains are more and morenumerouscomplexspecific

Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations

Engineers working with computational ICT systems will be called to use a number of different artificial languages but also

to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages

ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo

8

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 9: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Applications

XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex

XML ApplicationsXML-defined markup languages

defined through a precise syntaxDTD or XML Schema

they may be either standard or customMost standard XML applications are W3C

such asXSLTXML SchemaXHTML

9

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 10: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML for Portable Data

Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format

Text text textboth data and markupall in the XML file

XML document structure simple amp cleareasy to parsewell-documented

That is why XML is already everwhere

10

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 11: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

How XML Looks likeltxml version=10 encoding=utf-8gt

ltdocrootgt

ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt

ltbodygt

ltpgtA list of things I likeltpgt

ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt

ltbodygtltdocrootgt

11

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 12: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

How XML Looks like from a Browser

12

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 13: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

How to Work with XML

XML is textso any text-editor is perfectly fine

A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better

Visualisation is a different matterbrowsers do something

but XML is not a presentation language sohellipwe need to understand

what an XML document ishow XML works

13

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 14: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

What is an XML Document

It can beA text fileA record in a databaseA run-time construction in memoryhellip

In any case it can be handled and trasmitted by any system capable of dealing with text documents

ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt

14

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 15: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

How does XML WorkWho handles XML documents

after it has been producedhow why

XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax

XML validating parserswhen applicable

there is either a DTD or a Schemachecking validity

Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc

15

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 16: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Where is XML actually used

Everywhere already

16

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 17: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from

HTML was the first successful application of SGMLbut had obvious limitations

too complexmore than 150 pagesnever implemented fully

too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)

XML 10 (February 1998)Then a flow

namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc

17

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 18: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Fundamentals

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 19: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

A Simple XML Document

ltplayergt Carlo Nervoltplayergt

19

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 20: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Document amp Files

This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms

Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml

ltplayergt Carlo Nervoltplayergt

20

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 21: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Elements amp Tags

The document contains a single elementof type player

Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt

In between the tags lays the elementrsquos content Carlo Nervotags are markup

the most common form of markup but there are other kindscontent is character data

including the white space between Carlo amp Nervo

ltplayergt Carlo Nervoltplayergt

21

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 22: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Tag Syntax

Very similar to HTML tagsat least superficially

lttaggt for start tags lttaggt for end tagslttag gt for empty tags

tags with no content like ltbr gt or lthr gtXML is case sensitive

so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML

HTML JavaScript amp XHTML hellip

22

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 23: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Trees A Simple Example

player

name surname team team

Carlo Nervo Bologna Mantova

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

23

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 24: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

An XML Document is an XML Tree

An XML Document has a tree-like structureone and only one root

root element or document elementeach node element can have one or more child elements

each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements

Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted

nesting needs to be perfect overlapping not allowed24

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 25: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player

After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt

hellip

ltbiographygt

XML Documents for written narrative such as articles reports blogs books novels

elements with mixed contentnot easy for automated processing and exchange

25

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 26: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Attributes

Elements can be labelled by attributesattributes are specified in the start tag

and in the only tag of empty elementsany number of attributes can be in principle associated to an element

An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element

Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

26

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 27: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Using Elements or Attributes

Attributes are for meta-data about the element and content is information of the element

maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based

attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element

Attributes are quite useful in narrative-based XML documents

ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt

27

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 28: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs

to increase efficiency and abate complexityAn XML name can include

any letterlatin or even non-latin like ideographs

any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces

An XML name may not include other punctuation signs nor any sort of white spaces

and can begin only with letters ideographs or underscore28

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 29: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure

so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code

All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char

Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt

becomes the parsed character dataBatman amp Robin

29

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 30: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Entity References

ampentityreferencean entity is something defined outside the normal flow of the XML document

out of the XML treeused for constants common values external values etc

through an entity referenceUsers of any sort may define their own entities

well see how soon for instance through DTDs

30

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 31: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Pre-defined XML Entities

Markup Entity Descriptionamplt lt less-thenampgt gt grater-than

ampamp amp ampersandampquot double quoteampapos single quote

31

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 32: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

CDATA Sections

Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show

CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters

After parsing no way to tell where a text came from a CDATA section or not

32

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 33: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Comments

Easylt-- Comment --gt

It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure

they can appear anywhere even before the root elementbut not inside a tag or a comment

Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs

to give info to a computational agents processing instructions

33

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 34: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML Processing Instructions

Need to pass information for a given application through the parsercomments may disappear at any stage of the process

Processing instructions have this very endlttarget hellip gt

The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt

A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root

34

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 35: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

The XML Declaration

Looks like an XML processing instructionbut it is not just the XML declaration

It is optionalbut if there should be the first thing in the document absolutely

not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt

Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)

optional default UnicodeStandalone means that it has no external DTD

optional default no

35

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 36: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Checking Well-FormednessMain rules

perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip

Tools on the WebJust look around

36

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 37: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DTD

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 38: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Flexibility or Rigidity

XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario

Sometimes some strict rule is requiredsome control over syntax should be enforced

like a football player should have at least one teamDocument Type Definition (DTD)

to define which XML documents are validValidity is not mandatory as well-formedness

how to handle errors is optional

38

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 39: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Validation

A valid XML Document includes a DTD the document satisfiesMain principle

everything not permitted is forbiddenthat is DTDs specifies positive examples

Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid

Many things a DTD does not saywe stick with what we can specify

39

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 40: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DTD ishellip

SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions

It allows XML designers to define a grammar for their documentstypical syntax-based approach

maybe limited but easy to implementMaybe DTD is not the future of XML document validation

XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still

40

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 41: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

A Simple DTD Example

We do not go too deep into DTD syntaxwe just look at the example above and comment

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

41

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 42: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DTD Declaration

DTD is declared here as internalbut could be declared separately

ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource

ltDOCTYPE football_player SYSTEM httphellipgt

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

42

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 43: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DTD Declarations Define or Use

So you maydefine your own DTD and

either include it in your XML documentor save it as an independent document and refer from one or more XML docs

or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs

43

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 44: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Element Declarations

A player element contain one name one surname and one or more teams

in that precise orderand they are just parsed character data (PCDATA)

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

44

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 45: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Some Syntax is for sequence

to define ordered lists| is for choice

to provide for alternativessuffixes

for zero or more occurrences+ for one or more occurrences for zero or one occurrence

parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level

ANY for free-form contentEMPTY for empty element

45

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 46: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Attribute Declarations

A team element has a current attributewhich is mandatory

IMPLIED would say optional insteadand can be either yes or no

enumeration as an attribute type

ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt

46

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 47: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Attribute Defaults

IMPLIEDthe attribute is optional

REQUIREDthe attribute is mandatory

FIXEDeither it is explicitly specified or not it has a given value

literalthe default value is the literal quoted string

47

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 48: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Attribute TypesCDATA

any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS

more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces

ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document

IDan XML name unique in the document working as an identifier

IDREF IDREFSreference(s) to IDs in the documents

NOTATIONname of a notation used amp defined in the document (rare)

enumeration

48

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 49: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Other DTD Declarations etc

ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt

NOTATION declarationswho cares actually

We stop heremore only for those who need it

49

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 50: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Namespaces

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 51: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

What are Namespaces for

Distinguishdifferent XML applications may use the same names

at any scale from personal to world-widea namespace allows them to be clearly distinguished

Groupnames of elements and attributes of the same XML application can be grouped together

to be more easily recognised and handledExample set is an element in both SVG and MathML applications

what if I have to use them togethernamespaces can be used to disambiguate names

51

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 52: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Syntax for Namespace Use

Qualified namesprefix local_part

Examples of qualified namesor QNames or raw names

rdfdescription xlinktype xsltemplateUsed for both element and attribute names

52

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 53: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Associating Prefixes to URIExample

a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt

then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt

URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names

not necessarily pointing to an actually resource53

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 54: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Setting Default Namespaces

xmlns attributealone no suffix

ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt

all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace

no need for the svg prefix made explicity

54

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 55: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Internationalisation

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 56: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

What does Text Mean

ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)

character setASCII being the most (un)famous now Unicode

A character encoding determines how code points are mapped onto bytes

so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings

Any XML document is a text documentso encoding should be declared

56

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 57: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

The XML Encoding Part of the XML Declaration

ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)

See also XML-Defined Character SetsUnicode and ISO are the most used families

Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped

it is a text declaration but no longer a XML declaration

57

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 58: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart

for multi-lingual docsxmllang attribute

can be associated to any elementdetermines the language of the element

Values are to be found in ISO 639standard two letters for each language knownif not there IANA

prefix i-such as i-navajo i-klingon hellip

if not there too such as for user-defined tagsprefix x-

such as x-quenya58

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 59: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Encoding for Portability

Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability

When transmitting communicating through text-based files many errors typically occur

which are often not easy to catchXML abilities to

handle encoding precisely and accuratelyembody encoding information within each document

make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time

59

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 60: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML amp CSS

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 61: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML on Browsers

Different experiences with different browserswhen trying to visualise an XML document

XML however can be transformedto become easier to handle by standard browsers

Two main approachesWeb-based one XML + CSSXML-based one XSL

In the following we explore the XML + CSS issue

61

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 62: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Cascading Style Sheets

Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents

Standard W3Chttpw3corgStyleCSS

Goalsdescribing how to present elements of a document

spanning over a range of different mediaseparating style description from content and structure

In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning

62

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 63: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

CSS An Example

63

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 64: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed

a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet

Processing directiveto associate CSS to XML

ltxml-stylesheet type=textcss href=nomefilecss gt

CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip

No need for DTD or Schemaeven though the browser could anyway complainhellip

64

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 65: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

XML + CSS Example The XML Doc

65

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 66: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Example How Mozilla Visualises it[without CSS Style Sheet]

66

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 67: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Example How Mozilla Visualises it[with CSS Style Sheet]

67

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 68: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DOM amp SAX

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 69: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Manipulating XML Representing information in an XML Document

and presenting it somehowis not enough for most non-trivial application scenarios

Mostly we often need to manipulateaccess delete modify

parts of an XML documentwhich either may or may not be and XML file

This is typically dome through programming language of many sortsthrough ad hoc API

The most used hated deprecated widespread areDOMSAX

69

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 70: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Document Object Model

httpwwww3orgDOMstandard W3C as usual

The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents

It applies to HTML as well as XMLIt is essentially an API

standardised for Java amp ECMAScriptbut can be extended to other languages

There is no time here to go deep into DOMwe just try to understand its nature goals and scope

70

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 71: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DOM amp LevelsDOM views an XML tree as a data structure

similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it

maybe huge memory consumptionIt is quite large and complexhellip

Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML

Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range

Level 3 Core W3C Working Draft April 2002adds minor new features

71

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 72: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

DOM Nodes

An XML document is a treeThe tree contains nodes

one of them is a root nodenodes possibly have siblings children one parent content tag etc

The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation

It also defines which kind of child nodes they should could have

72

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 73: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes

Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice

many solutions for Javasee for instance httpjavasuncomxmljaxp

73

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 74: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

74

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 75: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory

it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating

This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist

75

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76

Page 76: XML Concepts - campus.unibo.itcampus.unibo.it/3996/2/W2-xml.pdf · XML Concepts Prof. Andrea Omicini Distributed Systems / Sistemi Distribuiti L-A A.Y. 2007-2008 Alma Mater Studiorum–Università

Simple API for XML (SAX)

Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc

flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc

A very simple modelgood for simple applicationsand also to avoid memory abuse

Not so well-supported as DOM isin terms of standardisationas well as of tools

76