
UNIVERSITY OF MORATUWA

PIE 202: Internet Technologies

HTML, XHTML and XML

Dr. Ajith Pasqual

Master of Business Administration/Postgraduate Diploma in Information Technology

Semester 2 module


Some Information

Contact Information:

Dr. Ajith Pasqual

Dept. of Electronic and Telecommunication Engineering,

University of Moratuwa,

Tel: 2650634 Ext. 3321

Email: [email protected]

Web Resources:

Web Page: http://www.ent.mrt.ac.lk/~pasqual/courses/PG/MBA/pie202


Introduction

• XML stands for the eXtensible Markup Language.

• It was developed by the W3C (World Wide Web Consortium, http://www.w3.org), primarily to overcome limitations in HTML.

• HTML has been the standard language used for web-based publishing.
– about a dozen tags in version 1.0
– about 100 tags in version 4.0

• HTML has some problems:
– It is a loosely typed language.
– Strong syntax checking is not done.
– Grouping of tags is arbitrary.

• As a result, some browsers do not display some pages properly.


Introduction (2)

• Since HTML has grown from publishing scientific documents to publishing almost anything, there is a growing demand for application-specific tags on top of the default tags.
– Electronic commerce applications would need tags for product references, prices, names, addresses, and more.
– Streaming would need tags to control the flow of images and sound.
– Search engines would need more precise tags for keywords and descriptions.
– Security would need tags for signing.

• On the other hand, some applications demand fewer tags:
– I-Mode phones (in Japan)
– WAP phones
– PDA browsers


Introduction (3)

• XML has been developed to address the above problems.

• But it is unlikely that XML will replace HTML, at least in the near future.

• However, HTML is converging towards XML through XHTML, which has a stricter syntax.


Applications

• Large Web site maintenance. XML would work behind the scenes (more specifically on the server) to simplify the maintenance of HTML documents.

• Exchange of information between organizations.

• Offloading and reloading of databases.

• Syndicated content, where content is being made available to different Web sites.

• Electronic commerce applications where different organizations collaborate to serve a customer.

• Scientific applications with new markup languages for mathematical and chemical formulas.

• Electronic books with new markup languages to express rights and ownership.

• Handheld devices and smartphones, with new markup languages optimized for these so-called "alternative" devices.


Applications (2)

• There are two classes of applications for XML:
– publishing and
– data exchange (also known as application integration).

• Data exchange applications include most electronic commerce applications.


XHTML

What is XHTML?

• XHTML stands for EXtensible HyperText Markup Language

• XHTML is intended to replace HTML

• XHTML is almost identical to HTML 4.01

• XHTML is a stricter and cleaner version of HTML

• XHTML is HTML defined as an XML application

XHTML 1.0 became an official W3C Recommendation on January 26, 2000.

XHTML is a combination of HTML and XML (eXtensible Markup Language).

XHTML consists of all the elements in HTML 4.01 combined with the syntax of XML


XHTML …

Why XHTML?

• Many pages on the WWW contain "bad" HTML

• Different browser technologies

• XML is a markup language where everything has to be marked up correctly, which results in "well-formed" documents.

• XML was designed to describe data and HTML was designed to display data.

• Combining the strengths of HTML and XML gives a markup language that is useful now and in the future: XHTML

• XHTML pages can be read by all XML enabled devices


XHTML …

Major Differences between HTML & XHTML:

• XHTML elements must be properly nested

• XHTML documents must be well-formed

• Tag names must be in lowercase

• All XHTML elements must be closed

Elements Must Be Properly Nested

In XHTML all elements must be properly nested within each other like this:

<b><i>This text is bold and italic</i></b>
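The nesting rule can be checked mechanically with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the badly nested counter-example are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The example from the slide: <i> opens and closes inside <b>.
properly_nested = "<b><i>This text is bold and italic</i></b>"

# Hypothetical counter-example: the end tags are in the wrong order.
badly_nested = "<b><i>This text is bold and italic</b></i>"

print(is_well_formed(properly_nested))  # True
print(is_well_formed(badly_nested))     # False
```

An HTML browser would typically render both fragments; an XML parser rejects the second one outright.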


XHTML

Documents Must Be Well-formed

All XHTML elements must be nested within the <html> root element. All other elements can have sub (children) elements. Sub elements must be in pairs and correctly nested within their parent element. The basic document structure is:

<html>

<head> ... </head>

<body> ... </body>

</html>

Tag Names Must Be in Lower Case

This is because XHTML documents are XML applications. XML is case-sensitive. Tags like <br> and <BR> are interpreted as different tags


XHTML …

All XHTML Elements Must Be Closed

Non-empty elements must have an end tag.

<p>This is a paragraph</p>

<p>This is another paragraph</p>

Empty Elements Must also Be Closed

Empty elements must either have an end tag or the start tag must end with />

This is a break<br />

Here comes a horizontal rule:<hr />

Here's an image <img src="happy.gif" alt="Happy face" />

For compatibility with present browsers, add an extra space before the "/", i.e. write <br /> rather than <br/>.


XHTML …

XHTML Syntax

• Attribute names must be in lower case

• Attribute values must be quoted

• Attribute minimization is forbidden

• The id attribute replaces the name attribute

• The XHTML DTD defines mandatory elements

Attribute Names must be in Lower Case

<table width="100%">

Attribute Values must be Quoted

<table width="100%"> NOT <table width=100%>


XHTML …

Attribute Minimization is Forbidden

Wrong:

<dl compact> <input checked> <input readonly> <input disabled> <option selected> <frame noresize>

Correct:

<dl compact="compact"> <input checked="checked" /> <input readonly="readonly" /> <input disabled="disabled" /> <option selected="selected"> <frame noresize="noresize" />

The id Attribute replaces the Name Attribute

HTML 4.01 defines a name attribute for the elements a, applet, frame, iframe, img, and map. In XHTML the name attribute is deprecated. Use id instead:

<img src="picture.gif" id="picture1" />

NOT

<img src="picture.gif" name="picture1" />


XHTML …

Mandatory XHTML Elements

All XHTML documents must have a DOCTYPE declaration. The html, head and body elements must be present, and the title must be present inside the head element.

This is a minimum XHTML document template:

<!DOCTYPE Doctype goes here>

<html>

<head>

<title>Title goes here</title>

</head>

<body> Body text goes here </body>

</html>

Note: The DOCTYPE declaration is not a part of the XHTML document itself. It is not an XHTML element, and it should not have a closing tag.


XHTML …

The 3 Document Type Definitions

• DTD specifies the syntax of a web page in SGML.

• DTD is used by SGML applications, such as HTML, to specify rules that apply to the markup of documents of a particular type, including a set of element and entity declarations.

• XHTML is specified in an SGML document type definition or 'DTD'.

• An XHTML DTD describes in precise, computer-readable language the allowed syntax and grammar of XHTML markup

The XHTML standard defines three Document Type Definitions

• STRICT

• TRANSITIONAL (Most common)

• FRAMESET


XHTML …

The <!DOCTYPE> is Mandatory

An XHTML document consists of three main parts:

• the DOCTYPE

• the Head

• the Body

The basic document structure is:

<!DOCTYPE ...>

<html>

<head>

<title>... </title>

</head>

<body> ... </body>

</html>

The DOCTYPE declaration should always be the first line in an XHTML document


XHTML …An XHTML Example

This is a simple (minimal) XHTML document:

<!DOCTYPE html

PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html>

<head>

<title>simple document</title>

</head>

<body>

<p>a simple paragraph</p>

</body>

</html>


XHTML …

The DOCTYPE declaration defines the document type:

<!DOCTYPE html


XHTML 1.0 Strict

<!DOCTYPE html


Use this when you want really clean markup, free of presentational clutter. Use this together with Cascading Style Sheets.


XHTML …

XHTML 1.0 Transitional

<!DOCTYPE html

PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Use this when you need to take advantage of HTML's presentational features and when you want to support browsers that don't understand Cascading Style Sheets.

XHTML 1.0 Frameset

<!DOCTYPE html

PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

Use this when you want to use HTML Frames to partition the browser window into two or more frames.


Core XML

• XML aims at answering the conflicting demands that arrived at the W3C for the future of HTML.

• On one hand, some applications need more tags, and these tags are increasingly specialized. For example, businessmen want tags for price and product reference. Mathematicians want tags for their formulas. Chemists also want tags for formulas, but they are not the same.

• On the other hand, other applications want a simple language.

• The W3C essentially made two changes to HTML:
– It predefines no tags.
– It is stricter.

Changes to HTML

• No Predefined Tags
– Because there are no predefined tags in XML, you, the author, create the tags that you need.

<price currency="Rs">499.50</price>

<toc xlink:href="/newsletter">ABC Co. </toc>

• The <price> tag has no equivalent in HTML

• <toc> tag can be simulated through a combination of table, hyperlink, and bold:

<table>

<tr> <td></td>

<td><a href="/newsletter"><b>ABC Co. </b></a></td> </tr>

</table>


Changes to HTML (2)

• The above code represents the extensible aspect of XML (the X in XML).

• XML is extensible because it predefines no tags but lets the author create the tags needed for his or her application.

• But this opens many questions such as the following:
– How does the browser know that <toc> is equivalent to this combination of table, hyperlink, and bold?
– Can you compare different prices?
– What about the current and previous generations of browsers?
– How does this simplify Web site maintenance?


Changes to HTML (3)

• Answers to the above problems:
– The browsers or the Web servers use style sheets.
– Prices can be compared (using an API: DOM or SAX).
– XML can be made compatible with any browser.
– XML enables you to concentrate on more stable aspects of your document.


Changes to HTML (4)

• Stricter Syntax
– HTML has a forgiving syntax.
– It was decided that XML would adopt a strict syntax.
– A strict syntax results in smaller, faster, and lighter browsers.

• HTML
– <p>Welcome to our site!<img src=logo.jpg>

• XML
– <p>Welcome to our site!<img src="logo.jpg"/></p>

• The image tag uses a special form reserved for so-called empty elements.


Document Structure

INTERNAL MEMO

From: John Doe

To: Jack Smith

Regarding: XML at WhizBang

Have you heard of this new technology, XML? It looks promising. It is similar to HTML but it is extensible. All the big names (Microsoft, IBM, Oracle, Sun) are backing it.

We could use XML to launch new e-commerce services. It is also useful for the web site: you complained it was a lot of work, apparently XML can simplify the maintenance.

Check this web site <http://www.w3.org/XML> for more information. Also visit Que <http://www.quepublishing.com>. They have just released "XML by Example, 2nd Edition" by Benoît Marchal <http://www.marchal.com> with lots of useful information and some great examples. I have already ordered two copies!

John


Document Structure (2)

• The memo is made of at least three distinct elements:
– The title
– The header, including sender and recipient names as well as the subject
– The body text

• These elements are organized in relation to each other, following a structure. For example, the title indicates that this is a memo. The title is followed by the header.

• Body text itself can be further broken down this way:
– Three paragraphs
– Several URLs
– A signature


Document Structure (3)

• This decomposition process can be continued to recognize smaller elements such as sentences, words, or even characters.

• However, these smaller elements usually add little information on the structure of the document.

• The above structure is independent from the appearance of the memo.


Document Structure (4)


Document Structure (6)

• So what is the relationship between structure and appearance?

• Ideally, a text is formatted to expose its structure to the reader.

• Remember TeX?

• The key to understanding XML is that the structure of a document is the foundation from which the appearance is deduced.

• Most file formats concentrate on the actual appearance of a document (they take great pains to ensure almost identical display on various platforms).

• XML uses a different approach and records the structure of documents, from which the formatting is automatically deduced.


Document Structure (7)

% memo.tex
\nopagenumbers
\noindent John Doe\par
\noindent Jack Smith\par
\noindent XML at WhizBang\par
\smallskip
Have you heard of this new technology, XML? It looks promising. It is similar to HTML but it is extensible. All the big names (Microsoft, IBM, Oracle, Sun) are backing it.\par

We could use XML to launch new e-commerce services. It is also useful for the web site: you complained it was a lot of work, apparently XML can simplify the maintenance.\par

Check this web site {\url http://www.w3.org/XML} for more information. Also visit Que {\url http://www.quepublishing.com} . They have just released "XML by Example, 2nd Edition" by Benoît Marchal {\url http://www.marchal.com} with lots of useful information and some great examples. I have already ordered two copies!\par

John\par
\bye


Document Structure (8)

• Mark-up originates in the publishing industry. In traditional publishing, the manuscript is annotated with layout instructions for the typesetter. These handwritten annotations are called mark-up.

• TeX represents what is known as generic coding of text documents.

• This has the following benefits:
– It achieves higher portability and is more flexible. To change the appearance of the document, it suffices to adapt the macro. By editing one macro, the change is automatically propagated throughout the document. In particular, it does not require re-encoding the markup, which is a time-consuming and error-prone activity.
– The markup is closer to describing the structure.



• HTML does not enforce a strict structure; in fact, HTML enforces very little structure.

• Although it is based on the structure-rich SGML, HTML has few options for organizing data.

• Adding the class attribute and style sheets to HTML turned it into a generic coding language.


Document Structure - SGML

<!DOCTYPE memo [
<!ELEMENT memo - - (header,body)>
<!ELEMENT header - O ((from & to) & subject?)>
<!ELEMENT body - O (para*, signature)>
<!ELEMENT from - O (#PCDATA)>
<!ELEMENT to - O (#PCDATA)>
<!ELEMENT subject - O (#PCDATA)>
<!ELEMENT para - O ((#PCDATA | link)*)>
<!ELEMENT link - - (#PCDATA)>
<!ATTLIST link url CDATA #REQUIRED>
<!ELEMENT signature - O (#PCDATA)>
]>
<memo>
<header>
<from>John Doe
<to>Jack Smith


Document Structure – SGML (2)

<subject>XML at WhizBang
<body>
<para>Have you heard of this new technology XML? It looks promising. It is similar to HTML but it is extensible. All the big names (Microsoft, IBM, Oracle, Sun) are backing it.
<para>We could use XML to launch new e-commerce services. It is also useful for the web site: you complained it was a lot of work, apparently XML can simplify the maintenance.
<para>Check <link url="http://www.w3.org/XML">this web site</link> for more information. Also visit <link url="http://www.quepublishing.com">Que</link>. They have just released "XML by Example, 2nd Edition" by <link url="http://www.marchal.com">Benoît Marchal</link> with lots of useful information and some great examples. I have already ordered two copies!
<signature>John
</memo>


Document Structure - HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head><title>WhizBang Memo: XML at WhizBang</title></head>
<body>
<table bgcolor="lightgrey" border="1" width="70%"><tr><td>
<table>
<tr><td colspan="2"><font size="+2" face="Garamond"><b>XML at WhizBang</b></font></td></tr>
<tr><td><font face="Garamond">From:</font></td><td><font face="Garamond">John Doe</font></td></tr>
<tr><td><font face="Garamond">To:</font></td><td><font face="Garamond">Jack Smith</font></td></tr>
</table>
</td></tr></table>


Document Structure – HTML (2)

<p><font face="Garamond">Have you heard of this new technology, XML? It looks promising. It is similar to HTML but it is extensible. All the big names (Microsoft, IBM, Oracle, Sun) are backing it.</font></p>
<p><font face="Garamond">We could use XML to launch new e-commerce services. It is also useful for the web site: you complained it was a lot of work, apparently XML can simplify the maintenance.</font></p>
<p><font face="Garamond">Check <a href="http://www.w3.org/XML"> this web site</a> for more information. Also visit <a href="http://www.quepublishing.com">Que</a>. They have just released "XML by Example, 2nd Edition" by <a href="http://www.marchal.com">Benoît Marchal</a> with lots of useful information and some great examples. I have already ordered two copies!</font></p>
<p><font face="Lucida Handwriting"><i>John</i></font></p>
</body>
</html>


Document Structure – HTML with CSS

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>WhizBang Memo: XML at WhizBang</title>
<style>
.header { background-color: lightgrey; }
.subject { font-family: Garamond; font-weight: bold; font-size: larger; }
.to, .from { font-family: Garamond; }
.para { font-family: Garamond; }
.signature { font-family: "Lucida Handwriting"; font-style: italic; }
</style>
</head>
<body>
<table class="header" border="1" width="70%"><tr><td>
<table>
<tr><td colspan="2" class="subject">XML at WhizBang</td></tr>
<tr> <td class="from">From:</td> <td class="from">John Doe</td> </tr>
<tr> <td class="to">To:</td> <td class="to">Jack Smith</td> </tr>
</table>
</td></tr></table>


Document Structure – HTML with CSS (2)

<p class="para">Have you heard of this new technology, XML? It looks promising. It is similar to HTML but it is extensible. All the big names (Microsoft, IBM, Oracle, Sun) are backing it.</p>
<p class="para">We could use XML to launch new e-commerce services. It is also useful for the web site: you complained it was a lot of work, apparently XML can simplify the maintenance.</p>
<p class="para">Check <a href="http://www.w3.org/XML"> this web site</a> for more information. Also visit <a href="http://www.quepublishing.com">Que</a>. They have just released "XML by Example, 2nd Edition" by <a href="http://www.marchal.com">Benoît Marchal</a> with lots of useful information and some great examples. I have already ordered two copies!</p>
<p class="signature">John</p>
</body>
</html>


Document Structure – XML

<?xml version="1.0"?>
<memo>
<header>
<from>John Doe</from>
<to>Jack Smith</to>
<subject>XML at WhizBang</subject>
</header>
<body>
<para>Have you heard of this new technology, XML? It looks promising. It is similar to HTML but it is extensible. All the big names (Microsoft, IBM, Oracle, Sun) are backing it.</para>
<para>We could use XML to launch new e-commerce services. It is also useful for the web site: you complained it was a lot of work, apparently XML can simplify the maintenance.</para>
<para>Check <link url="http://www.w3.org/XML">this web site</link> for more information. Also visit <link url="http://www.quepublishing.com">Que</link>. They have just released "XML by Example, 2nd Edition" by <link url="http://www.marchal.com">Benoît Marchal</link> with lots of useful information and some great examples. I have already ordered two copies!</para>
<signature>John</signature>
</body>
</memo>


Applications of XML

• Main application areas:
– Document applications manipulate information primarily intended for human consumption.
– Data applications manipulate information primarily intended for software consumption.

• Document Applications
– The first application of XML would be document publishing. The main advantage of XML in this arena is that XML concentrates on the structure of the document, and this makes it independent of the delivery medium.


Applications – XML (2)


Applications – XML (3)

Data Applications

One of the original goals of SGML was to give document management access to software tools similar to those used to manage other datasets, such as databases.


Applications – XML (4)

The structure of a database in XML


Database Application

Identifier   Name           Price
P1           XML Editor     $499.00
P2           DTD Editor     $199.00
P3           XML Book       $29.99
P4           XML Training   $699.00


Database Application

<?xml version="1.0"?>

<products>

<product id="p1">

<name>XML Editor</name> <price>499.00</price>

</product>

<product id="p2">

<name>DTD Editor</name> <price>199.00</price>

</product>

<product id="p3">

<name>XML Book</name> <price>29.99</price>

</product>

<product id="p4">

<name>XML Training</name> <price>699.00</price>

</product>

</products>
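Once the product table is stored as XML, a program can read it back and compare prices. The sketch below uses Python's standard xml.etree.ElementTree parser on the product list above; the variable names (prices, cheapest, total) are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

PRODUCTS_XML = """<?xml version="1.0"?>
<products>
  <product id="p1"><name>XML Editor</name><price>499.00</price></product>
  <product id="p2"><name>DTD Editor</name><price>199.00</price></product>
  <product id="p3"><name>XML Book</name><price>29.99</price></product>
  <product id="p4"><name>XML Training</name><price>699.00</price></product>
</products>"""

root = ET.fromstring(PRODUCTS_XML)

# Map each product id to its price; Decimal avoids float rounding on money.
prices = {
    product.get("id"): Decimal(product.findtext("price"))
    for product in root.findall("product")
}

cheapest = min(prices, key=prices.get)
total = sum(prices.values())

print(cheapest)  # p3
print(total)     # 1426.99
```

The element names carry the meaning (price, name), so the program does not have to guess which table column holds which value.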


XML Namespace

• A namespace places elements within a global naming system.

• The concept of namespace is similar to the scope of variables in programming languages. If you declare an i variable in a function computeAverage(), the scope of i is the computeAverage() function.

• If another function, say computeMax(), also declares an i variable, there is no conflict. For the compiler, the two variables are different because they are defined in different functions. They have different scopes.

• Namespaces are somewhat similar. A namespace makes it possible to define elements specific to a given application of XML. If another application defines elements with the same name but in a different namespace, there is no conflict.


XML Namespace

<?xml version="1.0"?>
<xbe:list xmlns:html="http://www.w3.org/1999/xhtml"
          xmlns:xbe="http://www.psol.com/xbe2/listing1.9">
  <xbe:table>
    <xbe:name>persons</xbe:name>
    <xbe:column>first-name</xbe:column>
    <xbe:column>last-name</xbe:column>
  </xbe:table>
  <html:table>
    <html:tr><html:td>Sean</html:td><html:td>Dixon</html:td></html:tr>
    <html:tr><html:td>Todd</html:td><html:td>Green</html:td></html:tr>
    <html:tr><html:td>Benoit</html:td><html:td>Marchal</html:td></html:tr>
  </html:table>
</xbe:list>
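A parser keeps the two table elements apart because each belongs to a different namespace. The sketch below is one way to see this with Python's xml.etree.ElementTree; the document is a shortened copy of the listing above, and the NS prefix map is an assumption chosen to mirror the prefixes used in it.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<xbe:list xmlns:html="http://www.w3.org/1999/xhtml"
          xmlns:xbe="http://www.psol.com/xbe2/listing1.9">
  <xbe:table>
    <xbe:name>persons</xbe:name>
    <xbe:column>first-name</xbe:column>
    <xbe:column>last-name</xbe:column>
  </xbe:table>
  <html:table>
    <html:tr><html:td>Sean</html:td><html:td>Dixon</html:td></html:tr>
  </html:table>
</xbe:list>"""

# Prefixes are local shorthand; the namespace URI is what identifies an element.
NS = {
    "html": "http://www.w3.org/1999/xhtml",
    "xbe": "http://www.psol.com/xbe2/listing1.9",
}

root = ET.fromstring(DOC)

# Both queries ask for "table", but each matches only its own namespace.
xbe_tables = root.findall("xbe:table", NS)
html_tables = root.findall("html:table", NS)

columns = [c.text for c in root.findall("xbe:table/xbe:column", NS)]
first_row = [td.text for td in root.find("html:table/html:tr", NS)]

print(root.tag)   # {http://www.psol.com/xbe2/listing1.9}list
print(columns)    # ['first-name', 'last-name']
print(first_row)  # ['Sean', 'Dixon']
```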


XML Stylesheets

• XML is supported by two style sheet languages: XSL (Extensible Stylesheet Language) and CSS (Cascading Style Sheets).

• They specify how XML documents should be rendered onscreen, on paper, or in an editor.

• XSL is more powerful, but CSS is widely implemented


XML APIs : DOM & SAX

• DOM (Document Object Model) and SAX (Simple API for XML) are APIs to access XML documents.

• They allow applications to read XML documents without having to worry about the syntax.

• They are complementary: DOM is best suited for browsers and editors; SAX is best for all the rest.
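The difference between the two APIs is easiest to see side by side. The sketch below reads the same small document twice with Python's standard library: once with xml.dom.minidom, which builds the whole tree in memory, and once with xml.sax, which reports events as it reads. The CATALOG document and the NameCollector class are illustrative assumptions.

```python
import xml.sax
from xml.dom import minidom

CATALOG = (
    "<products>"
    "<product id='p1'><name>XML Editor</name></product>"
    "<product id='p2'><name>DTD Editor</name></product>"
    "</products>"
)

# DOM: parse everything first, then navigate the in-memory tree.
dom = minidom.parseString(CATALOG)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("name")]


# SAX: the parser calls back as it meets each tag; nothing is kept unless we keep it.
class NameCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


handler = NameCollector()
xml.sax.parseString(CATALOG.encode("utf-8"), handler)
sax_names = handler.names

print(dom_names)  # ['XML Editor', 'DTD Editor']
print(sax_names)  # ['XML Editor', 'DTD Editor']
```

DOM is convenient when the application needs to move around the document or modify it; SAX needs more code but does not hold the whole document in memory.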


XLink and XPointer

• XLink and XPointer are two parts of one standard currently under development to provide a mechanism to establish relationships and hyperlinks between documents.

<?xml version="1.0"?>

<resources xmlns:xlink="http://www.w3.org/1999/xlink">

<entry xlink:type="simple" xlink:show="replace" xlink:href="http://www.mcp.com">Que</entry>

<entry xlink:type="simple" xlink:show="replace" xlink:href="http://www.marchal.com">marchal.com</entry>

<entry xlink:type="simple" xlink:show="replace" xlink:href="http://www.informit.com">InformIT</entry>

<entry xlink:type="simple" xlink:show="replace" xlink:href="http://www.pineapplesoft.com/newsletter"> Pineapplesoft Link</entry>

</resources>


XML Software

• XML Browser
– An XML browser is used to view and print XML documents.

• XML Editors
– Programmer's editors, such as XML Spy (http://www.xmlspy.com/) or XML Pro (http://www.vervet.com/), let you manipulate the XML code directly. They are powerful, but you have to know XML to use them.
– WYSIWYG editors, such as XMetaL (http://www.xmetal.com/), simulate word processors. Tools in this category are ideal for end users who may not be familiar with XML (and may not want to be).


XML Software (2)

• XML Editors …
– The tabular view of XML Spy makes the structure of the document apparent. It shows clearly how elements nest.
– In contrast, XMetaL hides the XML code entirely. XMetaL is ideal for markup-challenged users who want to concentrate on writing and not on the markup.


XML Spy


XMetal


XML Software (3)

• XML Parsers
– An XML parser scans an XML document to identify its structure so that an application can process it.
– One of the most popular parsers is Apache's Xerces for Java, C++, and Perl (xml.apache.org).

• XSL Processor
– Publishing directly in XML can be a problem for readers, because not many browsers support XML fully.
– With XSL, it is possible to create classic HTML from XML documents, so that the result works with current and older browsers.
– Several XSL processors are available, and one of the most popular is Apache's Xalan (xml.apache.org).


XML Syntax

• XML is a set of standards to exchange and publish information in a structured manner.

• XML is a language used to describe and manipulate documents that follow a structure. XML documents are not limited to books, articles, or Web sites. They could be used with objects from a client/server application.

• XML defines a syntax or a file format that is useful for books, articles, client/server applications and more.

• This is possible because the XML format does not dictate or enforce a particular structure. It limits itself to rules that you can use to write a tree data structure on disk.


XML Syntax (2)

• An XML document is text. XML-wise, the document consists of character data and markup. Both are represented as text in the document.

John Doe

34 Fountain Square Plaza

Cincinnati, OH 45202

US

513-744-8889 (preferred)

513-744-7098

[email protected]

Jack Smith

513-744-3465

[email protected]

Never leave messages on his answering machine. Email instead.

XML Syntax (3)

<?xml version="1.0"?>
<address-book>
  <entry>
    <name>John Doe</name>
    <address>
      <street>34 Fountain Square Plaza</street>
      <region>OH</region>
      <postal-code>45202</postal-code>
      <locality>Cincinnati</locality>
      <country>US</country>
    </address>
    <tel preferred="true">513-744-8889</tel>
    <tel>513-744-7098</tel>
    <email href="mailto:[email protected]"/>
  </entry>
  <entry>
    <name>Jack Smith</name>
    <tel>513-744-3465</tel>
    <email href="mailto:[email protected]"/>
    <comments>Never leave messages on his answering machine. <b>Email instead.</b></comments>
  </entry>
</address-book>


XML Syntax

XML document describing a person

<person>

<name>

<first_name>Alan</first_name> <last_name>Turing</last_name>

</name>

<profession>computer scientist</profession> <profession>mathematician</profession> <profession>cryptographer</profession>

</person>


XML Syntax

• The person element is called the parent of the name element and the three profession elements. The name element is the parent of the first_name and last_name elements. The name element and the three profession elements are sometimes called each other's siblings. The first_name and last_name elements are also siblings.

• XML gives each child exactly one parent, not two or more. Each element (except the root element) has exactly one parent element
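The parent, child, and sibling relationships can be read straight off a parsed tree. The sketch below uses Python's xml.etree.ElementTree on the person document above; because ElementTree elements do not store a pointer to their parent, the parent_of dictionary is built by hand, and its name is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

PERSON = """<person>
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>"""

root = ET.fromstring(PERSON)

# The children of person, in document order; these elements are siblings.
children_of_person = [child.tag for child in root]

# Map every element to its single parent. The root never appears as a key.
parent_of = {child: parent for parent in root.iter() for child in parent}

first_name = root.find("name/first_name")

print(children_of_person)         # ['name', 'profession', 'profession', 'profession']
print(parent_of[first_name].tag)  # name
print(root in parent_of)          # False
```

A dictionary from child to parent works only because each element has exactly one parent, which is the rule stated above.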


XML Syntax


XML Syntax (4)

• The plain text format and XML format carry exactly the same information. Yet, because plain text has no markup, there is no structure information.

• Element's Start and End Tags
– The building block of XML is the element. Each element has a name and a content.
– <tel>513-744-7098</tel>
– The content of an element is delimited by special markup known as a start tag and an end tag. The tagging mechanism is similar to HTML, which is logical because both HTML and XML inherited their tagging mechanism from SGML.


XML Syntax (5)

• XML limits itself to defining what an element is and how to mark up an element with tags.

• It provides a syntax to store information according to a structure but, unlike HTML, it does not define what the structure is.

• Names in XML
– Element names must follow certain rules. Specifically, they must start with either a letter or the underscore character ("_"). The rest of the name consists of letters, digits, the underscore character, the dot (".") or a hyphen ("-"). Spaces are not allowed in names.
– Names cannot start with the string "xml", which is reserved for the XML specification itself.
– The colon (":") is reserved for namespaces.


XML Syntax (6)

• Valid XML names:
– <copyright-information> <p> <base64> <décompte.client> <firstname>

• The following are examples of invalid element names. You could not use these names in XML:
– <123> <first name> <tom&jerry>

• Unlike HTML, names are case sensitive in XML. So, the following names are all different:
– <address> <ADDRESS> <Address>
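The naming rules listed on the previous slides can be approximated with a regular expression. The sketch below is a simplification, not the full grammar of the XML specification: it relies on Python's Unicode-aware \w class, which is close to but not identical with the characters XML allows, and the function name is_valid_name is an assumption.

```python
import re

# First character: a letter or underscore (a word character that is not a digit).
# Remaining characters: letters, digits, underscore, dot, or hyphen.
_NAME = re.compile(r"[^\W\d][\w.\-]*")


def is_valid_name(name: str) -> bool:
    """Approximate check of the element-name rules described in the slides."""
    if name.lower().startswith("xml"):
        return False  # names starting with "xml" are reserved
    return _NAME.fullmatch(name) is not None


valid = ["copyright-information", "p", "base64", "décompte.client", "firstname"]
invalid = ["123", "first name", "tom&jerry"]

print([is_valid_name(n) for n in valid])    # [True, True, True, True, True]
print([is_valid_name(n) for n in invalid])  # [False, False, False]
```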


XML Syntax (7)

Attributes

• It is possible to attach additional information to elements in the form of attributes. Attributes have a name and a value. The names follow the same rules as element names.

• Again, the syntax is similar to HTML. Elements can have zero, one, or more attributes in the start tag. The name of the attribute is separated from the value by the equal character. The value of the attribute is enclosed in double or single quotation marks.

• For example, the tel element can have a preferred attribute (for example, to indicate which phone number you should try first):


XML Syntax (8)• <tel preferred="true">513-744-8889</tel> Unlike

HTML, XML insists on the quotation marks. • An XML parser would reject the following:• <tel preferred=true>513-744-8889</tel> • Quotation marks can be either single or double

quotes. This is convenient if you need to insert single or double quotes in an attribute value.

• <confidentiality level="I don't know"> This document is not confidential. </confidentiality> or

• <confidentiality level='approved "for your eyes only"'> This document is top-secret </confidentiality>
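The quoting rules can be confirmed with a parser. The sketch below feeds the examples above to Python's xml.etree.ElementTree; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Quoted attribute value: accepted.
quoted = ET.fromstring('<tel preferred="true">513-744-8889</tel>')
preferred = quoted.get("preferred")

# Unquoted attribute value: the parser refuses the document.
try:
    ET.fromstring("<tel preferred=true>513-744-8889</tel>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

# Double quotes outside allow a single quote inside, and the other way round.
double = ET.fromstring(
    """<confidentiality level="I don't know">This document is not confidential.</confidentiality>"""
)
single = ET.fromstring(
    """<confidentiality level='approved "for your eyes only"'>This document is top-secret</confidentiality>"""
)

print(preferred)            # true
print(unquoted_accepted)    # False
print(double.get("level"))  # I don't know
print(single.get("level"))  # approved "for your eyes only"
```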


XML Syntax (8)

Empty Element

• Elements that have no content are known as empty elements. Usually (although it is not required), they have attributes.

• There is a shorthand notation for empty elements: the start and end tags merge, and the slash from the end tag is added at the end of the opening tag.

• For XML, the following two empty elements are identical:

<email href="mailto:[email protected]"/>

<email href="mailto:[email protected]"></email>
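That the two notations are equivalent can be checked by parsing both and comparing the results. In the sketch below, the address someone@example.com is a hypothetical placeholder, since the address in the slide is not legible; the parser is Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# someone@example.com is a made-up address used only for illustration.
short_form = ET.fromstring('<email href="mailto:someone@example.com"/>')
long_form = ET.fromstring('<email href="mailto:someone@example.com"></email>')

same_tag = short_form.tag == long_form.tag
same_attributes = short_form.attrib == long_form.attrib
both_empty = len(short_form) == 0 and len(long_form) == 0

# Serializing both elements gives the same bytes.
same_serialization = ET.tostring(short_form) == ET.tostring(long_form)

print(same_tag, same_attributes, both_empty, same_serialization)
```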


XML Syntax (9)

Nesting of Elements

• Elements can contain text (name), other elements (entry), or a combination of text and elements (comments).

• The underlying data structure for XML document is the tree of elements.

• The depth of the tree has no limit, and elements can repeat.

• An element that is enclosed in another element is called a child. The element it is enclosed into is its parent.

• Each child has only one parent.


XML Syntax (10)

<entry>
<name>Jack Smith</name>
<tel>513-744-3465</tel>
<email href="mailto:[email protected]"/>
<comments>Never leave messages on his answering machine. <b>Email instead.</b></comments>
</entry>

The entry element above has four children: name, tel, email, and comments

Root

At the root of the document there must be one and only one element. In other words, all the elements in the document must be the children of a single element.


XML Syntax


XML Syntax (11)

<?xml version="1.0"?>
<entry>
<name>John Doe</name>
<email href="mailto:[email protected]"/>
</entry>
<entry>
<name>Jack Smith</name>
<email href="mailto:[email protected]"/>
</entry>

The document above is not well-formed because it has two top-level entry elements. It is easy to fix by introducing a new root element, such as address-book, that encloses both entries.
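The single-root rule and its fix can be demonstrated with a parser. The sketch below uses Python's xml.etree.ElementTree; the email elements are left out for brevity, and the helper name parses is an assumption.

```python
import xml.etree.ElementTree as ET

# Two entry elements at the top level: no single root.
TWO_ROOTS = """<?xml version="1.0"?>
<entry><name>John Doe</name></entry>
<entry><name>Jack Smith</name></entry>"""

# The same entries wrapped in one address-book root element.
ONE_ROOT = """<?xml version="1.0"?>
<address-book>
<entry><name>John Doe</name></entry>
<entry><name>Jack Smith</name></entry>
</address-book>"""


def parses(document: str) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


names = [entry.findtext("name") for entry in ET.fromstring(ONE_ROOT)]

print(parses(TWO_ROOTS))  # False
print(parses(ONE_ROOT))   # True
print(names)              # ['John Doe', 'Jack Smith']
```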


XML Syntax (12)

• XML Declaration
– The XML declaration is the first line of the document. The declaration identifies the document as an XML document. The declaration also lists the version of XML used in the document. For the time being, it's 1.0.
– <?xml version="1.0"?>
– An XML parser can reject documents with another version number.
– The declaration can contain other attributes to support special features such as character-set encoding.

• The XML declaration is optional. When a second version of XML comes, the XML declaration would most probably become mandatory.

• If the declaration is included, however, it must start on the first character of the first line of the document. The XML recommendation suggests you include the declaration in every XML document.


XML Syntax (13)

• The two major differences between HTML and XML are

– XML does not define elements but it provides a mechanism to create your own. With HTML, the W3C had defined elements for paragraphs (<p>), bold (<b>), section titles (<h1>-<h6>) and more. In XML, it's up to you, the author of the document, to create meaningful elements.

– XML is very strict. For example, every element must have a start and end tag (unless they are empty elements, but then they must follow a special rule).

XML Syntax (14)

• Comments

– To insert comments in a document, enclose them between "<!--" and "-->". Comments are intended for the human reader and the XML parser ignores them.

• Unicode

– Characters in XML documents follow the Unicode standard. Unicode is a major extension to the familiar ASCII character set. It is published by the Unicode Consortium (http://www.unicode.org/). The same standard is published by the ISO as ISO/IEC 10646.

– Unicode supports all spoken languages (on Earth) as well as mathematical and other symbols. It supports English, Western European languages, Cyrillic, Japanese, Chinese, and so on.

– To accommodate all those characters, Unicode needs more bits per character than Latin-1. In the UTF-16 encoding, most characters take 16 bits, twice as much as their Latin-1 counterparts; that's the price to pay for international support. UTF-8, the other encoding that every XML parser must support, stores ASCII characters in 8 bits.


XML Syntax (18)

• A document written in Latin-1 needs the following XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>

<entrée>

<nom>José Dupont</nom>

<email href="mailto:[email protected]"/>

</entrée>
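The effect of the encoding declaration can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. The document is a shortened version of the one above (the email element is omitted), and the size comparison at the end is an added illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET

document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<entrée><nom>José Dupont</nom></entrée>"
)
# The bytes as they would be stored in a Latin-1 file.
latin1_bytes = document.encode("iso-8859-1")

# The parser reads the declaration and decodes the accented letters correctly.
root = ET.fromstring(latin1_bytes)
nom = root.find("nom").text

# Size in bytes of the same four-letter name in three encodings.
sizes = {enc: len("José".encode(enc)) for enc in ("iso-8859-1", "utf-8", "utf-16-le")}
print(root.tag, nom, sizes)
```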


XML Syntax (19)

• Entities

– A simple document is complete and can be stored in just one file. Complex documents are often split among several files: the text, the accompanying graphics, and so on.

– XML, however, does not reason in terms of files. Instead, it organizes documents physically in entities. In some cases, entities are equivalent to files; in others they are not.

– Entities are inserted in the document through entity references. An entity reference is the name of the entity between an ampersand character and a semicolon.

– The XML parser replaces the entity reference with its value. If we assume we have defined an entity "us" with the value "United States", the following two lines are strictly equivalent:

• <country>&us;</country>

• <country>United States</country>
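A minimal sketch of entity replacement, assuming Python's standard xml.etree.ElementTree parser. The slides do not show how the entity "us" is defined; here it is assumed to be declared in the document's internal DTD subset with an <!ENTITY> declaration.

```python
import xml.etree.ElementTree as ET

# Assumption: the entity is declared inside the document itself.
document = """<?xml version="1.0"?>
<!DOCTYPE country [
  <!ENTITY us "United States">
]>
<country>&us;</country>"""

with_reference = ET.fromstring(document)
spelled_out = ET.fromstring("<country>United States</country>")

# The application sees the replacement text, never the reference.
print(with_reference.text == spelled_out.text)
```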


XML Syntax (20)

• XML predefines entities for its delimiters (angle brackets, quotes, and so on). These entities are used to escape the delimiters in element or attribute content. The predefined entities are

– &lt; left-angle bracket "<" must be escaped with &lt;

– &amp; ampersand "&" must be escaped with &amp;

– &gt; right-angle bracket ">" must be escaped with &gt; when it appears in the combination ]]> in content, because that sequence marks the end of a CDATA section (see the following CDATA section)

– &apos; single quote "'" can be escaped with &apos;, essentially in attribute values

– &quot; double quote (") can be escaped with &quot;, essentially in attribute values


XML Syntax (21)

• The following is not valid because the ampersand would confuse the XML processor:

– <company>Marks & Spencer</company>

Instead, it must be rewritten to escape the ampersand with the &amp; entity:

• <company>Marks &amp; Spencer</company>
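The difference between the raw and the escaped ampersand can be sketched as follows, assuming Python's standard library (xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for escaping). The helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def parses(text):
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw_ok = parses("<company>Marks & Spencer</company>")
escaped_ok = parses("<company>Marks &amp; Spencer</company>")

# escape() replaces &, < and > with the predefined entities.
escaped_name = escape("Marks & Spencer")
# After parsing, the application gets the plain ampersand back.
round_trip = ET.fromstring("<company>" + escaped_name + "</company>").text
print(raw_ok, escaped_ok, escaped_name, round_trip)
```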


XML Syntax (22)

• Special Attributes• XML defines two attributes

– xml:space: Like Web browsers, most XML applications discard duplicated spaces. Yet, sometimes spaces are meaningful. HTML has a special element (<PRE>) to preserve spaces. This attribute tells the application what to do with spaces. If set to preserve, the application should preserve all spaces. If set to default, the application can ignore duplicate spaces.

• The following example asks the application to preserve spaces in a listing element:


XML Syntax (23)

<listing xml:space="preserve">for(String line = reader.readLine();
    null != line;
    line = reader.readLine())
  writer.println(line);
</listing>

• xml:lang: It is often desirable to know in which language the content is written. This attribute records the language. For example:

<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
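How an application might read xml:lang can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. The wrapping text element is hypothetical; it is only there to give the two paragraphs a single root.

```python
import xml.etree.ElementTree as ET

# Namespace that the XML standard binds to the reserved "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"

doc = ET.fromstring(
    "<text>"
    '<p xml:lang="en-GB">What colour is it?</p>'
    '<p xml:lang="en-US">What color is it?</p>'
    "</text>"
)

# ElementTree exposes xml:lang under its full namespace name.
by_language = {p.get("{%s}lang" % XML_NS): p.text for p in doc.findall("p")}
print(by_language)
```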


XML Syntax (24)

• Processing Instructions

• Processing instructions (abbreviated PI) are a mechanism to insert non-XML statements, such as scripts, in the document.

• At first sight, the existence of processing instructions is at odds with the XML concept that structure comes first. As we saw in the first chapter, XML processing is derived from the structure of the document, not from instructions inserted in the document.

• That's the theory, at least. In practice, there are cases where it is simpler to insert instructions rather than define complex structures. Processing instructions are a concession to reality by the standard developers.

• The XML declaration has the same syntax as a processing instruction (strictly speaking, the XML specification defines it as a separate construct):

<?xml version="1.0" encoding="ISO-8859-1"?>


XML Syntax - CDATA

• Markup delimiters (left-angle bracket and ampersand) that appear in the content of an element must be escaped with an entity.

• For some applications, it is difficult to escape markup characters, if only because there are too many of them.

• Mathematical equations can use many left-angle brackets. It is difficult to include a scripting language in a document and to escape the angle brackets and ampersands.

• Also, it is difficult to include an XML document in an XML document.


XML Syntax

• CDATA (Character Data) sections were introduced for those cases.

• CDATA sections are delimited by "<![CDATA[" and "]]>".

• The XML parser ignores delimiters within the CDATA section, except for ]]> (which means it is not possible to include a CDATA section in another CDATA section).


XML Syntax

Example of CDATA:


<example>

<![CDATA[


<entry>

<name>John Doe</name>

<email href="mailto:[email protected]"/>

</entry>]]>

</example>
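What the application receives from a CDATA section can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. The document is a shortened version of the example above (the email element is omitted).

```python
import xml.etree.ElementTree as ET

document = """<example>
<![CDATA[
<entry>
<name>John Doe</name>
</entry>]]>
</example>"""

example = ET.fromstring(document)

# The markup inside the CDATA section is not parsed: example has no child
# elements, and the tags arrive as ordinary text.
child_count = len(example)
content = example.text
print(child_count)
print(content)
```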


XML and Semantic

• XML alone does not define the meaning (the semantic) of the document. The element names are meaningful only to humans. They are meaningless for the XML parser.

• The parser does not know (in case of the address book example) what a name is. And it does not know the difference between a name and an address, apart from the fact that an address element has more children than a name element

• The semantic of an XML document is provided by the application


Common Errors in XML

• Forgetting End Tags

– End tags are mandatory (except for empty elements). The XML processor would reject the following because street and country have no end tags:

<address>
  <street>34 Fountain Square Plaza
  <region>OH</region>
  <postal-code>45202</postal-code>
  <locality>Cincinnati</locality>
  <country>US
</address>

• Forgetting That XML Is Case Sensitive

– XML names are case sensitive. The following two elements are different for XML. The first one is a tel element whereas the second one is a TEL element:

– <tel>513-744-7098</tel>

– <TEL>513-744-7098</TEL>


Common Errors (2)

• Introducing Spaces in the Name of the Element

– It is incorrect to introduce spaces in the name of elements. The XML parser interprets spaces as the beginning of an attribute.

– The following example is not valid because address book has a space in it:

<address book>
  <entry>
    <name>John Doe</name>
    <email href="mailto:[email protected]"/>
  </entry>
</address book>

• Forgetting the Quotes for the Attribute Value

– Unlike HTML, XML forces you to quote attributes. The following is not acceptable:

– <tel preferred=true>513-744-8889</tel>
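These errors can be reproduced with any parser. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the helper name check and the shortened samples are illustrative, and the exact error messages depend on the parser.

```python
import xml.etree.ElementTree as ET

def check(text):
    """Return None if the text is well formed, otherwise the parser's message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as error:
        return str(error)

samples = {
    "missing end tags": "<address><street>34 Fountain Square Plaza<country>US</address>",
    "space in name": "<address book><entry/></address book>",
    "unquoted attribute": "<tel preferred=true>513-744-8889</tel>",
    "corrected": '<tel preferred="true">513-744-8889</tel>',
}

results = {label: check(text) for label, text in samples.items()}
for label, message in results.items():
    print(label, "->", message or "well formed")
```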


Publishing

• XML's roots are in publishing, so it's no wonder the standard is well adapted to publishing.

• The XML standard itself was published with XML.

• The main advantages of using XML for publishing are

– The capability to convert XML documents to different media: the Web, print, and more

– For large document sets, the ability to enforce a common structure that simplifies editing

– The emphasis on structure means that XML documents are better equipped to withstand the test of time, because structure is more stable than formatting (as anybody who publishes a Web site knows, fashion changes every year but the content need not be rewritten that often)


E-commerce

<?xml version="1.0"?>

<Order confirm="true">

<Date>2000-03-10</Date>

<Reference>AGL153</Reference>

<DeliverBy>2000-04-10</DeliverBy>

<Buyer>

<Name>Playfield Books</Name>

<Address>

<Street>34 Fountain Square Plaza</Street>

<Locality>Cincinnati</Locality>

<PostalCode>45202</PostalCode>

<Region>OH</Region>

<Country>US</Country>

</Address>

</Buyer>


E-commerce (2)

<Seller>
  <Name>Macmillan Publishing</Name>
  <Address>
    <Street>201 West 103RD Street</Street>
    <Locality>Indianapolis</Locality>
    <PostalCode>46290</PostalCode>
    <Region>IN</Region>
    <Country>US</Country>
  </Address>
</Seller>
<Lines>
  <Product>
    <Code type="ISBN">0789725045</Code>
    <Description>XML by Example</Description>
    <Quantity>15</Quantity>
    <Price>29.99</Price>
  </Product>
  <Product>
    <Code type="ISBN">0672320541</Code>
    <Description>Applied XML Solutions</Description>
    <Quantity>5</Quantity>
    <Price>44.99</Price>
  </Product>
</Lines>
</Order>


E-commerce (3)

• If the electronic documents are written in XML, the markup matches the structure of the document. E-commerce applications can scan the above order and recognize the product codes and the quantity ordered.

• This was the realm of EDI technologies (EDI stands for Electronic Data Interchange). The core of EDI is a major effort to standardize every commercial and administrative document (order, invoice, tax declaration, payment, catalog, and more).

• EDI, however, has traditionally focused on reducing costs. The idea was to replace the most human-intensive operations with computer systems.

• With XML and the Internet, the focus is not merely on reducing costs but increasingly on opening new markets
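How an application might scan the order can be sketched as follows, assuming Python's standard xml.etree.ElementTree module. Only the Lines part of the order is reproduced, and computing a total is an illustration added here, not something the slides describe.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

order_lines = """<Lines>
  <Product>
    <Code type="ISBN">0789725045</Code>
    <Description>XML by Example</Description>
    <Quantity>15</Quantity>
    <Price>29.99</Price>
  </Product>
  <Product>
    <Code type="ISBN">0672320541</Code>
    <Description>Applied XML Solutions</Description>
    <Quantity>5</Quantity>
    <Price>44.99</Price>
  </Product>
</Lines>"""

lines = ET.fromstring(order_lines)

# Because the markup names each piece of data, the fields can be picked by name.
items = [
    (p.findtext("Code"), int(p.findtext("Quantity")), Decimal(p.findtext("Price")))
    for p in lines.findall("Product")
]
# Decimal avoids binary floating-point rounding on prices.
total = sum(quantity * price for _, quantity, price in items)
print(items)
print(total)
```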


Namespaces in XML

• XML is extensible. So it says in the name: eXtensible Markup Language.

• The problem is that extensibility does not come free. Misused, it could be a source of problems.

• In a networked environment, such as the Web, extensibility must be managed to avoid conflicts.

• Namespaces are a solution to help manage XML extensibility.

• XML namespace is a mechanism to identify XML elements. It places the name of the elements in a more global context.


Namespaces (2)

• Look at the example in resource.xml

• In practice, however, documents are seldom standalone. In a collaborative environment such as the Web, people build on one another's work. Somebody might take your list and rate it – look at example ratings.xml

• This is the same document with one new element: rating. It is often desirable to extend documents to convey new information instead of designing new ones from scratch.

• Problems occur, however, if extensions are not properly managed. Suppose somebody else decides to rate the list but, instead of rating quality, rates it against family criteria.


Namespaces (3)

• Look at pgratings.xml

• This is problematic. pgratings also is an extension to resource but it creates incompatibilities between ratings and pgratings, because both introduce a rating element.

• This is a very common problem: Two groups extend the same document in incompatible ways.

• Things get really out of hand when trying to combine both ratings in a listing.

• When building a portal, you want to present the visitor with both quality rating and parental guidance.

• The result would look like combinedratings, in which the conflict between the two rating elements is obvious.


Namespaces (4)

• The solution to above conflict is obvious: Use different element names for each concept.

• In combinedratings, we have two concepts: quality rating and parental guidance. They should have different tags.

• prefixratings renames the "quality" element as qa-rating and the "parental" element as pa-rating


Namespaces (5)

• Can the above be a perfect solution?

• No! Coming up with prefixes is possible only if we are aware of the conflict in advance.

• Look at nsratings.xml - it uses namespaces to prevent naming clashes

• The major difference is the form of the names. In nsratings, a colon separates the name from its prefix:

• <qa:rating>5 stars</qa:rating>


Namespaces (6)

• The prefix unambiguously identifies the type of rating within this document.

• However, prefixes alone do not solve problems because anybody can create prefixes.

• Therefore, different people can create incompatible prefixes and you are back to square one, except that you have moved the risk of conflicts from element names to prefixes.

• To avoid conflicts in prefixes, prefixes are declared:

<bookmarks xmlns:pg="http://www.playfield.com/parental/en/1.0"
           xmlns:qa="http://www.writeit.com/quality"
           xmlns="http://www.pineapplesoft.com/2001/bookmark">

• The declaration associates a URI (Uniform Resource Identifier) with a prefix. This is the crux of the namespaces proposal because URIs, unlike element names or prefixes, can be made unique.


Namespaces (7)

• A namespace declaration is introduced in an attribute, starting with xmlns followed by the prefix (note that, for the declaration, the prefix comes at the end of the attribute; when used, the prefix comes first). In nsratings, two prefixes are declared: qa and pg.

• The attribute xmlns, without a following prefix, declares the default namespace, that is, the namespace for those elements that have no prefix. In nsratings, a default namespace is also declared.

• A namespace is valid for the element on which it is declared and its content (including elements contained within the element), unless overridden by another namespace declaration with the same prefix.


Namespaces (8)

• In summary, XML namespaces is a mechanism to unambiguously identify who has developed which element. It's not much, but it is an essential service.

• The Namespace Name

– The namespace name is the URI, not the prefix. In other words, when comparing two elements, the parser uses the URIs, not the prefixes, to recognize their namespaces.
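That the URI and not the prefix identifies the namespace can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser, which reports element names as {URI}local-name. The second prefix, quality, is hypothetical.

```python
import xml.etree.ElementTree as ET

QUALITY = "http://www.writeit.com/quality"  # URI from the declaration shown earlier

# Same URI, two different prefixes.
first = ET.fromstring('<qa:rating xmlns:qa="%s">5 stars</qa:rating>' % QUALITY)
second = ET.fromstring('<quality:rating xmlns:quality="%s">5 stars</quality:rating>' % QUALITY)

# The parser discards the prefixes and keeps the URI in the element name.
print(first.tag)
print(first.tag == second.tag)
```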


Namespaces (9)

• The namespace declaration associates a global name (the URI) with the name of the element.

• First and foremost, the URI is only used as an identifier. As far as XML namespaces are concerned, it need not be valid!

• Why ? You must be able to process XML documents without a connection to the Internet.

• Example: In electronic commerce, some XML applications run on secured computers that are not connected to the Internet. It would be difficult to process XML namespaces if they had to resolve URIs.


Namespaces (10)

• Solution: Use URIs to guarantee uniqueness through domain names, but place no restrictions on the URIs.

• In particular, the URIs need not be valid. They do not need to point to a resource.

• Because URIs need not be valid, XML namespaces treat them as strings. In particular, comparisons are done character by character. According to this definition, the following two URIs are not identical, even though they point to the same document:

– http://www.marchal.com

– http://marchal.com
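The character-by-character comparison can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. The prefix m and the element name doc are hypothetical. No network access takes place; the URIs are never resolved.

```python
import xml.etree.ElementTree as ET

with_www = ET.fromstring('<m:doc xmlns:m="http://www.marchal.com"/>')
without_www = ET.fromstring('<m:doc xmlns:m="http://marchal.com"/>')

# The two URIs differ as strings, so the two elements are in different namespaces.
same_namespace = with_www.tag == without_www.tag
print(with_www.tag, without_www.tag, same_namespace)
```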


Namespaces (11)• Scoping

– The namespace is valid for the element where it is declared and all the elements within its content, as illustrated in scoping.xml. In programming circles, this is referred to as scoping

– There are three namespaces declared in scopings.xml. bk is declared on the top-level element and is therefore valid for all the elements. ns is declared twice for the two rating elements, but with different URIs (corresponding to different namespaces).

• the attributes are not associated with any namespace but, as sponsored.xml illustrates, they could be.


Namespaces (12)

• Digital Signature: an example of namespaces (look at signed.xml)

• Signature and data are identified by their namespace


XML Models

• XML models refer to mechanisms that describe the structure of a document.

• The two mechanisms are

– the DTD, short for Document Type Definition, and

– XML Schema.

• DTDs and XML Schemas ultimately serve the same objective. Both describe the structure of XML documents.

• Both are used to validate documents against their models


DTD

• The DTD dates back to SGML.

• It is a proven solution and it is easy to use.

• DTDs were found lacking on three issues:

– DTDs are based on 20-year-old modeling concepts. They have no support for modern design, such as object-oriented modeling.

– DTDs were designed for publishing. They are ill-suited to more recent applications of XML, in particular, data exchange and application integration.

– DTDs have their own syntax, which is incompatible with XML documents. Therefore, it is not possible to use XML tools.

• W3C has launched an effort to develop a replacement called XML Schema. Schemas support more modern modeling concepts, are better suited for data exchange and application integration, and, last but not least, are written as XML documents.


DTD (2) - Syntax

• the syntax for DTDs is different from the syntax of XML documents. Abook-dtd.xml is the address book introduced earlier but with one difference: It has a new <!DOCTYPE> statement that links the document file to its DTD.

• The <!DOCTYPE> statement is known as the Document Type Declaration (not to be confused with the DTD).


DTD (3) – Syntax ..

• The <!DOCTYPE> contains the root of the document (address-book) and the filename (or a URI) for the DTD itself (SYSTEM "abook-dtd.dtd").

• As abook-dtd.xml illustrates, if present, the document type declaration appears immediately after the XML declaration:

<!DOCTYPE address-book SYSTEM "abook-dtd.dtd">

• The DTD declares a list of elements but does not specify which one is the root. It's up to the document to select a root.


DTD (4) - Syntax

• Element Declaration

– The DTD uses a special syntax to declare every object (elements, attributes, and so on) that can appear in XML documents. Let's start with element declarations.

– Element declarations take the form of an <!ELEMENT> statement and contain the element name (entry) and its content model ((name,address*,tel*,fax*,email*,comments?)). The content model simply lists the possible children of the element:

– <!ELEMENT entry (name,address*,tel*,fax*,email*,comments?)>

• The plus ("+"), star ("*"), and question mark ("?") in the content model are known as occurrence indicators. They indicate whether and how elements repeat.


DTD (5) – Syntax ..

• An element followed by no occurrence indicator must appear once and only once.

• An element followed by a "+" character must appear one or several times. In other words, it can repeat.

• An element followed by a "*" character can appear zero or more times. The element is optional but, if it is included, it can repeat.

• An element followed by a "?" character can appear once or not at all. It indicates that the element is optional and, if included, cannot repeat.


DTD (6) - Syntax

• The content model for entry uses occurrence indicators.

• They enforce the repetitiveness of children: Except for name, the children are optional, and all but name and comments can appear several times in the document:

• <!ELEMENT entry (name,address*,tel*,fax*,email*,comments?)>


DTD (7) – Syntax ..

• The comma (",") and vertical bar ("|") characters are connectors. They indicate the order in which the children can appear:

– The "," character indicates that both elements (on the right and the left of the comma) must appear in the same order in the document.

– The "|" character indicates that only one of the two elements on the left or right of the vertical bar can appear in the document.

• parentheses can be used to group elements on the left and right of connectors.


DTD (8) – Syntax …

• If we were to change the declaration of entry into

• <!ELEMENT entry (name,(address* | tel* | fax* | email*),comments?)>

• only one of address, tel, fax or email could appear after the name. So, an entry could have several addresses or several phone numbers but not both.


DTD (9) – Syntax ..

• Keywords

– In addition to elements, the following keywords can appear in content models:

– #PCDATA means that the element can contain text. #PCDATA stands for parsed character data.

– EMPTY means that the element is an empty element.

– ANY means that the element can contain any element provided that it was declared elsewhere in the DTD. ANY is used mostly during the development of a DTD, until a more precise content model has been developed.

• In abook-dtd.dtd, tel is declared as text, whereas email is an empty element:

– <!ELEMENT tel (#PCDATA)>

– <!ELEMENT email EMPTY>

• CDATA sections can appear within #PCDATA as well. They need not be declared explicitly.


DTD (10) – Syntax …

• Mixed Content

– Element contents that include both elements and #PCDATA are said to be mixed content. Those that contain only elements are said to be element content. In abook-dtd.dtd, comments has mixed content:

• <!ELEMENT comments (#PCDATA | b)*>

• The elements and #PCDATA in mixed content must always be separated by a "|" and the whole model must always repeat.


DTD (11) – Syntax …

• Nonambiguous Model

– There's one additional rule: The content model must be deterministic or unambiguous.

– In plain English, it must be possible to validate a document by reading it one element at a time.

• The following content model is ambiguous:

<!ELEMENT cover ((title, author) | (title, subtitle))>

– <cover><title>XML by Example</title><author>Benoît Marchal</author></cover>

– It is not possible to decide whether the title element is part of (title, author) or of (title, subtitle) by looking at title only (one element at a time).

• It is often possible to remove the ambiguity, as in

<!ELEMENT cover (title, (author | subtitle))>


DTD (12) – Syntax …

• Attributes

– Attributes too must be declared in the DTD:

<!ATTLIST email href      CDATA          #REQUIRED
                preferred (true | false) "false">

– The declaration starts with the element name (email) followed by one or more attribute declarations. In this example, two attributes have been declared (href and preferred). The declaration includes their type (CDATA or (true | false)) and a default value (#REQUIRED or "false").

– Attribute declarations can appear anywhere in the DTD. For readability, it is best to list attributes immediately after their corresponding element.


DTD (13) – Syntax …

• The DTD provides more control over attributes than over elements. Attributes are broadly divided into three categories:

– String attributes contain text, for example: <!ATTLIST email href CDATA #REQUIRED>

– Tokenized attributes limit the content of the attribute, for example: <!ATTLIST entry id ID #IMPLIED>

– Enumerated-type attributes list the acceptable values, for example: <!ATTLIST entry preferred (true | false) "false">

• The DTD predates XML namespaces and, therefore, does not recognize them. If your document uses namespaces, you need to declare the xmlns attributes and the element prefixes explicitly, as in

<!ELEMENT xbe2:name (#PCDATA)>
<!ATTLIST xbe2:name xmlns:xbe2 CDATA #FIXED "http://www.psol.com/xbe2/listing4.2">


DTD (14)

• Relationship Between the DTD and the Document

– The DTD specifies which elements are allowed where in the document.

– The document in abook-dtd.xml is valid because it respects its DTD. Practically, it means that, among other things, the entry elements are enclosed in an address-book; that they each contain a name; and that the address, tel, and email appear in the order specified in the DTD. Only the second entry has a comments element, but that is not a problem because comments is optional.

• Validating the Document

– To validate XML documents, you need a validating parser.


XML Schema

• Schemas improve DTDs by supporting more data types and XML namespaces and adopting the familiar syntax of XML documents for the model itself.

• The concept, however, remains the same: A schema describes XML documents so that parsers can validate them.

• One of the most visible differences between DTDs and XML Schemas is that schemas are regular XML documents. Unlike DTDs, they don't rely on a special syntax


XML Schema (2)

• Simple Type Definitions

– Schemas support simple and complex types.

– Simple types are atomic (string, integer, boolean, and more), whereas complex types aggregate simple types.

• Simple type definitions (written as simpleType elements) restrict or augment the built-in simple types. As the name implies, the restriction element limits the values of a simple type. The original type is referenced in the base attribute.


XML Schema (3)

• Complex Type Definitions

– Complex type definitions take the form of a complexType element.

– A complex type can be a sequence of elements, attributes, simple or complex content, and more.

• Simple and Complex Content

– Complex type definitions may contain simpleContent and complexContent.

• Mixed Content

– Mixed content is declared as a complex type with the mixed attribute.


XPath

• XPath is a non-XML language for identifying particular parts of XML documents.

• XPath lets you write expressions that refer to the first person element in a document, the seventh child element of the third person element, the ID attribute of the first person element whose contents are the string "Fred Jones", all xml-stylesheet processing instructions in the document's prolog, and so forth.

• XPath indicates nodes by position, relative position, type, content, and several other criteria.

• XSLT uses XPath expressions to match and select particular elements in the input document for copying into the output document or further processing.


Xpath (2)

• XPointer uses XPath expressions to identify the particular point in or part of an XML document to which an XLink links.

• The W3C XML Schema Language uses XPath expressions to define uniqueness and co-occurrence constraints.

• XForms relies on XPath to bind form controls to instance data, express constraints on user-entered values, and calculate values that depend on other values.

• XPath expressions can also represent numbers, strings, or Booleans

• This lets XSLT stylesheets carry out simple arithmetic for purposes such as numbering and cross-referencing figures, tables, and equations.


Xpath (3)

• String manipulation in XPath lets XSLT perform tasks such as making the title of a chapter uppercase in a headline or extracting the last two digits from a year.

• The Tree Structure of an XML Document

• An XML document is a tree made up of nodes. Some nodes contain one or more other nodes. There is exactly one root node, which ultimately contains all other nodes. XPath is a language for picking nodes and sets of nodes out of this tree.


Xpath (4)

• From the perspective of XPath, there are seven kinds of nodes:

– The root node

– Element nodes

– Text nodes

– Attribute nodes

– Comment nodes

– Processing-instruction nodes

– Namespace nodes


Xpath (5)

Xpath (6)

<?xml version="1.0"?>
<?xml-stylesheet type="application/xml" href="people.xsl"?>
<!DOCTYPE people [
  <!ATTLIST homepage xlink:type  CDATA #FIXED "simple"
                     xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
  <!ATTLIST person id ID #IMPLIED>
]>
<people>
  <person born="1912" died="1954" id="p342">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
    <homepage xlink:href="http://www.turing.org.uk/"/>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>


Xpath (7)

• Location Paths

– The most useful XPath expression is a location path.

– A location path identifies a set of nodes in a document.

– This set may be empty, may contain a single node, or may contain several nodes. These can be element nodes, attribute nodes, namespace nodes, text nodes, comment nodes, processing-instruction nodes, root nodes, or any combination of these.

– A location path is built out of successive location steps. Each location step is evaluated relative to a particular node in the document called the context node.

• The Root Location Path

– The simplest location path is the one that selects the root node of the document. This is simply the forward slash (/).

– / is an absolute location path because, no matter what the context node is, it always selects the same thing: the root node of the document.


Xpath (8)

• For example, this XSLT template rule uses the XPath pattern / to match the entire input document tree and wrap it in an html element:

<xsl:template match="/">
  <html><xsl:apply-templates/></html>
</xsl:template>

• Child Element Location Steps

– The second simplest location path is a single element name. This path selects all child elements of the context node with the specified name.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="people">
    <xsl:apply-templates select="person"/>
  </xsl:template>
  <xsl:template match="person">
    <xsl:value-of select="name"/>
  </xsl:template>
</xsl:stylesheet>


Xpath (9)

• In XSLT, the context node for an XPath expression used in the select attribute of xsl:apply-templates and similar elements is the node that is currently matched

• Attribute Location Steps

– Attributes are also part of XPath. To select a particular attribute of an element, use an @ sign followed by the name of the attribute you want.


Xpath (10)

• An XSLT stylesheet that uses root, child element, and attribute location steps:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <xsl:apply-templates select="people"/>
    </html>
  </xsl:template>
  <xsl:template match="people">
    <table>
      <xsl:apply-templates select="person"/>
    </table>
  </xsl:template>
  <xsl:template match="person">
    <tr>
      <td><xsl:value-of select="name"/></td>
      <td><xsl:value-of select="@born"/></td>
      <td><xsl:value-of select="@died"/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>


Xpath (11)

<html>
  <table>
    <tr>
      <td> Alan Turing </td>
      <td>1912</td>
      <td>1954</td>
    </tr>
    <tr>
      <td> Richard P Feynman </td>
      <td>1918</td>
      <td>1988</td>
    </tr>
  </table>
</html>
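The same child element and attribute steps can be sketched outside XSLT. The following assumes Python's standard xml.etree.ElementTree module, which supports only a subset of XPath; the people document is shortened, and collapsing whitespace in the name is a choice made here for readability.

```python
import xml.etree.ElementTree as ET

people_xml = """<people>
  <person born="1912" died="1954" id="p342">
    <name><first_name>Alan</first_name> <last_name>Turing</last_name></name>
    <profession>computer scientist</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name><first_name>Richard</first_name> <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name></name>
    <profession>physicist</profession>
  </person>
</people>"""

people = ET.fromstring(people_xml)
rows = []
for person in people.findall("person"):      # child element step: person
    name = person.find("name")               # child element step: name
    # All text inside name, with runs of whitespace collapsed.
    text = " ".join("".join(name.itertext()).split())
    # person.get(...) plays the role of the attribute steps @born and @died.
    rows.append((text, person.get("born"), person.get("died")))

for row in rows:
    print(row)
```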


Xpath (12)

The comment(), text(), and processing-instruction( ) Location Steps

• Although element, attribute, and root nodes account for 90% or more of what you need to do with XML documents, this still leaves four kinds of nodes that need to be addressed: namespace nodes, text nodes, processing-instruction nodes, and comment nodes. Namespace nodes are rarely addressed explicitly. The other three node types have special node tests to match them. These are as follows:

– comment( )

– text( )

– processing-instruction( )


Xpath (13)

• Since comments and text nodes don't have names, the comment( ) and text( ) node tests match any comment or text node that is an immediate child of the context node.

• Each comment is a separate comment node.

• Each text node contains the maximum possible contiguous run of text not interrupted by any tag.

• By default, XSLT stylesheets do process text nodes but do not process comment nodes. You can add a comment template rule to an XSLT stylesheet so it will process comments too.

• For example, this template rule replaces each comment with the text "Comment Deleted" in italic:

<xsl:template match="comment( )">
  <i>Comment Deleted</i>
</xsl:template>


Xpath (14)

Wildcards

• Wildcards match different element and node types at the same time. There are three of these: *, node( ), and @*.

• The asterisk (*) matches any element node regardless of name. For example, this XSLT template rule says that all elements should have their child elements processed but should not result in any output in and of themselves:

<xsl:template match="*">
  <xsl:apply-templates select="*"/>
</xsl:template>

• The * does not match attributes, text nodes, comments, or processing-instruction nodes.


Xpath (15)

• The node( ) wildcard matches not only all element types but also text nodes, processing-instruction nodes, namespace nodes, attribute nodes, and comment nodes.
• The @* wildcard matches all attribute nodes.
• For example, this XSLT template rule copies the values of all attributes of a person element in the document into the content of an attributes element in the output:

<xsl:template match="person">
  <attributes><xsl:apply-templates select="@*"/></attributes>
</xsl:template>


Xpath (16)

• Multiple Matches with |

<xsl:template match="first_name|last_name|profession|hobby">
  <xsl:value-of select="text( )"/>
</xsl:template>

• Compound Location Paths
– Location steps can be combined with a forward slash (/) to make a compound location path. Each step in the path is relative to the one that preceded it. If the path begins with /, then the first step in the path is relative to the root node. Otherwise, it's relative to the context node.
– For example, consider the XPath expression /people/person/name/first_name.
– This begins at the root node, then selects all people element children of the root node, then all person element children of those nodes, then all name children of those nodes, and finally all first_name children of those nodes.
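
The step-by-step reading of a compound location path can be tried out with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is only a sketch: the sample document is a hypothetical one modelled on the people/person examples, and because ElementTree evaluates paths relative to an element, the absolute path is written relative to the people root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the slides).
PEOPLE_XML = """
<people>
  <person born="1912" died="1954">
    <name><first_name>Alan</first_name><last_name>Turing</last_name></name>
  </person>
  <person born="1918" died="1988">
    <name><first_name>Richard</first_name><last_name>Feynman</last_name></name>
  </person>
</people>
"""

people = ET.fromstring(PEOPLE_XML)  # the root element, <people>

# /people/person/name/first_name, written relative to <people>:
# each step selects children of the nodes chosen by the previous step.
first_names = [e.text for e in people.findall("person/name/first_name")]
print(first_names)
```

Each additional step narrows the selection to children of the previous step's result, so removing the final first_name step would return the two name elements instead.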


Xpath (17)

• Selecting from Descendants with //

– A double forward slash (//) selects from all descendants of the context node, as well as the context node itself.

– At the beginning of an XPath expression, it selects from all descendants of the root node.

– For example, the XPath expression //name selects all name elements in the document. The expression //@id selects all the id attributes of any element in the document.

– The expression person//@id selects all the id attributes of any element contained in the person child elements of the context node, as well as the id attributes of the person elements themselves.
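
A rough illustration of // using Python's standard xml.etree.ElementTree, under two assumptions: the sample document is hypothetical, and since ElementTree returns elements rather than attribute nodes, the //@id cases are approximated by walking the tree and reading the id attribute from each element that has one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the slides).
DOC = """
<people id="all">
  <person id="p1">
    <name id="n1">Alan Turing</name>
  </person>
  <person id="p2">
    <name>Richard P Feynman</name>
  </person>
</people>
"""
root = ET.fromstring(DOC)

# //name : every name element at any depth below the root.
names = [e.text for e in root.findall(".//name")]

# //@id : approximated by visiting every element in document order
# (iter() includes the starting element) and collecting its id, if any.
all_ids = [e.get("id") for e in root.iter() if "id" in e.attrib]

# person//@id : ids on the person children themselves and on their descendants.
person_ids = [
    e.get("id")
    for person in root.findall("person")
    for e in person.iter()
    if "id" in e.attrib
]

print(names, all_ids, person_ids)
```

The id on the people element appears in all_ids but not in person_ids, because person//@id only looks at the person children of the context node and whatever they contain.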


Xpath (18)

• Selecting the Parent Element with ..
– A double period (..) indicates the parent of the current node.
– For example, the XPath expression //@id identifies all id attributes in the document. Therefore, //@id/.. identifies all elements in the document that have id attributes.

• Selecting the Context Node with .
– The single period (.) indicates the context node. In XSLT this is most commonly used when you need to take the value of the currently matched node. For example, this template rule copies the content of each comment in the input document to a span element in the output document:

<xsl:template match="comment( )">
  <span class="comment"><xsl:value-of select="."/></span>
</xsl:template>
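
The .. and . steps can also be tried outside XSLT. The sketch below uses Python's standard xml.etree.ElementTree on a hypothetical sample document. ElementTree cannot select attribute nodes, so //@id/.. is approximated with the predicate form .//*[@id], which picks descendant elements that carry an id attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the slides).
DOC = """
<people>
  <person id="p1"><name>Alan Turing</name></person>
  <person><name id="n2">Richard P Feynman</name></person>
</people>
"""
root = ET.fromstring(DOC)

# //name/.. : the parent element of every name element.
parent_tags = [p.tag for p in root.findall(".//name/..")]

# Approximation of //@id/.. : descendant elements that have an id attribute.
tags_with_id = [e.tag for e in root.findall(".//*[@id]")]

# . : the context node itself (here, the element the search starts from).
context_is_root = root.find(".") is root

print(parent_tags, tags_with_id, context_is_root)
```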


XLink

• XLinks are an attribute-based syntax for attaching links to XML documents.
• XLinks can be simple Point A-to-Point B links, like the links you're accustomed to from HTML's A element.
• XLinks can also be bidirectional, linking two documents in both directions so you can go from A to B or B to A.
• XLinks can even be multidirectional, presenting many different paths between any number of XML documents.

• The documents don't have to be XML documents. Links can be placed in an XML document that lists connections between other documents that may or may not be XML documents themselves.


Xlink (2)

• Simple Links
– A simple link defines a one-way connection between two resources.
– The source or starting resource of the connection is the link element itself.
– The target or ending resource of the connection is identified by a Uniform Resource Identifier (URI).
– The link goes from the starting resource to the ending resource.
– The starting resource is always an XML element.
– The ending resource may be an XML document, a particular element in an XML document, a group of elements in an XML document, a span of text in an XML document, or something that isn't a part of an XML document, such as an MPEG movie or a PDF file. The URI may be something other than a URL, for instance a book ISBN number like urn:isbn:1565922247.


Xlink (3)

<novel>
  <title>The Wonderful Wizard of Oz</title>
  <author>L. Frank Baum</author>
  <year>1900</year>
</novel>

• A simple XLink is encoded in an XML document as an element of arbitrary type that has an xlink:type attribute with the value simple and an xlink:href attribute whose value is the URI of the link target. The xlink prefix must be mapped to the http://www.w3.org/1999/xlink namespace URI.


Xlink (4)

<novel xmlns:xlink="http://www.w3.org/1999/xlink"
       xlink:type="simple"
       xlink:href="ftp://archive.org/pub/etext/etext93/wizoz10.txt">
  <title>The Wonderful Wizard of Oz</title>
  <author>L. Frank Baum</author>
  <year>1900</year>
</novel>

• This establishes a simple link from this novel element to the plain text file found at ftp://archive.org/pub/etext/etext93/wizoz10.txt

• Browsers are free to interpret this link as they like.
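
Since browsers are free to interpret the link, an application usually has to read the XLink attributes itself. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name simple_link_target is hypothetical, and the only thing it relies on is that the xlink prefix is bound to the http://www.w3.org/1999/xlink namespace.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

NOVEL = """
<novel xmlns:xlink="http://www.w3.org/1999/xlink"
       xlink:type="simple"
       xlink:href="ftp://archive.org/pub/etext/etext93/wizoz10.txt">
  <title>The Wonderful Wizard of Oz</title>
  <author>L. Frank Baum</author>
  <year>1900</year>
</novel>
"""

def simple_link_target(element):
    """Return the href of a simple XLink element, or None if it is not one.

    Hypothetical helper: ElementTree exposes namespaced attributes under
    the '{namespace-uri}local-name' form, whatever prefix the document used.
    """
    if element.get(f"{{{XLINK_NS}}}type") != "simple":
        return None
    return element.get(f"{{{XLINK_NS}}}href")

novel = ET.fromstring(NOVEL)
target = simple_link_target(novel)
print(target)
```

Looking the attributes up by namespace URI rather than by the literal prefix xlink means the code keeps working if a document binds a different prefix to the same namespace.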


Xlink (5)

• Every XLink element must have an xlink:type attribute telling you what kind of link (or part of a link) it is. This attribute has six possible values:
– simple
– extended
– locator
– arc
– title
– resource

• Simple XLinks are the only ones that are really similar to HTML links.


Xlink (6)

<novel xmlns:xlink="http://www.w3.org/1999/xlink"
       xlink:type="simple"
       xlink:href="urn:isbn:0688069444">
  <title>The Wonderful Wizard of Oz</title>
  <author>L. Frank Baum</author>
  <year>1900</year>
</novel>

• The xlink:href attribute identifies the resource being linked to.
• It always contains a URI.
• Both relative and absolute URLs can be used, as they are in HTML links. However, the URI need not be a URL.
• For example, the above link identifies but does not locate the print edition of The Wonderful Wizard of Oz with the ISBN number 0688069444.


XPointer

• XPointers are a non-XML syntax for identifying locations inside XML documents.

• An XPointer is attached to the end of the URI as its fragment identifier to indicate a particular part of an XML document rather than the entire document.

• In HTML, a named anchor and a fragment identifier play the same role:
– <a name="download"></a>
– http://java.sun.com:80/products/jndi/index.html#download


Xpointer (2)

• Named anchors in HTML have one major drawback: to link to a particular point of a particular document, you must be able to modify the document to which you're linking in order to insert a named anchor at the point to which you want to link.

• XPointer endeavors to eliminate this restriction by allowing you to specify where you want to link to using full XPath expressions as fragment identifiers.

• Furthermore, XPointer expands on XPath by providing operations to select particular points in or ranges of an XML document that do not necessarily coincide with any one node or set of nodes. For instance, an XPointer can describe the range of text currently selected by the mouse.


Xpointer (3)

• The most basic form of XPointer is simply an XPath expression (often, though not necessarily, a location path) enclosed in the parentheses of xpointer( ).

• For example, these are all acceptable XPointers:
– xpointer(/)
– xpointer(//first_name)
– xpointer(id('sec-intro'))
– xpointer(/people/person/name/first_name/text( ))
– xpointer(//middle_initial[position( )=1]/../first_name)
– xpointer(//profession[.="physicist"])
– xpointer(/child::people/child::person[@index<4000])
– xpointer(/child::people/child::person/attribute::id)


Xpointer (4)

• If you're uncertain whether a given XPointer will locate something, you can back it up with an alternative XPointer.
• For example, this XPointer looks first for first_name elements. However, if it doesn't find any, it looks for last_name elements instead:
• xpointer(//first_name)xpointer(//last_name)
• The last_name elements will be found only if there are no first_name elements. You can string as many of these XPointer parts together as you like.

• XPointers in Links
– If you wanted a URL that pointed to the first name element in the document at http://www.cafeconleche.org/people.xml, you would type:
– http://www.cafeconleche.org/people.xml#xpointer(//name[position( )=1])
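
To make the "fragment identifier" idea concrete, the sketch below separates a URL into the document address and the XPointer, then splits a chain of fallback parts. It uses only Python's standard urllib.parse. The function xpointer_parts is a hypothetical helper, and it is deliberately simplified: it matches parentheses only, so it does not handle parentheses inside string literals or whitespace between parts.

```python
from urllib.parse import urldefrag

def xpointer_parts(fragment):
    """Split 'xpointer(A)xpointer(B)' into ['A', 'B'] by matching parentheses.

    Simplified, hypothetical helper: parentheses inside quoted strings
    and whitespace between parts are not handled.
    """
    prefix = "xpointer("
    parts, i = [], 0
    while fragment.startswith(prefix, i):
        start = i + len(prefix)
        depth, j = 1, start
        while j < len(fragment) and depth > 0:
            if fragment[j] == "(":
                depth += 1
            elif fragment[j] == ")":
                depth -= 1
            j += 1
        if depth != 0:
            raise ValueError("unbalanced parentheses in XPointer")
        parts.append(fragment[start:j - 1])  # drop the closing parenthesis
        i = j
    return parts

url = "http://www.cafeconleche.org/people.xml#xpointer(//name[position()=1])"
document_url, fragment = urldefrag(url)
single = xpointer_parts(fragment)
fallback = xpointer_parts("xpointer(//first_name)xpointer(//last_name)")
print(document_url, single, fallback)
```

A consumer would fetch document_url, then try each part in order and stop at the first one that locates something, which matches the fallback behaviour described above.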


Xpointer (5)

• XPointers are more frequently used in XLinks.

• For example, this simple link points to the first book child of the bookcoll child of the testament root element in the document at the relative URL ot.xml:

<In_the_beginning xlink:type="simple"
                  xlink:href="ot.xml#xpointer(/testament/bookcoll/book[position( )=1])">
  Genesis
</In_the_beginning>


Cascading Style Sheets (CSS)

• The names of most elements describe the semantic meaning of the content they contain. However, ultimately this content needs to be formatted and displayed to users.
• For this to occur, there must be a step where formatting information is applied to the XML document and the semantic markup is transformed into presentational markup.
• There are a variety of choices for the syntax of this presentation layer. However, two are particularly noteworthy:
– Cascading Style Sheets (CSS)
– XSL Formatting Objects (XSL-FO)


CSS (2)

• CSS is a non-XML syntax for describing the appearance of particular elements in a document.
• CSS is a very straightforward language. No transformation is performed. The parsed character data of the document is presented more or less exactly as it appears in the XML document.
• A CSS stylesheet does not change the markup of an XML document at all; it merely applies styles to the content that already exists.
• By way of contrast, XSL-FO is a complete XML application for describing the layout of text on a page.


CSS (3)

• XSL-FO has elements that represent pages, blocks of text on the pages, graphics, horizontal rules, and more.
• One does not normally work with this application directly. Instead, one can write an XSLT stylesheet that transforms the document's native markup into XSL-FO.
• The application rendering the document reads the XSL-FO and displays it to the user.
• CSS Level 2 is the current recommendation and the version of CSS described here.
• CSS Level 2 places XML on an equal footing with HTML.


CSS (4)

[Figure: A semantically tagged XML document after application of a CSS stylesheet]


CSS (5)

• This stylesheet (recipe.css) has four style rules.
• Each rule names the element(s) it formats and follows that with a pair of curly braces containing the style properties to apply to those elements.
• Each property has a name such as font-family and a value such as "New York", "Times New Roman", serif.
• Properties are separated from each other by semicolons.
• Neither the names nor the values are case sensitive. That is, font-family is the same as FONT-FAMILY or Font-Family.
• CSS Level 2 defines over 100 different style properties. However, you don't need to know all of these. Reasonable default values are provided for all the properties you don't set.


CSS (6)

• For example, the first rule applies to the recipe element and says that it should be formatted using the New York font at a 12 point size. If New York isn't available, then Times New Roman will be chosen instead; if that isn't available, then any convenient serif font will suffice.
• These styles also apply to all descendants of the recipe element; that is, the styles cascade down the tree. Since recipe is the root element, this sets the default font for the entire document.
• The second rule makes the dish element look like a heading, as you can see in the rendered document.
• It's set to a much larger sans serif font and made bold and centered besides. Furthermore, its display style is set to block. This means there'll be a line break between the dish and its next and previous sibling elements.


CSS (7)

• The third rule formats the ingredients as a bulleted list, while the fourth rule formats both the directions and story elements as more-or-less straightforward paragraphs with a little extra whitespace around their top and left-hand sides.
• Not all the elements in the document have style rules and not all need them.
• For example, the step element is not specifically styled. Rather, it simply inherits a variety of styles from its ancestor elements directions and recipe, as well as using some defaults. A different stylesheet could add a rule for the step element that overrides the styles it inherits. For example, this rule would set its font to 10 point Palatino:
• step {font-family: Palatino, serif; font-size: 10pt }


CSS (8)

Associating Stylesheets with XML Documents
• CSS stylesheets are primarily intended for use in web pages.
• Web browsers find the stylesheet for a document by looking for xml-stylesheet processing instructions in the prolog of the XML document.
• This processing instruction should have a type pseudoattribute with the value text/css and an href pseudoattribute whose value is an absolute or relative URL locating the stylesheet document.
• <?xml-stylesheet type="text/css" href="recipe.css"?>


CSS (9)

• Including the required type and href pseudoattributes, the xml-stylesheet processing instruction can have up to six pseudoattributes:
– type
This is the MIME media type of the stylesheet; text/css for CSS and application/xml (not text/xsl!) for XSLT.
– href
This is the absolute or relative URL where the stylesheet can be found.
– charset
This names the character set in which the stylesheet is written, such as UTF-8 or ISO-8859-7.
– title
This pseudoattribute names the stylesheet. If more than one stylesheet is available for a document, the browser may (but is not required to) present readers with a list of the titles of the available stylesheets and ask them to choose one.


CSS (10)

– media
Printed pages, television screens, and computer displays are all fundamentally different media that require different styles. For example, comfortable reading on screen requires much larger fonts than on a printed page. This pseudoattribute specifies the media types this stylesheet should apply to. There are nine predefined values:
screen, tty, tv, projection, handheld, print, braille, aural, all

By including several xml-stylesheet processing instructions, each pointing to a different stylesheet and each using a different media type, you can make a single document attractive in many different environments.


CSS (11)

– alternate
This pseudoattribute must be assigned one of the two values yes or no. yes means this is an alternate stylesheet, not normally used. no means this is the stylesheet that will be chosen unless the user indicates that they want a different one. The default is no.

For example, this group of xml-stylesheet processing instructions could be placed in the prolog of the recipe document to make it more accessible on a broader range of devices:

<?xml-stylesheet type="text/css" href="recipe.css" media="screen"
                 alternate="no" title="For Web Browsers" charset="US-ASCII"?>
<?xml-stylesheet type="text/css" href="printable_recipe.css" media="print"
                 alternate="no" title="For Printing" charset="ISO-8859-1"?>
<?xml-stylesheet type="text/css" href="big_recipe.css" media="projection"
                 alternate="no" title="For presentations" charset="UTF-8"?>
<?xml-stylesheet type="text/css" href="tty_recipe.css" media="tty"
                 alternate="no" title="For Lynx" charset="US-ASCII"?>
<?xml-stylesheet type="text/css" href="small_recipe.css" media="handheld"
                 alternate="no" title="For Palm Pilots" charset="US-ASCII"?>
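
How an application might pick a stylesheet by media type can be sketched with Python's standard xml.dom.minidom, which keeps processing instructions from the prolog as children of the document node. The function names and the sample recipe document are hypothetical. The sketch also makes simplifying assumptions: pseudoattributes are written with double quotes, media holds a single value, and a missing media pseudoattribute is treated as all.

```python
import re
from xml.dom import minidom

# Hypothetical sample document with two stylesheet instructions.
RECIPE = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="recipe.css" media="screen"?>
<?xml-stylesheet type="text/css" href="printable_recipe.css" media="print"?>
<recipe><dish>Sample dish</dish></recipe>
"""

def stylesheets(xml_text):
    """Return one dict of pseudoattributes per xml-stylesheet instruction."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:  # prolog instructions sit beside the root element
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # Pseudoattributes are plain text inside the instruction, so they
            # are parsed here with a regular expression (double quotes only).
            pairs = re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data)
            found.append(dict(pairs))
    return found

def href_for_media(xml_text, media):
    """Return the href of the first stylesheet matching the media type."""
    for sheet in stylesheets(xml_text):
        if sheet.get("media", "all") in (media, "all"):
            return sheet["href"]
    return None

print(href_for_media(RECIPE, "print"))
```

The pseudoattributes have to be parsed by hand because, to an XML parser, the content of a processing instruction is just an opaque string.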


CSS (12)

Selectors
• CSS provides limited abilities to select the elements to which a given rule applies.
• Many stylesheets only use element names and lists of element names separated by commas, as shown in recipe.xml.
• However, CSS provides some other basic selectors you can use, though they're by no means as powerful as the XPath syntax of XSLT.

The Universal Selector
• The asterisk matches any element at all; that is, it applies the rule to everything in the document that does not have a more specific, conflicting rule. For example, this rule says that all elements in the document should use a large font:
• * {font-size: large}


CSS (13)

Matching Descendants, Children, and Siblings
• An element name A followed by another element name B matches all B elements that are descendants of A elements.
• For example, this rule matches quantity elements that are descendants of ingredients elements, but not other ones that appear elsewhere in the document:
• ingredients quantity {font-size: medium}
• If the two element names are separated by a greater than sign (>), then the second element must be an immediate child of the first for the rule to apply.


CSS (14)

• For example, this rule gives quantity children of ingredient elements the same font-size as the ingredient element:
• ingredient > quantity {font-size: inherit}
• If the two element names are separated by a plus sign (+), then the second element must be the next sibling element immediately after the first element.
• For example, this style rule sets the border-top-style property for only the first story element following a directions element:
• directions + story {border-top-style: solid}


CSS (15)

Attribute Selectors
• Square brackets allow you to select elements with particular attributes or attribute values.
• For example, this rule hides all step elements that have an optional attribute:
• step[optional] {display: none}
• This rule hides all elements that have an optional attribute regardless of their name:
• *[optional] {display: none}
• An equals sign selects an element by a given attribute's value.
• For example, this rule hides all step elements that have an optional attribute with the value yes:
• step[optional="yes"] {display: none}


CSS (16)

• The ~= operator selects elements that contain a given word as part of the value of a specified attribute. The word must be complete and separated from other words in the attribute value by whitespace, as in an NMTOKENS or ENTITIES attribute. That is, this is not a substring match. For example, this rule makes bold all recipe elements whose source attribute contains the word "Anderson":
• recipe[source~="Anderson"] {font-weight: bold}
• Finally, the |= operator matches against the first word in a hyphen-separated attribute value, such as Anderson-Harold or fr-CA.
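
The difference between the two operators and a plain substring test is easy to get wrong, so here is a small Python sketch of the matching rules as they are defined in CSS Level 2. The two function names are hypothetical; they model only the comparison on the attribute value, not a CSS engine.

```python
def matches_word(value, word):
    """[attr~="word"]: word is one of the whitespace-separated words in value."""
    return word in value.split()

def matches_hyphen_prefix(value, prefix):
    """[attr|="prefix"]: value equals prefix, or starts with prefix plus a hyphen."""
    return value == prefix or value.startswith(prefix + "-")

# ~= needs a complete word, so a substring is not enough.
print(matches_word("Harold Anderson", "Anderson"))   # whole word present
print(matches_word("Andersonville", "Anderson"))     # only a substring

# |= looks at the part before the first hyphen.
print(matches_hyphen_prefix("fr-CA", "fr"))
print(matches_hyphen_prefix("french", "fr"))
```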


CSS (17)

Pseudoclass Selectors
• Pseudoclass selectors match elements according to a condition not involving their name.
• There are seven of these. They are all separated from the element name by a colon.
• For example, the first-child pseudoclass matches the first child element of the named element. When applied to recipe.xml, this rule italicizes the first, and only the first, step element:
• step:first-child {font-style: italic}
• The link pseudoclass matches the named element if and only if that element is the source of an as yet unvisited link. For example, this rule makes all links in the document blue and underlined:
• *:link {color: blue; text-decoration: underline}


CSS (18)

• The visited pseudoclass applies to all visited links of the specified type. For example, this rule marks all visited links as purple and underlined:
• *:visited {color: purple; text-decoration: underline}
• The active pseudoclass applies to all elements that the user is currently activating (for example, by clicking the mouse on). Exactly what it means to activate an element depends on the context, and indeed not all applications can activate elements.
• For example, this rule marks all active elements as red:
• *:active {color: red}


CSS (19)

• The linking pseudoclasses are not yet well-supported for XML documents because most browsers don't recognize XLinks.
• The hover pseudoclass applies to elements on which the cursor is currently positioned but which the user has not yet activated.
• For example, this rule marks all these elements as green and underlined:
• *:hover {color: green; text-decoration: underline}
• The focus pseudoclass applies to the element that currently has the focus.
• For example, this rule draws a one-pixel red border around the element with the focus, assuming there is such an element:
• *:focus {border: 1px solid red }


CSS (20)

• Finally, the lang pseudoclass matches all elements in the specified language as determined by the xml:lang attribute.
• For example, this rule uses the David New Hebrew font for all elements written in Hebrew (more properly, all elements whose xml:lang attribute has the value he or any subtype thereof):
• *:lang(he) {font-family: "David New Hebrew"}


CSS (21)

Pseudoelement Selectors
• Pseudoelement selectors match things that aren't actually elements. Like pseudoclass selectors, they're attached to an element selector by a colon. There are four of these:
– first-letter
– first-line
– before
– after
• The first-letter pseudoelement selects the first letter of an element. For example, this rule makes the first letter of the story element a drop cap:
• story:first-letter { font-size: 200%; font-weight: bold; float: left; padding-right: 3pt }


CSS (22)

• The Display Property
• Display is one of the most important CSS properties. This property determines how the element will be positioned on the page.
• There are 18 legal values for this property.
• However, the two primary values are inline and block. The display property can also be used to create lists and tables, as well as to hide elements completely.
• Inline Elements
• Setting the display to inline, the default value, places the element in the next available position from left to right, much as each word in this paragraph is positioned. The text may be wrapped from one line to the next if necessary, but there won't be any hard line breaks between each inline element.


CSS (23)

• In recipe.xml and recipe.css, the quantity, step, person, city, and state elements were all formatted as inline. This didn't need to be specified explicitly because it's the default.
• Block Elements
– In contrast to inline elements, an element set to display: block is separated from its siblings, generally by a line break.
– For example, in HTML, paragraphs and headings are block elements. In recipe.{xml,css}, the dish, directions, and story elements were all formatted with display: block.
• List Elements
– An element whose display property is set to list-item is also formatted as a block-level element.
– However, a bullet is inserted at the beginning of the block.


CSS (24)

– The list-style-type, list-style-image, and list-style-position properties control which character or image is used for a bullet and exactly how the list is indented. For example, this rule would format the steps as a numbered list rather than rendering them as a single paragraph:
– step { display: list-item; list-style-type: decimal; list-style-position: inside }

Hidden Elements
– An element whose display property is set to none is not included in the rendered document the reader sees. It is invisible and does not occupy any space or affect the placement of other elements.
– For example, this style rule hides the story element completely:
– story {display: none}