putting xml to work henry s. thompson hcrc language technology group university of edinburgh

133
Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Upload: lewis-arnold

Post on 26-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Putting XML to Work

Henry S. ThompsonHCRC Language Technology

GroupUniversity of Edinburgh

Page 2: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

2

Overview of the tutorial What is an XML application?

Content, Form, Function Namespaces

Ownership of names XSL(T)

Style for XML DOM

Standard abstract API for XML XML Schema

Specifying the structure of document families RDF

Defining and using Data Models

2When you see this, it means there’s accompanying information in the Additional Materials handbook

Page 3: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

3What is an XML application?

Putting XML to work means designing an XML application

SGML defines an application as having A syntax: what do all the documents involved

in this application share in terms of structure == markup?

A semantics: what do the components of that markup mean

You already know the basic story about defining a syntax You can use English (or French or . . .) You should use a DTD Or even better a Schema

Page 4: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

4

An aside about structure Formal definitions of document structure

are a kind of contract The user undertakes to structure his/her

documents per the DTD The application undertakes to process

documents which conform to the DTD As is the case with real contracts, both

sides benefit by using them Users know what they have to do to get the

results they want Developers can depend on parsers for a lot

Page 5: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

5

The W3C

A word from our sponsors The W3C is responsible for all the XML

family The W3C is The World Wide Web

Consortium, a voluntary association of companies and non-profit organisations. Membership costs serious money, confers voting rights. Complex procedures, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.

Page 6: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

6

. . . and its WGs

The XML recommendation was written by the W3C’s XML Working Group

Which split itself into pieces, each of which handles a part of the ongoing work Core WG (XML itself, Namespaces, Infoset) Schema WG (XML Schema) Linking WG (XLink, XPointer) Query WG (XML Query) Protocols WG (XML Protocols) XSL WG (XSLT, XSL-FO)

Page 7: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

XML Namespaces

Henry S. ThompsonHCRC Language Technology

GroupUniversity of Edinburgh

Page 8: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

8

Namespaces for XML

Where did those colons come from? xsl:this, fo:that, xml:the_other

Two communities pushed for namespaces Vendors, to manage the composition of

document fragments– E.g. the inclusion of mathematical formulae

in a document Working groups, to reserve names

without compromising users' freedom to name things– E.g. it wouldn't do for XML-link to reserve LINK for simple links, or XSL to reserve TEXT

Page 9: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

9

Namespaces, cont'd

A W3C Recommendation was endorsed in January 1999 There was a lot of vendor pressure to

get something in place, which caused political tension and at least one resignation from the WG

The example illustrates how namespaces are declared, scoped and used

4

Page 10: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

10

Namespaces defined

You can use prefixed names, consisting of two simple names separated by a colon (:)

The namespace prefix is an abbreviation for a URI which uniquely identifies the owner/meaning/identity of the source of the name

Using a namespace essentially cedes responsibility for the meaning of the qualified names to the owner of the URI

Page 11: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

11

Declaring a namespace The association between namespace

prefixes and URIs is declared using reserved attributes<doc xmlns:mml='http://www.w3.org/TR/REC-MathML/'>...</doc>

Anywhere inside the above doc element mml is a legal namespace prefix, standing for the URI given

There is also a mechanism for defining the default (unprefixed) namespace

Declarations are scoped Qualified names can be used for

Element type names Attribute names

Page 12: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

12

Namespace limitations An add-on for, not a rewrite of, the XML

spec Validation is unchanged

Declarations must match instances character by character

Indeed there's no place for associating prefixes with URIs in DTDs

There is no provision for merging DTDs The rules are confusing

Unprefixed attributes are never qualified Unprefixed elements are qualified if and

only if there is a default namespace declaration in scope

Page 13: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

From Structure to Appearance:

Style for XML

Henry S. ThompsonHCRC Language Technology

GroupUniversity of Edinburgh

Page 14: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

14

Overview of the material

Why a style language? Two approaches to style for XML:

CSS for simple cases XSL for complex cases

Hands-on Exercises

2When you see this, it means there’s accompanying information in the Additional Materials handbook

Page 15: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

15

Why a style language?

Separating form from content Separating structure from

appearance Single source, multiple delivery

media

Page 16: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

16

Three stages on the way Document Compilers: ASCII text with

formatting instructions and body text intermixed nroff, Scribe, TeX

WYSIWYG Word Processors: Out-of-band formatting instructions change appearance on-screen; proprietary file formats. Word, Word Perfect

(Semi-)Structured Markup: Markup has either intrinsic or extrinsic rendering consequences. SGML, HTML, XML

Page 17: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

17

Is this progress?

The old document compilers had complex procedural semantics, which

made debugging and maintenance very tricky for documents of any sophistication.

made authoring and reading tedious, with obtrusive annotations everywhere.

The use of scoped annotations in Scribe and TeX was a big improvement over _roff, but the annotations were still resolutely about appearance, not structure.

LaTeX tried to fix this, but paid an unacceptable price in terms of complexity and fragility.

Page 18: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

18

Is this progress?, cont'd

The WYSIWYG systems are lovely to look at, and there's no

problem with obtrusive annotations. but even with the addition of

paragraph and character styles, generalisation and consistency are hard to come by.

and there's the built-in obsolescence of proprietary formats to worry about.

Page 19: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

19

SGML . . .

SGML solved the proprietary format problem It's an ISO standard (8879) It's human-readable (and

understandable!) But for a long time there was no

standard way of formatting SGML documents for printing or viewing

Page 20: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

20

. . . and HTML

So HTML (nearly/post-hoc an SGML application), by mandating a rendering semantics for all its semi-structural markup, filled a real need.

But it was not extensible (fixed tag set) not customisable (fixed appearance per

tag)

Page 21: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Three Problems; Three Solutions: Electronic Style! Style standard for SGML?

DSSSL Customise HTML page appearance?

CSS Extend HTML tag-set and control

style? XML and

– CSS– XSL

Technology Appraisals

Henry S. ThompsonStyle for XML, London 1998-11-25

9

Page 22: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

22

Cascading Style Sheets Level 1 Accepted Recommendation per

W3C, December 1996 Level 2 Accepted Recommendation, May

1998 Addresses the problems of:

customising the appearance of HTML documents

minimal styling for XML Initially driven by the need for site

designers to differentiate the appearance of their pages from one another

Focus accordingly is on controlling the colour, size and shape of regions and fonts

Page 23: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

23

A pretty CSS example

Page 24: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

24

CSS rules CSS style rules associate properties with

elements in your documents which match selectors

The basic structure of a rule looks like this:selector[, selector ...] {pname: pvalue[; pname: pvalue ...]}

Simple examples:verbatim {white-space: pre}H1 {text-align: center; font-variant: small-caps}

The first would provide style for an XML doc't

The second would change HTML's H1 6

Page 25: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

25CSS: Cascading Style Sheets

Customising HTML formatting <P> elements by means of

simple instruction:

Formatting XML formatting <foobar> elements by

means of similar instruction:

P {font-weight: bold; font-size: 14pt; font-family: sans-serif}

foobar {display: block; border-style: solid; background-color: green}

Page 26: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

26Associating rules with documents

Contents of STYLE element in the HTML header

Destination of an appropriate LINK element

In STYLE attributes on any HTML element

6

7

Page 27: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

27

CSS selectors

Rules can have one or more selectors, separated with commas

Simple names select elements by name In addition to element type names, other

selector syntax includes Space-separated lists, indicating (non-

immediate) ancestry Qualification with period or hash, indicating

class or id attribute matching Qualification with colon (pseudo-classes), for

link state and typographic sensitivity

Page 28: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

28CSS: Cascading Style Sheets

Store CSS instructions in separate style file<?xml version="1.0" standalone="yes"?>

<?xml-stylesheet type="text/css" href="mystyle.css"?><article> <title>An example</title><text><quote>It was the best of times, it was the worst of times</quote>, wrote <author>Charles Dickens</author> in <book>Tale of Two Cities</book>.</text></article>

Page 29: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

29Using classes to get ready for XML

You can cheat with your HTML to make it look more like XML<DIV CLASS='MESSAGE'>Some text which is<SPAN CLASS='EMPH'>really</SPAN> important.</DIV>

And use class selectors in your stylesheet.MESSAGE {display:block; margin-top: 6pt}.EMPH {display:inline; font-style:italic}

8

Page 30: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

30CSS selectors: Vertical context

Sometimes you need context-sensitive selectors

For depth-sensitive renderingOL {list-style-type: lower-alpha}OL OL {list-style-type: lower-roman}

For context-appropriate renderingH1 {font-weight: bold;font-size: large}H2 {font-weight: bold;font-style: italic}H3 {font-style: italic}H2 EM,H3 EM {font-style: normal}

Note that in the last rule we have two selectors, separated by commas, sharing the same result

Page 31: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

31

CSS boxes

CSS and HTML 4.0 et seq use a nested-boxes rendering model, and every block element is rendered into a box

Boxes all have margins, borders and padding (outside in)

All four margins and paddings (left-,right-, top-, bottom-) have width properties, and a shorthand property for setting them all together

Page 32: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

32

CSS borders

Borders, in addition to widths, have colours and styles, plus shorthand properties for various combinations

There are also float and clear properties to allow a modest amount of displacement and flow-around.

CSS2 goes a lot further with this

Page 33: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

33

P { margin: 3ex; border-width: thin; border-style: solid; border-left: double; text-align: justify; border-color: blue; padding: 2ex 4ex}

gives the following for a sample paragraph

CSS box example

Page 34: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

34

CSS property values

Some are symbolic, e.g. font-style: italic

URLs appear in a few places, e.g. background-image: url(http://www...)

Most are lengths, e.g. 3em, 2px percentages, e.g. 110% numbers colours, e.g. red, #fd0

Page 35: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

35

CSS

In HTML you can “invent” your own tags using classes

And you can define how they should be rendered

<div class="newsarticle">Here is some text. It’s a news article. It mentions <span class="company">Reuters</span>.</div>

<STYLE>div.newsarticle {text-align: left; font-style: italic}span.company {color: red}</STYLE>

Page 36: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

36

Exercise: >> cat exa12.xml

<?xml version="1.0" standalone="yes"?><newsarticle><headline>A newsarticle.</headline><author>Marc Moens</author><newsbody>Here is some text. It's a news article. It mentions<company>Reuters</company>. And since<company>Reuters</company> is a company, we would like it to come out slightly differently.</newsbody></newsarticle>

Page 37: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

37

Exercise: >> cat exa12.xml Change this into an HTML file, keeping

our own “invented” tags (like “headline”, “newsbody”, “company”) as classes For example

<headline> <div class="headline"> call it exa12.html See slide 29 for ideas

Put the style instructions in a separate file call it exa12.css See page 7 for how to connect the two files

7

Page 38: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

38

Solution: >> exa12.html

<HTML><HEAD><TITLE>A simple example</TITLE><LINK rel=stylesheet type="text/css" href="exa12.css"></HEAD><BODY><DIV class="headline">A newsarticle</DIV><DIV class="author">Marc Moens</DIV><DIV class="newsbody">Here is some text. It’s a news article. It mentions<SPAN class="company">Technology Appraisals</SPAN>. And since <SPAN class="company">Technology Appraisals</SPAN> is a company, we would like it to come out slightly differently.</DIV></BODY></HTML>

Page 39: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

39

Solution: >> exa12.css

div.headline {text-align: center; color: blue; border-style: dashed; font-size: xx-large}div.author {text-align: center; color: blue; font-size: large}div.newsbody {text-align: left; font-style: italic}span.company {color: red}

Page 40: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

40

Solution in the Browser

Page 41: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

41

Style

This was CSS as applied to HTML Later:

CSS to render XML other ways of rendering XML

Page 42: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

42

The 'Cascade' in CSS

What happens when there is more than one rule which provides a value for a property on a given element?

The highest priority value assignment wins

When no assignment is found, the value is either inherited or defaulted

This explains why our original H1 example was bold

Page 43: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

43

CSS priority

A number of things contribute to determining priority Origin, in increasing order of importance

– browser– user– author

Specificity, in increasing order of importance– Number of element types– Number of CLASS selectors– Number of ID selectors

Importance, marked with !important

Page 44: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

44

CSS cascade example

The following are in increasing order of priority

LIUL LIUL OL LILI.specialOL LI.special#hotone

Page 45: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

45

CSS for XML for real

In principle, it's easy Just use your own element type names

instead of HTML's In practice

IE and Mozilla support it Functionality often insufficient for

complex document types Style sheet linkage is via a PI

<?xml-stylesheet type="text/css" href="…"?>

9

Page 46: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

46Ecise: >exa13.xml (=exa12.xml)

<?xml version="1.0" standalone="yes"?><newsarticle><headline>A newsarticle.</headline><author>Marc Moens</author><newsbody>Here is some text. It's a news article. It mentions <company>Technology Appraisals</company>. And since<company>Technology Appraisals</company> is a company, we would like it to come out slightly differently.</newsbody></newsarticle>

Page 47: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

47

Exercise: >> exa13.xml Create CSS style sheet for exa13.xml

call it exa13.css you can probably reuse material from exa12.css(if you didn’t do that exercise, use exa12style.css)

Remember to use display:block or display:inline in every style rule

Link the document to the stylesheet with<?xml-stylesheet type='text/css' href='exa13.css'?>

View it in Mozilla

Page 48: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

48

CSS: Summary

Easy to learn Also useful for HTML Works in most browsers

Page 49: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

49

What is DSSSL? An ISO standard (ISO 10179:1996) A style language

How do I format my SGML documents? A transformation language

How do I transform my SGML documents?

A hopeless acronym Document Style Semantics and

Specification Language A lost opportunity!

Sunk by webhead round-paren allergies

Page 50: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

50XSL: Extensible Stylesheet Language

A style language specifically for XML– W3C recommendation, Nov 1999

Synthesis of the best of CSS and DSSSL– DSSSL processing and formatting models– CSS properties

XSL is XML– A declarative specification of both the

"pattern" and the "action" of template rules. More generic than CSS

– style and rendering are just a special case of more general tree transformation processes

– can be used for other transformations (XSL-T)

Page 51: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

51XSL: Extensible Stylesheet Language

Main properties localised (template rules keyed to

elements) scoped (inheritance of general

characteristics) unbiased (with respect to writing

direction, language,…)– “indent paragraph” will mean

“indent on the left” for English “indent on the right” for Hebrew

– “inline” will mean left to right for English top to bottom for some Japanese text

Page 52: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

52

What is XSL for? Portable standard style specification Single source documents, multiple

delivery media Print Presentation Screen

Multiple document types, single house style

Just as much complexity as you need Single continuous scroll for screen delivery Multi-column pages with side bars etc. for

books

Page 53: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

53

What is XSL not for?

Controlling filling and line breaking: left to a low-level formatting engine

Page or line fidelity: use a photocopier!

Carefully crafted page layout: emphasis is on automatable processes

User interaction: ditto

Page 54: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

54

XSL process architecture

CSS takes document tree and decorates it with formatting properties

XSL takes a document tree and builds a new document tree which it then decorates XSL is really two languages

– a transformation language (XSLT)– a formatting language (XSL-FO)

Page 55: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

55

XSL Transformations

XSL style sheet: template rules pattern which specifies which tree it

applies to result which specifies which tree it

should output

XSL processor reads XML document and XSL stylesheet carries out the instructions in the

stylesheet (outputs a new XML document)

Page 56: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

56

XSL Transformations

From XML to XML not from/to PDF, TeX, Word,

Postscript,… you can go from XML to intermediate

language, and then different processor to Tex, Word,…

you can go from/to HTML or SGML if it’s well-formed XML

Page 57: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

57

XSL Transformations

Three places it can happen Web browser (e.g. Mozilla) is handed XML

document and stylesheet, transforms the document and presents it to the user

The server applies style sheet to document to create different format (e.g. HTML) and sends that document to the client – http://xml.apache.org/xalan-j/

A program (e.g. saxon) transforms XML document before it is placed on the server– http://sourceforge.net/projects/saxon

Page 58: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

58

Key style concepts

Modular Structuring of specs at the XML level

Localised template rules keyed to elements

Scoped Generic characteristics are inherited

Unbiased No expectations regarding national

language, writing direction or character set

Page 59: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

59

Process architecture CSS takes a document tree and decorates

it with formatting properties XSL takes a (source) document tree and

builds a new (result) document tree XSL provides a set of formatting objects

with sophisticated rendering properties If your stylesheet uses formatting objects

for its result tree, you get a result which constitutes a set of claims on appearance defined by the XSL recommendation

Page 60: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

60

Formatting objects

From the simple block, inline, graphic-rule, character, sequence

To the complex: page-master, page-set-sequence, column-master, column-set-sequence, table

Each has properties appropriate to its semantics

You can think of them as replacing and extending the CSS display property

Page 61: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

61Formatting objects and areas The semantics of formatting objects is

expressed in terms of areas Areas are rectangular regions

produced by the rendering of formatting objects

A formatting object when rendered produces one or more fixed-dimension areas, whose position is determined by a parent in the formatting object tree

There are two types of areas: inline areas display areas 12

Page 62: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

62

Filling areas The process by which a formatting

object positions the areas of its daughters in its own area is called filling

Filling is different for the two different types of areas Display areas are filled into area containers Inline areas are filled into lines

Most formatting object classes are unequivocally inlined or displayed, that is, destined to be rendered into inline or display areas respectively 12

Page 63: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

63Formatting objects and filling A flow object must specify area

containers and/or lines for its contents to be filled into

For example a page-master formatting object specifies one or more area containers, into which the display areas resulting from its children are filled

And a block flow object specifies a line, into which the inline areas resulting from its children are filled

Finally a character formatting object yields an inline area containing a glyph 13

Page 64: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

64

XSL is XML

No parentheses! XSL is notated with XML element

types DSSSL semantics without DSSSL

syntax But you can think of it more like

specifying a transformation from one DTD (the source document) to another (for a formatting object specification document)

Page 65: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

65Fundamental aspects of XSL

An iconic, declarative approach, using XML to specify both the "pattern" and the "action" of template rules.

An XML DTD for a set of formatting object types and attributes adequate to provide on-screen display as well as printing support.

Page 66: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

66

Template rules

The main component of an XSL stylesheet is the template rule

Each template rule contains A pattern, identifying the source

document elements which the rule should apply to– Like a CSS selector

A template, specifying what gets added to the result tree when this rule fires– Crucially, and unlike CSS, this typically

involves elements

Page 67: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

67

Simple rule example

<xsl:template match='div/title'> <fo:block font-weight='bold'> <xsl:apply-templates/> </fo:block></xsl:template>

T h e s . . .

div

title

Block [f-w: bold]

T h e s . . .

PatternRestriction onmatch context The el't type

to match

The for-matting objectto be created

The content of the formatting object:use the subordinate results

Page 68: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

68

XSL and CSS

We could try translate our example into CSS as follows:

div title { font-weight: bold }

But that would actually be wrong: The interpretation of nesting is different:

The XSL pattern matches title elements with div as parent, where the CSS pattern matches title elements with div as ancestor at any remove.

XSL does not require a one-to-one relation between source and destination

Page 69: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

69

Richer patterns

XSL can restrict matches based on ancestry descendants position wrt siblings attribute presence/absence/value

These are expressed in the form of path expressions, which are shared with the draft XPointer proposal

The common part is called XPath

Page 70: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

70

XPath patterns / for (root's) children // for (root's) descendants .. for parent name for matching elements @name for matching attributes [. . .] for conditions

=, != for (in)equality Numerical and boolean expressions String and number literals Special-purpose functions

– not(…), position(), last() 15

Page 71: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

71

Specificity

With all these pattern variants, what happens if two rules match?

Drawing on both DSSSL and CSS, there are a set of precedence rules

Basically, the richer the pattern, the higher precedence

If all else fails, there is a numeric priority attribute

Page 72: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

72

Iconic actions The 'action' part of a rule isn't much

like an action at all It's more like a picture of what you

want in the way of formatting objects Nesting is specified directly So you can build up quite detailed

formatting object structures The special xsl:apply-templates

element type determines where the formatting objects resulting from processing the children of the matched node should be plugged in

Page 73: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

73

Action example

This 'action' builds a rich result structure <fo:block> <fo:inline width='3cm'> <fo:sequence font-weight='bold'> <xsl:value-of select='@name'/> <xsl:text>. . . . .</xsl:text> </fo:sequence> </fo:inline> <fo:sequence font-posture='italic'> <xsl:apply-templates/> </fo:sequence> </fo:block>

Page 74: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

74

Plugging in results

demo

HTML

BODY

<x:templ match='demo'><HTML> <BODY> <x:apply-templates/> </BODY></HTML>

P

<x:templ match='para'><P> <x:apply-templates/></P>

para

T h e f . . .T h e f . . .

para

T h e s . . .

P

T h e s . . .

Page 75: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

75Where did the HTML come from?

You don't have to use formatting objects for your result tree

HTML elements are an obvious alternative

But HTML isn't XML (although there is XHTML as well …) XSLT allows you to specify output type

HTML Browser converts automatically

17

Page 76: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

76

XSLT Exercise

Copy exa15.xm to exa15.xml Look at exa15.xml with Mozilla Copy catalog.xs to catalog.xsl edit exa15.xml to contain a stylesheet

reference to catalog.xsl:<?xml-stylesheet type='text/xsl' href='catalog.xsl'?>

Look at exa15.xml again Edit catalog.xsl to provide template

rules for the components of exa15.xml so that it looks interesting

Page 77: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

77

Step by step

First we need to separate the entries: Add a template rule matching <entry>

– Look at slide 74 for a clue<xsl:template match="entry"> <P> <xsl:apply-templates/> </P></xsl:template>

A different HTML tag is wanted for <name>– Use <B> instead of <P><xsl:template match="name"> <B> <xsl:apply-templates/> </B></xsl:template>

Page 78: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

78

Step by step, cont'd Let's give the price some colour

<xsl:template match="price"> <SPAN STYLE="color: red"> <xsl:apply-templates/> </SPAN></xsl:template>

Separating form from appearance would be better:<xsl:template match="price"> <SPAN CLASS="price"> <xsl:apply-templates/> </SPAN></xsl:template>

Need to add <style> to the <head> as well

Page 79: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

79

Combining XSL and CSS Add a <style> element to the

template for the root<xsl:template match="/"> . . .<head> <style> SPAN.price {color: red} </style> . . .</xsl:template>

Page 80: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

80

Exercise, cont'd Next separate paragraphs for the

options and descriptions, and italicise the <emph>ed material

What about the options themselves? Attributes have to be accessed explicitly<xsl:template match="colors"> <SPAN CLASS="color">Colours: <xsl:value-of select="@value"/> </SPAN></xsl:template>

Note that the literal text appears in the result Handle the other options the same way

Page 81: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

81

Selection

You may not always want to just invoke processing on a node's children in the ordinary way

You can supply a select attribute on xsl:apply-templates to specify what you want processed

If all you want is the text content of an element or attribute as such Use the xsl:value-of element instead

Page 82: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

82

Example of select

<xsl:template match='/'> <HTML> <HEAD> <TITLE> <xsl:value-of select='catalog/title'/> </TITLE> </HEAD> <BODY> <xsl:apply-templates/> </BODY> </HTML></xsl:template>

17

Page 83: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

83

Reordering using select You may not even want material to

appear in the output in the same order it appears in the source, e.g. if the source was derived from a database

select can be used to reorder by pulling out first one child type, then another, etc.

<xsl:apply-templates select='a'/><xsl:apply-templates select='b'/>

All a's will end up before all b's, regardless of where they started

xsl:sort provides more detailed control

Page 84: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

84

Defaults

XSL has two default rules, similar to DSSSL's For character nodes, copy the

character For all others, a sequence of the

results of processing all children

19

Page 85: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

85

Exercise

Use the reordering technique introduced above to improve your catalog stylesheet Use the <title> material twice

– Once for the <title> inside <head>– Once for the reader to see

Display the Name, Price and Part Number in the same order for every item

Page 86: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

86

Watch out for this pitfall You might have tried this to get started<xsl:template match="entry"> <P> <xsl:apply-templates select="name"/> <xsl:apply-templates select="number"/> <xsl:apply-templates select="price"/> </P> <xsl:apply-templates/></xsl:template>

Why didn't this work as expected?

Page 87: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

87

Macros If you have a constellation of

formatting objects and attributes you will use in more than one template rule, defining a named template is good practice

Named templates can be invoked by name or pattern

<xsl:template name='ruled-para'> <xsl:sequence> <fo:graphic-rule width='2mm'/> <fo:block> <xsl:apply-templates/></fo:block> <fo:graphic-rule width='2mm'/> </xsl:sequence></xsl:template>

Page 88: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

88

More sophisticated XSL

The XSL formatting objects contain many properties They enable you to express detailed

constraints on appearance Patterns for matching and selection

also support a wide range of detail The tree construction and

formatting were originally one recommendation, separated to get XSLT out the door 20

Page 89: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

89XSLT for Transformation as such

XML to XML can be very useful DTDs change Documents can be merged

Saxon, Michael Kay’s batch implementation, is ideal here Fast Simple command line interface Supports document() for multiple

inputs

Page 90: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

90The identity transformation The core of every serious

transformation <xsl:templatematch="@*|*|comment()|processing-instruction()"> <xsl:copy> <xsl:apply-templates select="@*|node()"/> </xsl:copy></xsl:template>

Page 91: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

91

Transform exercise

Test the identity transform> saxon exa2.xm copy.xsl

Copy copy.xsl to fix.xsl Note we're using a DTD to help

XED help you Edit it to add a template for

<price> which adds a 'currency' attribute

Try it on exa15.xml as follows> saxon exa15.xml fix.xsl

Page 92: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

92

Heavy hint

<xsl:template match="price"> <price currency="UKL"> <xsl:apply-templates select="@*|node()"/> </price> </xsl:template>

You can compute the value of attributes by using curly braces . . .<price currency="{/catalog/@currency}">. . .

Would copy the currency attribute from the catalog root to price

Page 93: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

93

Variables XSLT is a pure functional language

No state No side-effects

You can bind variables<xsl:variable name="currencySymbol"> &#163;</xsl:variable><xsl:variable name="title" select="/catalog/title"/>

And access them with a dollar-sign ($)

<xsl:value-of select="$currencySymbol"/>

Page 94: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

94Combining several documents

The document function allows access within a stylesheet to named other documents

If bound to a variable, can then be used as the starting point for a search

<xsl:variable name="catalog" select="document('exa15.xml')/*"/>. . .<xsl:value-of select="$catalog/entry[number='E102']/price"/>

Page 95: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

95Multiple document exercise Look at order.xml and fold.xsl

Can you see what fold.xsl is for? Try it:> saxon order.xml fold.xsl

Copy fold.xsl to nfold.xsl Edit it to use the catalog database

to add a name attribute to each order Use the addition of the price attribute

as a model To run this, do

> saxon order.xml nfold.xsl

Page 96: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

96

Implementations of XSL James Clark has implemented most of XSLT

Lotus, IBM and others have done so as well The best Java implementation in my view is

Michael Kay's Saxon (http://sourceforge.net/projects/saxon)

IE supports the whole language And it's much faster than the Java

implementations Mozilla’s implementation is also pretty good

Although not blindingly fast Others are implementing subsets of the

formatting semantics Usage is widespread

Page 97: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Processing XML:SAX and DOM

Henry S. ThompsonHCRC Language Technology

GroupUniversity of Edinburgh

Page 98: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

98

What is SAX Serialised Access to XML A 'fat' stream of bits from an XML document Uses the 'factory' design pattern

Programs register handlers for (subset of events)– Start tag (incl. Attributes)– End tag– Text content– …

Document is parsed (possibly validate)– Appropriate handlers are called– Program does what it likes

No tree unless you build it

Page 99: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

99

What is the DOM?

A programming-language independent interface for XML/HTML documents Object types, properties and methods for

– navigation– construction

Tree-oriented Not a data model

Different expressions of the same XML document present different pictures through the DOM

Page 100: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

100

What is the DOM, cont'd?

A core plus extension for HTML Primary expression is in OMG IDL Bindings provided for

Ecmascript (= standardised Javascript) Java C++

Page 101: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

101

The DOM approach

Not objects, but interfaces Each implementation may have its

own objects But you can build and navigate

through the standard interface

Page 102: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

102

DOM interface types

Node see below

NodeList e.g. for children of elements

NamedNodeList e.g. for attribute list

String all values

Page 103: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

103

DOM Nodes Almost everything is a sub-type of Node Different sub-types allow different types

of children They also support different attributes

and methods A partial list of types and their children

Document Element, ProcessingInstruction, Comment, DocumentType

Element Element, Text, EntityReference, . . .Attr Text, EntityReferenceDocumentType None (!)EntityReference Element, Text, . . .

Page 104: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

104

DOM Issues There are some things which are

implementation-specific Input/output Entities vs. references

There are some things which are language specific property vs. method syntax

There is some redundancy in the API nodeName is defined for most Node

subtypes– for Elements, same as tagName– for Attrs, same as name

2627

Page 105: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

105

DOM examples The same example in two languages

Javascript and Java Bold-face indicates the parallelism

All but the first are DOM components Node creation and decoration is

possible too Loading would then not be needed Both SAXDOM and IE5 define a way to

output a document– SAXDOM: writeDocument produces a text

version– IE5: text output not clear as of this writing, but

can display synthesised documents by handing off to the browser

2627

Page 106: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

106

DOM navigation

Tree-wise parentNode, firstChild, lastChild, nextSibling and previousSibling properties

By name getAttribute(name) getElementsByTagName(name) no access via ID

Page 107: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

107

SAX and DOM verdict DOM is clunky, with various infelicities

But it's more portable Whole-tree approach limits size of

document which can be processed SAX is a clear win here

Both are better than ad-hoc interfaces Quick way to get XML-based e-

commerce applications moving Java server-side (or client-side with SAX) Javascript client-side (not SAX)

Page 108: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

108XML is ASCII for the 21st century ASCII (ISO 646) solved a fundamental

interchange problem for flat text documents What bits encode what characters

– (For a pretty parochial definition of 'character') UNICODE/ISO 10646 extends that

solution to the whole world XML thought it was doing the same for

simple tree-structured documents The emphasis in the XML design was on

simplifying SGML to move it to the Web XML didn't touch SGML's architectural vision

– flexible linearisation/transfer syntax– for tree-structured documents with internal links

Page 109: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

109

The essence of XML

It's a markup language used for annotating text

It is concerned with logical structure to identify sections, titles, section

headers, chapters, paragraphs,… It is not concerned with appearance

you say 'this is a subtitle'not 'this is in bold, 14pt, centered'

you say 'this is an example'not 'this is in verbatim, indented by 5pts, ragged right'

Page 110: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

110The essence of XML, try again

It's a markup language used for transferring data

It is concerned with data models to convert between application-

appropriate and transfer-appropriate forms

It is not concerned with human beings It's produced and consumed by

programs

Page 111: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

111

Application data

Page 112: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

112

Structured markup<POORDERHDR><DATETIME qualifier="DOCUMENT"> <YEAR>1996</YEAR> <MONTH>06</MONTH> <DAY>30</DAY> <HOUR>23</HOUR> <MINUTE>59</MINUTE> <SECOND>59</SECOND> <SUBSECOND>0000</SUBSECOND> <TIMEZONE>+0100</TIMEZONE> </DATETIME> <OPERAMT qualifier="EXTENDED" type="T"> <VALUE>670000</VALUE> <NUMOFDEC>2</NUMOFDEC> <SIGN>+</SIGN> <CURRENCY>USD</CURRENCY>. . .

Page 113: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

113

What just happened!? The whole transfer syntax story just

went meta, that's what happened! XML has been a runaway success, on a

much greater scale than its designers anticipated Not for the reason they had hoped

– Because separation of form from content is right But for a reason they barely thought about

– Data must travel the web Tree structured documents are a useable

transfer syntax for just about anything So data-oriented web users think of XML as

a transfer mechanism for their data

Page 114: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Metadata for XML: RDF and XML-Link

Henry S. ThompsonHCRC Language Technology

GroupUniversity of Edinburgh

Page 115: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

115

XML and e-Business Ed Feigenbaum once described Terry

Winograd’s work as “a breakthrough in enthusiasm” I worry sometimes if XML and e-business is

vulnerable to the same criticism Negotiation between producers and

consumers is the key If you can’t describe what you want, you can’t

have it If you can’t describe what you’ve got, no-one will

use it If you can’t dicker, you’ll always lose

So as far as I can see, for e-Business to be successful the Web badly needs a solution to the metadata problem

Page 116: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

116

The problem

We're all drowning in information: our desktop machines are bursting, our intranets are vast, the World Wide Web is effectively infinite, and significantly different from one week to the next.

Traditional approaches to storage management (hierarchical file systems) and information retrieval (indexing on all words in every document) are clearly already barely coping at best, and are unlikely to meet the challenge.

Page 117: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

117

The problem, cont'd

Exponential increase in average amount of local disk space per individual.

Hyper-exponential growth in connectivity: Intranets; The Internet

Hierarchical file systems, with perhaps formerly meaningful names, are just not adequate to the organisational task which arises.

The idea, if not yet the reality, of metadata has been widely touted as the solution to these problems

Page 118: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

118

Where did I put that?

How can I find the information I need? In a document I wrote once; In a message I received once; In a message I sent once; In a document a colleague wrote; In a document somewhere on the Internet?

Traditional information-retrieval techniques are not working: Too many of the repositories (e.g.

compressed mail archives) are resistant to indexing;

Plain text indexing is too blunt an instrument.

Page 119: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

119

The solution(s)?

There's been a lot of talk about metadata.

What is metadata? It's just data. But it's data about other data That machines can read.

What could metadata do for us? Give search engines something to work

with that is designed for their needs. Give us all a place to record what a

document is for or about.

Page 120: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

120Requirements for metadata

What would we need to make this work? A standard syntax, so metadata can be

recognised as such; One or more standard vocabularies, so

search engines, authors and users all speak the same language;

Lots of documents with metadata attached;

Attributions we can believe; sources we trust.

Page 121: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

121

What is RDF?

RDF is actually two standardisation efforts, under the aegis of the W3C.

It stands for Resource Description Framework (in other words, data about data).

The two efforts are: Standardising the syntax and abstract

semantics; Providing a standard way of defining

standard vocabularies (but not actually defining any).

Page 122: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

122An aside: what is the W3C?

The World Wide Web Consortium. A voluntary association of companies and

non-profit organisations. Membership costs serious money, confers voting rights. Complex procedures, with the Chairman (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.

How do standards get drafted and approved? W3C Draft Recommendations come from

Working Groups with little (XML) or a lot of input from W3C staff (CSS1,2). They are approved by the Chairman.

Page 123: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

123

RDF Model and Syntax

The model is a labelled directed graph, with nodes being loci in hyperspace, and labels being atoms.

There is a notion of reification, which allows property types (= edge labels) and individual edges to themselves be described (i.e. effectively be the sources of edges)

The syntax is XML, almost but not quite expressible in a DTD

Page 124: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

124

A simple RDF example

This is expressed using one of several shorthands The about attribute of a Description

identifies the thing described Each sub-element name is a property Each sub-element either has

– content (the atomic value of its property)– a resource attribute which points to its

value[ora's home

page] Ora Lassilav:Name

[email protected]:Email

[ora himself]

s:Creator

34

Page 125: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

125

What RDF is not

RDF obviously cannot provide the third requirement: lots of meta-described documents

More subtly, RDF is not a knowledge representation system. There is no notion of what a conformant

RDF application might do other than identify and normalise metadata.

We can easily read much more into RDF annotations than is computationally there.

Page 126: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

126

An alternative answer

If RDF is just about connections between points in hyperspace, why isn't XML Link all that's needed?

Before showing how this can be done, a brief diversion into XML Link

Page 127: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

127

What is XLink

Just as XML itself simplified SGML while extending HTML

XLink simplifies HyTime while extending HTML

XLink provides mechanisms for Describing links with link elements Identifying links and link ends by type and

role Locating link ends with a powerful locator

syntax Incorporating link elements in-line or out-of-

line Specifying default behaviours

Page 128: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

128

Simple XLink example

This a simple reconstruction of HTML's A element, specifying two-ended link in-line with one implicit and one explicit locator<refr xlink:type="simple" xlink:href="http://www.w3.org/">The W3C</refr>

On the next slide is a richer example, specifying a two-ended link out-of-line with two explicit locators

Page 129: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

129More complex link example<connect xlink:type='extended'> <dutch xlink:type='locator' xlink:href='http://www.klm.nl/About/Nederlands/default.htm'>

<english xlink:type='locator' xlink:href='http://www.klm.nl/About/default.htm'>

This is a good example of hand-crafted home-page translation pairing.</connect>

Page 130: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Putting XML to Work:Conclusions

Henry S. ThompsonHCRC Language Technology

GroupUniversity of Edinburgh

Page 131: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

131

XML is moving fast

The language itself won't change much Some of the add-ons are also stable

Namespaces XSLT

Other things are pretty close to stabilising XML Link XML Schema

Some parts are much less stable XML Query, Protocols

Page 132: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

132New project today: what to use? XML with DTD, one namespace,

unless you really need SGML Ready for Schema

Author with one of the free tools Validate with RXP Render by:

Using XSL transformation to HTML+CSS– Show with IE or Mozilla online – Use XT, Xalan, SAXON offline and show with

IE4, Netscape– Use XML+CSS in simple cases, show with IE,

Mozilla(?)– For data, use XML+XSL and IE or Mozilla

Script with Java+SAX; Javascript/Python + DOM

Page 133: Putting XML to Work Henry S. Thompson HCRC Language Technology Group University of Edinburgh

Division of Informatics Henry S. ThompsonPutting XML to Work, Edinburgh 2003-12-16

133

What will I use tomorrow?

XML+Schema and Namespaces Validate with schema-based parser Render with XSL to screen or print Build tools with SAX or DOM