IT Strategy, IBS, Technology & Solutions [email protected] / 416.513.5656
XML 101: A Technical Introduction to XML
20 November 2002
Bank of Montreal Database Users Group
Ian GRAHAM
IT Strategy, IBS, Technology and Solutions, BMO Financial Group
E: <[email protected]>
T: (416) 513.5656 / F: (416) 513.5590
To download this talk: http://www.utoronto.ca/ian/talks/
Presentation Outline
1. What is XML (basic introduction)
2. Defining language dialects and constraints
   – DTDs, namespaces, and schemas
3. XML processing
   – Parsers and parser interfaces; XML processing tools
4. XML databases
   – High-level issues, and references
5. XML messaging / web services
   – Why, and some issues/example
6. Conclusions
What is XML?
A base-level syntax
   – For encoding structured, text-based information (words, characters, ...)
A text-based syntax
   – XML is written using printable Unicode characters. Explicit binary data is not allowed
Supports extensible data formats
   – XML lets you define your own elements (essentially data types), within the constraints of the syntax rules
Designed as a universal format
   – The syntax rules ensure that all XML processing software MUST handle a given piece of XML data identically
If you can read and process it, so can anybody else
XML: A Simple Example
<?xml version="1.0" encoding="iso-8859-1"?>
<partorders xmlns="http://myco.org/Spec/partorders">
  <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
    <desc> Gold sprockel grommets, with matching hamster </desc>
    <part number="23-23221-a12" />
    <quantity units="gross"> 12 </quantity>
    <deliveryDate date="27aug1999-12:00h" />
  </order>
  <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
    . . . Order something else . . .
  </order>
</partorders>
Slide callouts:
   – The first line is the XML declaration (“this is XML”); it also flags the character encoding used in the file
   – On the slide, black marks XML tags and markup; blue marks encoded text data
Example Revisited
<partorders xmlns="http://myco.org/Spec/partorders">
  <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
    <desc> Gold sprockel grommets, with matching hamster </desc>
    <part number="23-23221-a12" />
    <quantity units="gross"> 12 </quantity>
    <deliveryDate date="27aug1999-12:00h" />
  </order>
  <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
    . . . Order something else . . .
  </order>
</partorders>
Slide callouts: tags; element; attribute of this quantity element (units="gross")
Hierarchical, structured data
XML Data Model - A Tree
<partorders xmlns="...">
  <order date="..." ref="...">
    <desc> ..text.. </desc>
    <part />
    <quantity />
    <delivery-date />
  </order>
  <order ref=".." .../>
</partorders>
[Figure: the same document drawn as a tree]
partorders (xmlns=)
   order (date=, ref=)
      desc
         text
      part
      quantity
      delivery-date
   order (date=, ref=)
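The tree view above can be reproduced with a few lines of code. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (not mentioned in the talk); the helper name outline is hypothetical, and the sample data is a shortened copy of the earlier part-order example.

```python
import xml.etree.ElementTree as ET

ORDERS_XML = """\
<partorders xmlns="http://myco.org/Spec/partorders">
  <order ref="x23-2112-2342" date="25aug1999-12:34:23h">
    <desc>Gold sprockel grommets, with matching hamster</desc>
    <part number="23-23221-a12"/>
    <quantity units="gross">12</quantity>
    <deliveryDate date="27aug1999-12:00h"/>
  </order>
</partorders>
"""

def outline(element, depth=0):
    """Return one indented line per element, listing its attribute names."""
    # ElementTree spells namespaced names as "{uri}local"; keep the local part.
    name = element.tag.rsplit("}", 1)[-1]
    attrs = " ".join(f"{key}=" for key in element.attrib)
    lines = ["  " * depth + (f"{name} ({attrs})" if attrs else name)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(ORDERS_XML))
print("\n".join(tree_lines))
```

Note that ElementTree treats the xmlns declaration as namespace information rather than as an ordinary attribute, so it does not appear in the attribute list for partorders.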
XML: Design goals
Simple but reliable
   – Strict syntax rules, to eliminate syntax errors
   – Syntax defines structure (hierarchically), and names structural parts (element names) -- it is self-describing data
Extensible and ‘mixable’
   – Can create your own language of tags/elements
   – Can mix one language with another, and still reliably separate / process the data
Designed for a distributed environment
   – Can have remote (‘webbed’) data, and retrieve and use it reliably
XML Processing: The XML Parser
[Figure: XML data → XML parser → parser interface → XML-based application]
The parser must verify that the XML is syntactically correct
Such data is said to be well-formed
   – The minimal requirement to “be” XML
A parser MUST stop processing if the data isn’t well-formed
   – E.g., stop processing and “throw an exception” to the XML-based application. The XML 1.0 spec requires this behaviour
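As an illustration of the "stop and throw an exception" behaviour, here is a minimal sketch assuming Python's standard xml.etree.ElementTree parser. The function name is_well_formed and the two sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return (True, None) if text parses, else (False, error message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser stops at the first well-formedness error it meets.
        return False, str(err)
    return True, None

good = '<order ref="x23"><quantity units="gross">12</quantity></order>'
bad = '<order ref="x23"><quantity units="gross">12</order>'  # mismatched tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

The application only ever sees data from the first string; for the second it receives the exception and no partial document.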
Special Issues: Characters and Charsets
XML specification defines characters allowed as whitespace in tags: <element id = "23.112" />
You cannot use the EBCDIC character ‘NEL’ as whitespace
   – Must make sure to not do so!
What if you want to include characters not defined in the encoding charset (e.g., Greek characters in an ISO-Latin-1 document)?
   – Use character references. For example: &#9824; -- the spades character (♠)
   – ♠ is character number 9824 in the Unicode character set
Also, a reminder that binary data is forbidden
   – It must be encoded as printable characters (e.g. using Base64)
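Both points can be tried out directly. This is a minimal sketch assuming Python's standard base64 and xml.etree.ElementTree modules; the element names suit and attachment, and the sample bytes, are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

# A character reference names a Unicode code point in decimal (or hex).
suit = ET.fromstring("<suit>&#9824;</suit>")
spade = suit.text  # the single character U+2660, code point 9824

# Arbitrary bytes are carried as Base64 text, never as raw binary.
raw = bytes([0, 255, 16, 128])
attachment = ET.Element("attachment", encoding="base64")
attachment.text = base64.b64encode(raw).decode("ascii")
wire = ET.tostring(attachment, encoding="unicode")

# The receiver parses the text and decodes it back to the original bytes.
decoded = base64.b64decode(ET.fromstring(wire).text)
print(wire, decoded == raw)
```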
Parsers and DTDs
[Figure: XML data + DTD → parser → parser interface → XML-based application]
   – A DTD can define external parts (entities) to be ‘included’ in the document
   – But what if the parser can’t find the external parts (firewall?)?
   – That depends on the type: there are two types of XML parsers
      • one that MUST retrieve all parts
      • one that can ignore them (if it can’t find them)
Two types of XML parsers
Validating
   – Must retrieve all entities and process all of the DTD. Will stop processing and indicate a failure if it cannot
   – It must also test and verify other things in the DTD -- instructions that define syntactic document rules (allowed elements, attributes, etc.)
Non-validating (well-formed only)
   – Tries to retrieve all ‘parts’, but will cease processing the DTD content at the first part (entity) it can’t find
   – But this is not an error -- the parser simply makes available the XML data (and the names of any unresolved ‘parts’) to the application
Application behavior will depend on parser type
Many parsers can operate in either mode (a configuration setting)
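To see what "well-formed only" means in practice, here is a minimal sketch assuming Python's standard xml.etree.ElementTree, which is built on the non-validating expat parser. The sample document, the note element and the bank entity are hypothetical. It shows only the non-validating half of the comparison: the standard library used here has no validating mode.

```python
import xml.etree.ElementTree as ET

DOC = """\
<!DOCTYPE transfers [
  <!ELEMENT transfers (fundsTransfer)+>
  <!ELEMENT fundsTransfer (#PCDATA)>
  <!ENTITY bank "BMO Financial Group">
]>
<transfers>
  <note>&bank;</note>
</transfers>
"""

root = ET.fromstring(DOC)
# The entity declared in the internal DTD subset is expanded ...
note_text = root.find("note").text
# ... but the content model is not enforced: <note> is not declared and
# no <fundsTransfer> is present, yet parsing succeeds.
child_tags = [child.tag for child in root]
print(root.tag, child_tags, note_text)
```

A validating parser given the same input would be expected to report the document as invalid, because it breaks the rules in the DTD.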
2. Defining language dialects and constraints
   – DTDs, namespaces, and schemas
Defining constraints / languages
Two ways of doing so:
   – XML Document Type Definition (DTD) -- part of the core XML spec
   – XML Schema (often called XSD) -- a newer specification (2001), which allows for richer constraints on XML documents
What DTDs and/or schemas specify:
   – Allowed element and attribute names, hierarchical nesting rules; element content/type restrictions
Adding dialect specifications implies two classes of XML data
   – Well-formed: XML that is syntactically correct
   – Valid: XML that is well-formed and consistent with a specific DTD (or Schema)
Schemas are more powerful than DTDs
   – Often used for type validation, or for defining low-level type constraints on values (integer, varchar, datetime, etc.)
DTD Example
<!DOCTYPE transfers [
  <!ELEMENT transfers (fundsTransfer)+ >
  <!ELEMENT fundsTransfer (from, to) >
  <!ATTLIST fundsTransfer date CDATA #REQUIRED>
  <!ELEMENT from (amount, transitID?, accountID, acknowledgeReceipt ) >
  <!ATTLIST from type (intrabank|internal|other) #REQUIRED>
  <!ELEMENT amount (#PCDATA) >
  . . . Omitted DTD content . . .
  <!ELEMENT to EMPTY >
  <!ATTLIST to account CDATA #REQUIRED>
]>
<transfers>
  <fundsTransfer date="20010923T12:34:34Z">
    . . . As with previous example . . .
XML Namespaces
Mechanism for identifying different “spaces” for XML names
   – That is, element or attribute names
This is a way of identifying different language dialects, consisting of names that have specific semantic (and processing) meanings
For example, <key/> in one language (e.g. a security key) can be distinguished from <key/> in another language (a database key)
Mechanism uses a special xmlns attribute to define namespaces
   – The namespace name is a URI (usually a URL) string
   – But the URL does not reference anything in particular (there may be nothing there!)
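The <key/> example can be sketched as follows, assuming Python's standard xml.etree.ElementTree. Both namespace URIs, the record element and the key values are hypothetical; as noted above, nothing needs to exist at those addresses.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace names: they only need to be distinct strings.
SEC = "http://example.org/ns/security"
DB = "http://example.org/ns/database"

DOC = f"""\
<record xmlns:sec="{SEC}" xmlns:db="{DB}">
  <sec:key>c2VjcmV0</sec:key>
  <db:key>42</db:key>
</record>
"""

root = ET.fromstring(DOC)
# ElementTree addresses a namespaced element as "{namespace-uri}localname".
security_key = root.find(f"{{{SEC}}}key").text
database_key = root.find(f"{{{DB}}}key").text
print(security_key, database_key)
```

The prefixes sec and db are only local shorthand; software matches on the namespace string itself, so two documents can use different prefixes for the same language.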
Mixing languages together
Namespaces let you do this relatively easily:
<?xml version="1.0" encoding="utf-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:mt="http://www.w3.org/1998/Math/MathML">
<head>
  <title> Title of XHTML Document </title>
</head>
<body>
<div class="myDiv">
  <h1> Heading of Page </h1>
  <mt:math>
    . . . MathML markup . . .
  </mt:math>
  <p> more html stuff goes here </p>
</div>
</body>
</html>
   – Default ‘space’ is xhtml
   – mt: prefix indicates ‘space’ mathml (a different language)
XML Schemas
A specification for defining XML validation rules
   – Specs: http://www.w3.org/XML/Schema
   – Best-practice: http://www.xfront.com/BestPracticesHomepage.html
Uses pure XML (plus namespaces) to do this
More powerful than DTDs - can specify things like integer types, date strings, real numbers in a given range, etc.
Often used for type validation, or for relating database schemas to XML models
They don’t, however, let you declare entities -- that can only be done in DTDs
The following slide shows the XML schema equivalent to our DTD
XML Schema version of our DTD (Portion)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="accountID" type="xs:string"/>
  <xs:element name="acknowledgeReceipt" type="xs:string"/>
  <xs:complexType name="amountType">
    <xs:simpleContent>
      <xs:restriction base="xs:string">
        <xs:attribute name="currency" use="required">
          <xs:simpleType>
            <xs:restriction base="xs:NMTOKEN">
              <xs:enumeration value="USD"/>
              . . . (some stuff omitted) . . .
            </xs:restriction>
          </xs:simpleType>
        </xs:attribute>
      </xs:restriction>
    </xs:simpleContent>
  </xs:complexType>
  <xs:complexType name="fromType">
    <xs:sequence>
      <xs:element name="amount" type="amountType"/>
      <xs:element ref="transitID" minOccurs="0"/>
      <xs:element ref="accountID"/>
      <xs:element ref="acknowledgeReceipt"/>
    </xs:sequence>
    . . . And still more !!! . . .
3. XML processing
   – Parsers and parser interfaces; XML processing tools
XML Software
XML parsers
   – Read in XML data, check for syntactic (and possibly DTD/Schema) constraints, and make data available to an application. There are four ‘generic’ parser APIs:
      • SAX: Simple API for XML (event-based)
      • DOM: Document Object Model (object/tree based)
      • JDOM: Java Document Object Model (object/tree based)
      • Pull: evolving API (new) (pull-based / object + tree)
   – Lots of XML parsers and interface software available
      • Unix, Linux, Windows 2000/XP, z/OS, etc.
   – SAX-based parsers are fast (often as fast as you can stream data)
   – DOM is slower and more memory intensive (creates an in-memory version of the entire document)
   – Validating can be much slower than non-validating
Parser API: SAX
A) SAX: Simple API for XML
   – http://www.megginson.com/SAX/index.html
   – An event-based interface (a push parser API)
   – Parser reports events whenever it sees a tag/attribute/text node/unresolved external entity/other (driven by input stream)
   – Programmer attaches “event handlers” to handle the event
Advantages
   – Simple to use
   – Very fast (not doing very much before you get the tags and data)
   – Low memory footprint (doesn’t read an XML document entirely into memory)
Disadvantages
   – Not doing very much for you -- you have to do everything yourself
   – Not useful if you have to dynamically modify the document once it’s in memory (since you’ll have to do all the work to put it in memory yourself!)
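The event-handler style looks like this in practice. This is a minimal sketch assuming Python's standard xml.sax module; the class name OrderHandler and the second order ref are hypothetical.

```python
import xml.sax

class OrderHandler(xml.sax.ContentHandler):
    """Collect each order's ref and quantity as the parser reports events."""

    def __init__(self):
        super().__init__()
        self.orders = []
        self._in_quantity = False

    def startElement(self, name, attrs):
        if name == "order":
            self.orders.append({"ref": attrs.get("ref"), "quantity": ""})
        elif name == "quantity":
            self._in_quantity = True

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_quantity:
            self.orders[-1]["quantity"] += content

    def endElement(self, name):
        if name == "quantity":
            self._in_quantity = False

DATA = b"""<partorders>
  <order ref="x23-2112-2342"><quantity units="gross"> 12 </quantity></order>
  <order ref="x23-2112-2343"><quantity units="each"> 3 </quantity></order>
</partorders>"""

handler = OrderHandler()
xml.sax.parseString(DATA, handler)
orders = [(o["ref"], o["quantity"].strip()) for o in handler.orders]
print(orders)
```

The handler has to keep its own state (here, the _in_quantity flag), which is the "you have to do everything yourself" cost listed under disadvantages.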
Parser API: DOM
B) DOM: Document Object Model
   – http://www.w3.org/DOM/
   – An object-based interface
   – Parser generates an in-memory tree corresponding to the document
   – DOM interface defines methods for accessing and modifying the tree
Advantages
   – Very useful for dynamic modification of, and access to, the tree
   – Useful for querying (i.e. looking for data) that depends on the tree structure, e.g. element.childNodes.item(2).getAttribute("ref")
   – Same interface for many programming languages (C++, Java, ...)
Disadvantages
   – Can be slow (needs to produce the tree), and may need lots of memory
   – DOM programming interface is a bit awkward, not terribly object oriented
DOM Parser Processing Model
[Figure: XML data → parser → DOM Document “object” (in-memory tree: partorders; order, order; desc, part, quantity, delivery-date; text) → parser interface → application]
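Reading and modifying the in-memory tree can be sketched as follows, assuming Python's standard xml.dom.minidom, a lightweight implementation of the DOM interface. The status element and the new values are hypothetical.

```python
from xml.dom import minidom

DATA = """<partorders>
  <order ref="x23-2112-2342">
    <desc>Gold sprockel grommets</desc>
    <quantity units="gross">12</quantity>
  </order>
</partorders>"""

doc = minidom.parseString(DATA)  # the whole document is now in memory
order = doc.getElementsByTagName("order")[0]
quantity = order.getElementsByTagName("quantity")[0]

# Read from the tree ...
units = quantity.getAttribute("units")
amount = quantity.firstChild.data

# ... and modify it in place.
quantity.setAttribute("units", "each")
quantity.firstChild.data = "1728"
status = doc.createElement("status")
status.appendChild(doc.createTextNode("confirmed"))
order.appendChild(status)

print(units, amount)
print(order.toxml())
```

Even the text inside an element is a separate node (firstChild.data), which is part of why the DOM interface can feel awkward.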
Parser API: JDOM
B2) JDOM: Java Document Object Model
   – http://www.jdom.org
   – A Java-specific object-oriented interface
   – Parser generates an in-memory tree corresponding to the document
   – JDOM interface has methods for accessing and modifying the tree
Advantages
   – Very useful for dynamic modification of the tree
   – Useful for querying (i.e. looking for data) that depends on the tree structure
   – Much nicer object-oriented programming interface than DOM
Disadvantages
   – Can be slow (make that tree...), and can take up lots of memory
   – New, and not entirely cooked (but close)
   – Only works with Java
Parser API: Pull
C) Pull Interfaces
   – http://www.xmlpull.org/ (Java); there is also a .NET pull API
   – A pull-parser interface
   – API uses expressions / methods to ‘pull’ specific chunks of XML data, or to iterate over the XML
   – Can be built on top of a DOM model
Advantages
   – Easier to write applications that need to read in and process XML data (‘easier’ model than a push API, in many cases)
   – Has proven a very popular component in the .NET toolkit
Disadvantages
   – Can be slow if you do lots of iteration over the XML input data
   – No common API across different languages (although xmlpull.org tries to be similar to the .NET API); not yet a ‘real’ standard (still being worked on; not part of most commercial environments)
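The pull style can be sketched with Python's standard xml.dom.pulldom module. This is an assumption for illustration: it is neither the xmlpull.org API nor the .NET one, but it follows the same idea of the application asking for the next chunk. The second order ref is hypothetical.

```python
from xml.dom import pulldom

DATA = """<partorders>
  <order ref="x23-2112-2342"><quantity units="gross">12</quantity></order>
  <order ref="x23-2112-2343"><quantity units="each">3</quantity></order>
</partorders>"""

refs_by_units = {}
events = pulldom.parseString(DATA)
for event, node in events:  # the application drives the loop
    if event == pulldom.START_ELEMENT and node.tagName == "order":
        events.expandNode(node)  # pull in just this subtree as DOM nodes
        quantity = node.getElementsByTagName("quantity")[0]
        refs_by_units[quantity.getAttribute("units")] = node.getAttribute("ref")

print(refs_by_units)
```

Unlike the push model, no handler state is needed: the code simply asks for an order, expands it, and moves on to the next one.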
XML Processing: XSLT
D) XSLT: eXtensible Stylesheet Language -- Transformations
   – http://www.w3.org/TR/xslt
   – An XML language for processing/transforming XML
   – Does tree transformations -- takes XML and an XSLT style sheet as input, and produces a new XML document with a different structure
Advantages
   – Very useful for tree transformations -- much easier than DOM or SAX for this purpose
   – Can be used to query a document (XSLT pulls out the part you want)
Disadvantages
   – Can be slow for large documents or stylesheets
   – Can be difficult to debug stylesheets (poor error detection; much better if you use schemas)
XSLT processing model
D) Processing model
[Figure: “XML data in” and “XSLT style sheet in” each pass through an XML parser (with optional schema), giving document “objects” for data and style sheet; the XSLT processor combines them and writes “data out (XML)”, a new tree with a different structure]
XML Processing Toolkits
Lots of them …
   – Java: JAXP ( http://java.sun.com/xml/jaxp/faq.html ), dom4j ( http://www.dom4j.org )
   – .NET (part of the .NET framework)
   – … others …
Provide DOM, SAX, (JDOM) interfaces, plus lots of other useful tools in a standardized way (loading parsers, performing XSLT transformations, etc.)
JAXP is standard Java, and thus integrated with Websphere
4. XML databases
   – High-level issues, and references
XML and databases
So where do you stick XML data?
   – Inside a database!?!
   – But how to do this, and which database type to use: RDBMS, ORDBMS, ODB, XML??
How you do so depends on the use cases you have for the data. Some good-to-ask questions are:
   – Am I talking about storing documents, or data?
      • Is the XML format integral to the application (e.g. XHTML, DocBook)?
   – How will the database be queried?
      • Queried by XML structure, or by standard SQL?
      • What ‘parts’ of the document need to be queried?
      • Do I need a text index?
   – How will the data be used/retrieved?
      • Passed to XML processing tools (e.g. XSLT), or used at ‘atomic’ simple type level?
The answers drive out
   – What database to choose, how to map XML to tables (O-R or table mappings), store as BLOB or broken up …
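The "store whole or broken up" choice can be made concrete with a small sketch, assuming Python's standard sqlite3 and xml.etree.ElementTree modules and an in-memory database. The table and column names are hypothetical, and a real mapping would depend on the answers to the questions above.

```python
import sqlite3
import xml.etree.ElementTree as ET

ORDER_XML = (
    '<order ref="x23-2112-2342">'
    '<part number="23-23221-a12"/>'
    '<quantity units="gross">12</quantity>'
    "</order>"
)

db = sqlite3.connect(":memory:")
# Option 1: keep the document whole (hypothetical table and column names).
db.execute("CREATE TABLE order_docs (ref TEXT PRIMARY KEY, body TEXT)")
# Option 2: break it up into typed columns that plain SQL can query.
db.execute(
    "CREATE TABLE order_lines "
    "(ref TEXT, part_number TEXT, quantity INTEGER, units TEXT)"
)

order = ET.fromstring(ORDER_XML)
quantity = order.find("quantity")
db.execute("INSERT INTO order_docs VALUES (?, ?)", (order.get("ref"), ORDER_XML))
db.execute(
    "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
    (order.get("ref"), order.find("part").get("number"),
     int(quantity.text), quantity.get("units")),
)

# Whole-document storage hands the XML back intact for tools such as XSLT;
# the broken-up form answers value queries directly in SQL.
stored_body = db.execute("SELECT body FROM order_docs").fetchone()[0]
total = db.execute(
    "SELECT SUM(quantity) FROM order_lines WHERE units = 'gross'"
).fetchone()[0]
print(stored_body == ORDER_XML, total)
```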
XML and databases
Upcoming technologies
   – XML Query: a query language for querying XML datasets (and databases)
      • Uses XML schema for type casting, and validation
      • Info: http://www.w3.org/XML/Query
Useful XML Database references
   – http://www.xml.com/pub/a/2001/10/31/nativexmldb.html (introductory article)
   – http://www.rpbourret.com/xml/XMLAndDatabases.htm (XML and databases)
   – http://www.rpbourret.com/xml/XMLDatabaseProds.htm (products list)
   – http://www.xmldb.org/resources.html (docs / resource list)
5. XML messaging / web services
   – Why, and some issues/example
XML Messaging
Use XML as the format for sending messages between systems
Advantages
   – Common syntax; self-describing (easier to parse)
   – Can use common/existing transport mechanisms to “move” the XML data (HTTP, HTTPS, SMTP (email), MQ, IIOP (CORBA), JMS, …)
Requirements
   – Shared understanding of dialects for transport (required registry [namespace!] for identifying dialects)
   – Shared acceptance of messaging contract
Disadvantages
   – Asynchronous transport; no guarantee of delivery, no guarantee that partner (external) shares acceptance of contract
   – Messages will be much larger than binary (10x or more) [can compress]
Common messaging model
XML over HTTP
   – Use HTTP to transport XML messages
POST /path/to/interface.pl HTTP/1.1
Referer: http://www.foo.org/myClient.html
User-agent: db-server-olk
Accept-encoding: gzip
Accept-charset: iso-8859-1, utf-8, ucs
Content-type: application/xml; charset=utf-8
Content-length: 13221
. . .

<?xml version="1.0" encoding="utf-8" ?>
<message>
  . . . Markup in message . . .
</message>
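A request like the one above could be assembled as follows. This is a minimal sketch assuming Python's standard urllib.request module; the endpoint URL is hypothetical (modelled on the slide's example host), the message body is a placeholder, and nothing is actually sent.

```python
import urllib.request

# Hypothetical endpoint, modelled on the request shown on the slide.
URL = "http://www.foo.org/path/to/interface.pl"

body = '<?xml version="1.0" encoding="utf-8"?><message>hello</message>'.encode("utf-8")
request = urllib.request.Request(
    URL,
    data=body,
    method="POST",
    headers={
        "Content-Type": "application/xml; charset=utf-8",
        "Accept-Charset": "iso-8859-1, utf-8",
    },
)

# Nothing is sent here; urllib.request.urlopen(request) would perform the
# POST and add the Content-Length header itself.
print(request.get_method(), request.get_header("Content-type"), len(request.data))
```

The charset in the Content-Type header and the encoding in the XML declaration should agree, since the receiver may consult either.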
Some standards for message format
Define dialects designed to “wrap” remote invocation messages
XML-RPC http://www.xmlrpc.com
   – Very simple way of encoding function/method call name, and passed parameters, in an XML message
SOAP (Simple Object Access Protocol) http://www.soapware.org
   – More complex wrapper, which lets you specify schemas for interfaces; more complex rules for handling/proxying messages, etc.
   – This is a core component of Microsoft’s .NET strategy, and is integrated into more recent versions of Websphere and other commercial packages
   – The W3C activity that sets the SOAP spec is outlined at: http://www.w3.org/2000/xp/Group/
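To show how simple the XML-RPC encoding is, here is a minimal sketch assuming Python's standard xmlrpc.client module. The method name placeOrder and its parameters are hypothetical, and no network call is made.

```python
import xmlrpc.client

# "placeOrder" and its parameters are hypothetical, for illustration only.
payload = xmlrpc.client.dumps(
    ("23-23221-a12", 12, "gross"), methodname="placeOrder"
)
print(payload)  # an XML <methodCall> document

# The receiving side recovers the method name and parameters from the XML.
params, method = xmlrpc.client.loads(payload)
```

SOAP wraps a call in a similar way but adds an envelope, optional headers and schema-based typing, which is where the extra complexity comes from.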
XML Messaging + Processing
[Figure: a Factory and several Suppliers exchange messages. Place order (XML/edi) using SOAP over HTTP; response (XML/edi) using SOAP over HTTP. Each side shows a stack of labels: Application, SOAP API, SOAP interface, XML/EDI, SOAP, Transport (HTTP(S), SMTP, other ...)]
   • XML as a universal format for data exchange
Web “Services” Model
SOAP plus higher-level modeling for how services are ‘advertised’, ‘exposed’ and ‘found’
   – Uses an XML dialect, WSDL (Web Services Description Language), to define a service
      • WSDL can use XML Schema to define how data is passed between a service provider and requestor
   – Uses an XML dialect, UDDI (Universal Description, Discovery and Integration), for
      • Describing services (high-level)
      • Discovering services (registry services, metadata)
      • UDDI defined using XML Schema
   – Core technology for application integration
      • Microsoft .NET
      • IBM Websphere
      • Oracle
      • … many others
Web Services Code Development
[Figure labels: WSDL; XML schema; automated code generator; skeleton; proxy; client code; WS/SOAP proxy; WS/SOAP skeleton; SOAP requests/responses; middle tier code (validation, business logic, routing, logging, more…); adapter; MECH adapter; product system code; “Write the Application!”]
6. Conclusions
XML (and related) Specifications
[Figure: map of XML-related specifications]
   – Groupings: XML Core; APIs; Style; Protocols; Web Services; Application areas
   – Legend: W3C rec; W3C draft; industry std; ‘Open’ std
   – Specifications shown: SAX 1, XML 1.0, XML names, XSLT, DOM 1, XHTML 1.0, XML query, XML schema, SOAP, UDDI, XML-RPC, SAX 2, DOM 2, JDOM, JAXP, WSDL, Xpath, XSL, MathML, SMIL 1 & 2, SVG, Modularized XHTML, XHTML basic, Xforms, Canonical, XML signature, XML base, Xlink, Xpointer, Infoset, RDF, Xfragment, XHTML events, FinXML, dirXML, DOM 3, CSS 1, CSS 2, CSS 3, IFX, FpML, ebXML, Biztalk, WDDX, XMI, and 100's more ...
The End.