XML Study-Session: Part III

Parsing XML Documents

Objectives

By completing this study-session, you should be able to:

• Use the IBM XML4J Java XML parser.

• Work with the Document Object Model (DOM).

• Create a parsing application to display, navigate, and modify an XML document.

What is parsing?

Parsing is the interpretation of text. The XML parser’s job is to load the document, check that it follows all necessary rules (at minimum, those for well-formedness), and build a document tree structure that can be passed on to the application.

The application is any program (e.g. browser, reader, middleware) that acts upon the tree structure, processing the data it contains.
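The load, check, and build steps can be sketched in a few lines of Java. This is only an illustration under one assumption: it uses the JAXP parser bundled with the JDK (DocumentBuilderFactory) rather than XML4J, so that it runs without extra jar files. The class and method names (WellFormedCheck, parseOrNull) are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; uses the JDK's JAXP parser, not XML4J.
class WellFormedCheck {

    /** Parses the text and returns the document tree, or null if it is not well-formed. */
    static Document parseOrNull(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Load the text, check it, and build the tree in one call.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Thrown when a well-formedness rule is violated.
            return null;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("parser setup or input failure", e);
        }
    }

    public static void main(String[] args) {
        Document tree = parseOrNull("<Pets><Pet>Rover</Pet></Pets>");
        // The application receives the tree and starts from its root element.
        System.out.println(tree.getDocumentElement().getNodeName());
        // The closing tag for <Pet> is missing, so no tree is built.
        System.out.println(parseOrNull("<Pets><Pet>Rover</Pets>") == null);
    }
}
```

For the malformed input, the JDK parser may also write an error message to standard error before the exception is caught.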

Overview of XML parsing

[Figure: an XML document is read by the XML parser, which passes packets of parsed XML data to an application that manipulates the XML data. The parser and that application together form the XML application.]

Fig. 1 (from “Building XML Applications,” St. Laurent and Cerami)

Every XML application includes at least two pieces: an XML parser and an application to manipulate the parsed XML data.

Types of parsers

Validating vs. non-validating:

• A validating parser checks a document against a declared DTD. A non-validating parser checks only that the document is well-formed.
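A rough sketch of validation against a DTD follows. It assumes the JDK's JAXP parser instead of XML4J, and the small inline DTD and the class name ValidatingCheck are invented for illustration. Validity problems are reported to an error handler rather than stopping the parse, so the example collects them in a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical example class; uses the JDK's JAXP parser, not XML4J.
class ValidatingCheck {

    // Illustrative DTD (an assumption): every Pet must contain exactly one Name.
    static final String DTD =
          "<!DOCTYPE Pets [\n"
        + "  <!ELEMENT Pets (Pet*)>\n"
        + "  <!ELEMENT Pet (Name)>\n"
        + "  <!ELEMENT Name (#PCDATA)>\n"
        + "]>\n";

    /** Returns the validity errors found; an empty list means the document is valid. */
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // check the document against its DTD
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate(DTD + "<Pets><Pet><Name>Rover</Name></Pet></Pets>"));
        // <Age> is not declared in the DTD, so errors are reported.
        System.out.println(validate(DTD + "<Pets><Pet><Age>3</Age></Pet></Pets>"));
    }
}
```

Both documents are well-formed, so a non-validating parser would accept either one; only the validating parser flags the second.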

Tree-based vs. event-driven interface:

• A parser with a tree-based interface reads the entire document and creates an internal tree representation of the data, which the application can then traverse. A standardized API for this interface is the W3C DOM.

• In the event-driven model, the parser reads through the document and signals each significant parsing event (e.g. start of document, start of element, end of element). Callback methods are used to handle these events as they occur. This approach is used by the Simple API for XML (SAX).
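The event-driven model can be illustrated with a handler that simply records each callback as it fires. This sketch assumes the SAX parser bundled with the JDK and its DefaultHandler helper class (part of SAX 2), not the XML4J distribution; the class name SaxEvents is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; uses the JDK's SAX parser, not XML4J.
class SaxEvents {

    /** Parses the text and returns the parsing events in the order they occurred. */
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        // Each overridden method is a callback the parser invokes during the read.
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("startElement " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("endElement " + qName);
            }
            @Override public void endDocument() { log.add("endDocument"); }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String event : events("<Pets><Pet/></Pets>")) {
            System.out.println(event);
        }
    }
}
```

No tree is built here: the application sees each event once, in document order, and must keep any state it needs itself.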

The IBM XML4J parser

An open-source Java parser developed by IBM and now available as part of the xml.apache.org project under the codename Xerces.

Version 3.1.1 of the API supports DOM Level 1 and SAX Level 1.

It can be downloaded as a .zip file from www.alphaworks.ibm.com/tech/xml4j.

It is ideal for standalone Java applications and for working with Java servlets.

Setting up your environment

To use the classes in XML4J, you must set your Java CLASSPATH variable so that Java can locate the xerces.jar and xercesSamples.jar files.

To set the classpath in JCreator:

Configure -> Options -> JDK Profiles -> select JDK version -> Edit -> Add Package -> add d:/xml4j/xerces.jar and d:/xml4j/xercesSamples.jar
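One way to confirm that the classpath is set correctly is to ask Java to load the parser class by name. The following is a small sketch; the class name ClasspathCheck is made up, and the parser class name is the one given for XML4J in this session.

```java
// Hypothetical helper for checking the classpath setup.
class ClasspathCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String parserClass = "org.apache.xerces.parsers.DOMParser";
        if (isOnClasspath(parserClass)) {
            System.out.println(parserClass + " found");
        } else {
            System.out.println(parserClass + " not found; check that xerces.jar is on the CLASSPATH");
        }
    }
}
```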

To run/execute project with command-line arguments:

Project -> Project Settings -> JDK Tools -> Select tool type: Run Application -> select <Default> -> Edit -> Parameters -> set “Prompt for main function argument” checkbox to ‘True’.

Understanding DOM

The W3C DOM specifies an interface for treating a document as a tree of nodes.

A Node object, as implemented in the Java DOM, has methods such as getChildNodes(), getNextSibling(), getParentNode(), getNodeType(), etc.

Possible node types in DOM include: Element, Attribute, Comment, Text, CDATA section, Entity reference, Entity, Processing Instruction, Document, Document type, Document fragment, and Notation.
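The Node methods and node types listed above can be tried out on a small document. This is a sketch that assumes the JDK's built-in JAXP parser rather than XML4J; the class NodeTour, its helpers, and the sample XML string are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; uses the JDK's JAXP parser, not XML4J.
class NodeTour {

    /** Maps the numeric node type to a readable name (only some types are covered). */
    static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:                return "Element";
            case Node.ATTRIBUTE_NODE:              return "Attribute";
            case Node.TEXT_NODE:                   return "Text";
            case Node.CDATA_SECTION_NODE:          return "CDATA section";
            case Node.COMMENT_NODE:                return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "Processing Instruction";
            case Node.DOCUMENT_NODE:               return "Document";
            default:                               return "Other";
        }
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<Pet><!-- note --><Name>Rover</Name><Age>3</Age></Pet>");
        Node pet = doc.getDocumentElement();

        // getChildNodes() returns every direct child, whatever its type.
        NodeList children = pet.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            System.out.println(typeName(child) + ": " + child.getNodeName());
        }

        // Sibling and parent links allow movement without an index.
        Node name = pet.getFirstChild().getNextSibling();
        System.out.println(name.getNodeName() + " has parent "
                + name.getParentNode().getNodeName());
    }
}
```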

Example: (petfile.xml)

<?xml version='1.0' encoding='UTF-8'?>
<Pets>
  <Pet ID='001' Registered='030801'>
    <Name>Rover</Name>
    <Age>3</Age>
    <Description Species='Dog'>Yellow colored Golden Retriever</Description>
  </Pet>
  <Pet ID='002' Registered='101100'>
    <Name>Ella</Name>
    <Age>1</Age>
    <Description Species='Tortoise'>Green and black shelled pond crawler</Description>
  </Pet>
</Pets>

Example DOM structure

Pets
  Pet
    ID: 001
    Registered: 030801
    Name: Rover
    Age: 3
    Description: Yellow colored Golden Retriever
      Species: Dog
  Pet
    (same structure as the first Pet)

Understanding DOM (contd.)

In XML4J, the interfaces that make up the W3C DOM are in the org.w3c.dom package, and the DOM parser is the DOMParser class in the org.apache.xerces.parsers package.

High-level constructs such as Element and Attribute (the Attr interface) in DOM extend the Node interface. So, for instance, an Attr object has its own methods such as getName() and getValue() and also inherits Node methods such as getNodeName().
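The relationship between Attr and Node can be seen by calling both sets of methods on the same attribute. This is a sketch assuming the JDK's JAXP parser rather than XML4J; the class AttrDemo and its helper methods are made up, and the sample element is taken from the pet example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class; uses the JDK's JAXP parser, not XML4J.
class AttrDemo {

    /** Shows the Attr-specific accessors next to the ones inherited from Node. */
    static String describe(Attr attr) {
        return attr.getName() + "=" + attr.getValue()
                + " (nodeName=" + attr.getNodeName()
                + ", nodeValue=" + attr.getNodeValue() + ")";
    }

    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element pet = parseRoot("<Pet ID='001' Registered='030801'/>");
        Attr id = pet.getAttributeNode("ID");
        System.out.println(describe(id)); // prints "ID=001 (nodeName=ID, nodeValue=001)"
    }
}
```

For an attribute, getName() and getNodeName() return the same string, as do getValue() and getNodeValue(); the Node versions exist so that generic tree-walking code can treat every node uniformly.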

Complete API documentation can be found online at http://xml.apache.org/apiDocs/index.html.

Creating a parser

From the XML Reference page, download and view the FirstParser.java sample code.

This program will parse an XML document (“customer.xml”, passed as a command-line argument) and display how many times a given element occurs in it (in this case, the number of <Customer> elements).
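For orientation before opening the sample, here is a minimal sketch of the same idea. It is not the FirstParser.java code: it assumes the JDK's JAXP parser instead of XML4J's DOMParser, and the class name CountElements and the second command-line argument are inventions of this example.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class; not the FirstParser.java sample.
class CountElements {

    /** Counts the elements with the given tag name anywhere in the document. */
    static int count(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java CountElements <file.xml> <tagName>");
            return;
        }
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new File(args[0]));
        System.out.println(args[0] + " contains " + count(doc, args[1])
                + " <" + args[1] + "> elements");
    }
}
```

It could be run as, for example, java CountElements customer.xml Customer.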

Displaying a document

From the XML Reference page, download and view the IndentingParser.java sample code.

This program will parse and display an entire XML document (passed as a command-line argument) with proper indentation.

Separate handler methods are used to handle the document (i.e. root) node, element nodes, attributes, CDATA sections, text nodes, and Processing Instruction nodes.
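The handler-per-node-type design can be sketched as a switch on getNodeType() that dispatches to one method per kind of node. This is not the IndentingParser.java code; the class IndentingPrinter and its method names are made up, and the sketch does not escape special characters in text or attribute values.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical example class; not the IndentingParser.java sample.
class IndentingPrinter {
    private final StringBuilder out = new StringBuilder();

    /** Returns the document as text, two spaces of indentation per tree level. */
    static String print(Document doc) {
        IndentingPrinter printer = new IndentingPrinter();
        printer.handle(doc, 0);
        return printer.out.toString();
    }

    // Dispatches each node to the handler for its type.
    private void handle(Node node, int depth) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: handleChildren(node, depth); break;
            case Node.ELEMENT_NODE:  handleElement((Element) node, depth); break;
            case Node.TEXT_NODE:     handleText(node, depth); break;
            case Node.CDATA_SECTION_NODE:
                line(depth, "<![CDATA[" + node.getNodeValue() + "]]>"); break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                line(depth, "<?" + node.getNodeName() + " " + node.getNodeValue() + "?>"); break;
            default: break; // other node types are skipped in this sketch
        }
    }

    private void handleChildren(Node parent, int depth) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            handle(c, depth);
        }
    }

    // Attributes are written as part of the element's start tag.
    private void handleElement(Element el, int depth) {
        StringBuilder tag = new StringBuilder("<" + el.getTagName());
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            tag.append(' ').append(a.getNodeName())
               .append("=\"").append(a.getNodeValue()).append('"');
        }
        line(depth, tag + ">");
        handleChildren(el, depth + 1);
        line(depth, "</" + el.getTagName() + ">");
    }

    // Whitespace-only text (such as source indentation) is dropped.
    private void handleText(Node text, int depth) {
        String value = text.getNodeValue().trim();
        if (!value.isEmpty()) {
            line(depth, value);
        }
    }

    private void line(int depth, String text) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(text).append('\n');
    }
}
```

Given a parsed Document doc, System.out.print(IndentingPrinter.print(doc)) writes the indented form to the console.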

Navigating a document

From the XML Reference page, download and view the nav.java sample code.

This program will parse the “meetings.xml” document and navigate the tree structure to locate the name of the third person.

Note that the XML4J parser treats the whitespace used for indentation in the XML document as text nodes. We can set the parser to ignore this whitespace by calling the parser method setIncludeIgnorableWhitespace with the value ‘false’.
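The effect of whitespace text nodes on navigation can be seen in a short sketch. It is not the nav.java code: it assumes the JDK's JAXP parser, and the document layout below (Meetings, Meeting, Person) is a hypothetical stand-in for “meetings.xml”. Instead of changing a parser setting, this version skips non-element nodes while walking the siblings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class; not the nav.java sample.
class ThirdPerson {

    // Assumed layout, for illustration only; the real meetings.xml may differ.
    static final String MEETINGS =
          "<Meetings>\n"
        + "  <Meeting>\n"
        + "    <Person>Ann</Person>\n"
        + "    <Person>Bob</Person>\n"
        + "    <Person>Cy</Person>\n"
        + "  </Meeting>\n"
        + "</Meetings>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the n-th child element (1-based), ignoring text and other nodes. */
    static Element nthChildElement(Node parent, int n) {
        int seen = 0;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE && ++seen == n) {
                return (Element) c;
            }
        }
        return null; // fewer than n child elements
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(MEETINGS);
        Element meeting = nthChildElement(doc.getDocumentElement(), 1);
        // The indentation between the Person elements counts as text nodes.
        System.out.println("child nodes: " + meeting.getChildNodes().getLength());
        Element third = nthChildElement(meeting, 3);
        System.out.println("third person: " + third.getFirstChild().getNodeValue());
    }
}
```

Counting by index with getChildNodes().item(2) would land on a Person element only if the whitespace nodes were absent, which is why the note above matters.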

Modifying a document

From the XML Reference page, download and view the XMLWriter.java sample code.

This program will parse an XML document (“customer.xml”, passed as a command-line argument) and modify it by adding a new <Middle_Name>XML</Middle_Name> element to every customer.

The modified document tree is then written to a new file with the name “customer2.xml”.
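A rough sketch of the modify-and-write steps follows. It is not the XMLWriter.java code: it assumes the JDK's JAXP parser and its Transformer class for writing the tree back out, and it appends the new element as the last child of each <Customer>, since the layout of “customer.xml” is not shown here. The class name AddMiddleName is made up.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical example class; not the XMLWriter.java sample.
class AddMiddleName {

    /** Appends <Middle_Name>XML</Middle_Name> to every Customer; returns how many were changed. */
    static int addMiddleNames(Document doc) {
        NodeList customers = doc.getElementsByTagName("Customer");
        for (int i = 0; i < customers.getLength(); i++) {
            // New nodes must be created by the document that will own them.
            Element middle = doc.createElement("Middle_Name");
            middle.appendChild(doc.createTextNode("XML"));
            customers.item(i).appendChild(middle);
        }
        return customers.getLength();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java AddMiddleName <customer.xml>");
            return;
        }
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new File(args[0]));
        addMiddleNames(doc);

        // An identity transform copies the DOM tree to the output file unchanged.
        Transformer writer = TransformerFactory.newInstance().newTransformer();
        writer.transform(new DOMSource(doc), new StreamResult(new File("customer2.xml")));
    }
}
```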

Next session:

Presenting XML Documents

• Stylesheets

• Writing your own XSL applications