apachecon 2000 everything you ever wanted to know about xml parsing
DESCRIPTION
TRANSCRIPT
![Page 2: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/2.jpg)
Thank Youn IBMn ASFn Xerces-Dev
![Page 3: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/3.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 4: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/4.jpg)
Overviewn Focus on server side XMLn Not document processingn Benefits to servers / e-business
n Model-View separation for *MLn Ubiquitous format for data exchange
n Hope for network effects from data availability
![Page 5: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/5.jpg)
xml.apache.orgn Xerces XML parser
n Javan C++n Perl
n Xalan XSLT processorn Javan C++ (soon)
n FOPn Cocoon
![Page 6: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/6.jpg)
Xerces-Jn Why Java?
n Unicoden Code motionn Server side cross platform support
n C++ version has lagged Java version
![Page 7: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/7.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 8: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/8.jpg)
Basic XML Conceptsn Well formednessn Validity
n DTD’sn Schemas
n Entities
![Page 9: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/9.jpg)
Example: RSS<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS .91//EN""http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
<channel><title>freshmeat.net</title><link>http://freshmeat.net</link><description>the one-stop-shop for all your Linux software needs</description><language>en</language><rating>(PICS-1.1 "http://www.classify.org/safesurf/" 1 r (SS~~000 1))</rating><copyright>Copyright 1999, Freshmeat.net</copyright><pubDate>Thu, 23 Aug 1999 07:00:00 GMT</pubDate><lastBuildDate>Thu, 23 Aug 1999 16:20:26 GMT</lastBuildDate><docs>http://www.blahblah.org/fm.cdf</docs>
![Page 10: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/10.jpg)
RSS 2<image><title>freshmeat.net</title><url>http://freshmeat.net/images/fm.mini.jpg</url><link>http://freshmeat.net</link><width>88</width><height>31</height><description>This is the Freshmeat image stupid</description></image>
<item><title>kdbg 1.0beta2</title><link>http://www.freshmeat.net/news/1999/08/23/935449823.html</link><description>This is the Freshmeat image stupid</description></item>
<item><title>HTML-Tree 1.7</title><link>http://www.freshmeat.net/news/1999/08/23/935449856.html</link><description>This is the Freshmeat image stupid</description></item>
![Page 11: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/11.jpg)
RSS 3<textinput><title>quick finder</title><description>Use the text input below to search freshmeat</description><name>query</name><link>http://core.freshmeat.net/search.php3</link></textinput>
<skipHours><hour>2</hour></skipHours>
<skipDays><day>1</day></skipDays>
</channel></rss>
![Page 12: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/12.jpg)
RSS DTD<!ELEMENT rss (channel)><!ATTLIST rss
version CDATA #REQUIRED> <!-- must be "0.91"> -->
<!ELEMENT channel (title | description | link | language | item+ | rating? | image? | textinput? | copyright? | pubDate? | lastBuildDate? | docs? |managingEditor? | webMaster? | skipHours? | skipDays?)*><!ELEMENT title (#PCDATA)><!ELEMENT description (#PCDATA)><!ELEMENT link (#PCDATA)><!ELEMENT language (#PCDATA)><!ELEMENT rating (#PCDATA)><!ELEMENT copyright (#PCDATA)><!ELEMENT pubDate (#PCDATA)><!ELEMENT lastBuildDate (#PCDATA)><!ELEMENT docs (#PCDATA)><!ELEMENT managingEditor (#PCDATA)><!ELEMENT webMaster (#PCDATA)><!ELEMENT hour (#PCDATA)><!ELEMENT day (#PCDATA)><!ELEMENT skipHours (hour+)><!ELEMENT skipDays (day+)>
![Page 13: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/13.jpg)
RSS DTD 2<!ELEMENT item (title | link | description)*><!ELEMENT textinput (title | description | name | link)*><!ELEMENT name (#PCDATA)>
<!ELEMENT image (title | url | link | width? | height? | description?)*><!ELEMENT url (#PCDATA)><!ELEMENT width (#PCDATA)><!ELEMENT height (#PCDATA)>
![Page 14: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/14.jpg)
XML Application Tasksn Convert XML into application objectn Convert application object to XMLn Read or write XML into a databasen XSLT conversion to XHTML or HTML
![Page 15: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/15.jpg)
RSS ObjectsLogical View
RSSChannel
+getTitle()+getDescription()+getLink()+getItems()+getImage()+getTextInput()
RSSItem
+getTitle()+getDescription()+getLink()
RSS Image
+getTitle()+getDescription()+getLink()+getURL()+getHeight()+getWidth()
RSSTextInput
+getTitle()+getDescription()+getLInk()+getName()
![Page 16: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/16.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 17: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/17.jpg)
Parser API’sn What is the job of a parser API?
n Make the structure and contents of a document available
n Report errors
![Page 18: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/18.jpg)
SAX APIn Event callback stylen Representation of elements & atts
n must build own stack
n Development modeln Reasonably openn xml-devn http://www.megginson.com/SAX/SAX2
![Page 19: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/19.jpg)
Handler Strategiesn Single Monolithic
n Requires the handler to know entirely about the grammar
n One per application object and multiplexern Each application object has its own
DocumentHandlern The Multiplexer handler is registered with the
parsern Multiplexer is responsible for swapping SAX
handlers as the context changes.
![Page 20: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/20.jpg)
SAX 1 RSS Handlerpublic void startElement(String name,AttributeList atts) throws SAXException {
currentText = new StringBuffer();textStack.push(currentText);if (name.equals("rss") {
elementStack.push(name);} else if (name.equals("channel")) {
elementStack.push(name);currentChannel = new RSSChannel();currentItems = new Vector();currentChannel.setItems(currentItems);currentImage = new RSSImage();currentChannel.setImage(currentImage);currentTextInput = new RSSTextInput();currentChannel.setTextInput(currentTextInput);currentChannel.setSkipHours(new Vector());currentChannel.setSkipDays(new Vector());
} else if (name.equals("item")) {elementStack.push(name);currentItem = new RSSItem();
} else if (name.equals("image")) {elementStack.push(name);
} else if (name.equals("textinput")) {elementStack.push(name);
} else if (name.equals("skipHours")) {elementStack.push(name);
} else if (name.equals("skipDays")) {elementStack.push(name);
} else {}
}
![Page 21: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/21.jpg)
SAX 1 RSS Handler 2public void endElement(String name) throws SAXException {
String stackValue = (String) elementStack.peek();String text = ((StringBuffer)textStack.pop()).toString();if (name.equals("rss")) {
elementStack.pop();} else if (name.equals("channel")) {
elementStack.pop();} else if (name.equals("title")) {
if (stackValue.equals("channel")) {currentChannel.setTitle(text);
} else if (stackValue.equals("image")) {currentImage.setTitle(text);
} else if (stackValue.equals("item")) {currentItem.setTitle(text);
} else if (stackValue.equals("textinput")) {currentTextInput.setTitle(text);
} else {}
...
![Page 22: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/22.jpg)
SAX 1 RSS Handler 3} else if (name.equals("description")) {
if (stackValue.equals("channel")) {currentChannel.setDescription(text);
} else if (stackValue.equals("image")) {currentImage.setDescription(text);
} else if (stackValue.equals("item")) {currentItem.setDescription(text);
} else if (stackValue.equals("textinput")) {currentTextInput.setDescription(text);
} } else if (name.equals("language")) {
currentChannel.setLanguage(text);} else if (name.equals("item")) {
currentItems.add(currentItem);elementStack.pop();
} else if (name.equals("textinput")) {currentChannel.setTextInput(currentTextInput);elementStack.pop();
}}
![Page 23: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/23.jpg)
SAX1 RSS Handler 4public void characters(char[] ch,int start,int length) throws SAXException {
currentText.append(ch, start, length);
}
![Page 24: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/24.jpg)
SAX 1 RSS Driverpublic class SAX1RSSParse extends Object {
public static void main(String args[]) {Parser p = new SAXParser();DocumentHandler h = new SAX1RSSHandler();p.setDocumentHandler(h);try {
p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");} catch (SAXException se) {
System.out.println("Error during parsing"+se.getMessage());se.printStackTrace();
} catch (IOException e) {System.out.println("I/O Error during parsing+e.getMessage());e.printStackTrace();
}System.out.println(((SAX1RSSHandler)h).getChannel().toString());
}}
![Page 25: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/25.jpg)
SAX 1 Basic Event Handlingn Basic Event Handling
n attribute handling in startElementn characters() for contentn character data handling in endElement
n Multiple characters calls
![Page 26: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/26.jpg)
SAX 1 RSS ErrorHandlerpublic class SAX1RSSErrorHandler implements ErrorHandler {public void warning(SAXParseException ex) throws SAXException {
System.err.println("[Warning] ”+getLocationString(ex)+":”+ex.getMessage());
}
public void error(SAXParseException ex) throws SAXException {System.err.println("[Error] "+getLocationString(ex)+":”+
ex.getMessage());}
public void fatalError(SAXParseException ex) throwsSAXException {
System.err.println("[Fatal Error]"+getLocationString(ex)+":”+ex.getMessage());
throw ex;}
![Page 27: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/27.jpg)
SAX1 RSS ErrorHandler 2private String getLocationString(SAXParseException ex) {
StringBuffer str = new StringBuffer();
String systemId = ex.getSystemId();
if (systemId != null) {
int index = systemId.lastIndexOf('/');
if (index != -1)
systemId = systemId.substring(index + 1);
str.append(systemId);
}
str.append(':');
str.append(ex.getLineNumber());
str.append(':');
str.append(ex.getColumnNumber());
return str.toString();
}
}
}
![Page 28: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/28.jpg)
ErrorDriverpublic class SAX1RSSParse extends Object {
public static void main(String args[]) {Parser p = new SAXParser();DocumentHandler h = new SAX1RSSHandler();p.setDocumentHandler(h);ErrorHandler eh = new SAX1RSSErrorHandler();p.setErrorHandler(eh);try {
p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");} catch (SAXException se) {
System.out.println("Error during parsing"+se.getMessage());se.printStackTrace();
} catch (IOException e) {System.out.println("I/O Error during parsing+e.getMessage());e.printStackTrace();
}System.out.println(((SAX1RSSHandler)h).getChannel().toString());
}}
![Page 29: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/29.jpg)
SAX 1 Error Handlingn Error Handling
n warningn error – parser may recovern fatalError – XML 1.0 spec errors
n Locators
![Page 30: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/30.jpg)
Entity Handlingn <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS91//EN"
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
n This will do a network fetch!
public class SAX1RSSEntityHandler implements EntityResolver {public SAX1RSSEntityHandler() {}public InputSource resolveEntity(String publicId,String systemId) throws SAXException, IOException {if (publicId.equals("-//Netscape Communications//DTD RSS
91//EN")) {FileReader r = new FileReader("/usr/share/xml/rss91.dtd");return new InputSource(r);
}return null;
}}
![Page 31: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/31.jpg)
EntityDriverpublic class SAX1RSSParse extends Object {
public static void main(String args[]) {Parser p = new SAXParser();DocumentHandler h = new SAX1RSSHandler();p.setDocumentHandler(h);ErrorHandler eh = new SAX1RSSErrorHandler();p.setErrorHandler(eh);EntityResolver r = new SAX1RSSEntityResolver();p.setEntityResolver(r);try {
p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");} catch (SAXException se) {
System.out.println("Error during parsing"+se.getMessage());se.printStackTrace();
} catch (IOException e) {System.out.println("I/O Error during parsing+e.getMessage());e.printStackTrace();
}System.out.println(((SAX1RSSHandler)h).getChannel().toString());
}}
![Page 32: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/32.jpg)
HandlerBase
n The kitchen sinkn Lets you do DocumentHandler,
ErrorHandler, EntityResolver and DTD Handler
n A good way to go if you are only overriding a small number of methods
![Page 33: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/33.jpg)
SAX 2 Beta
n Configurablen DTDn ContentHandler replaces
DocumentHandler (NS Aware)n Attributes replaces AttributeList (NS
Aware)n XMLReader replaces Parsern XMLFiltern DefaultHandler replaces HandlerBase
![Page 34: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/34.jpg)
SAX 2 Betan Optional LexicalHandler
n Entity boundariesn DTD bounariesn CDATA sectionsn Comments
n Optional DeclHandlern ElementDecls and AttributeDeclsn Internal and External Entity Declsn NO parameter entities
n Compatibility Adapters
![Page 35: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/35.jpg)
SAX 2 Configurablen Uses strings (URI’s) as lookup keysn Features - booleans
n validationn namespaces
n Properties - Objectsn Extensibility by returning an object that
implements additional function
![Page 36: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/36.jpg)
Xerces and Configurablen Standard way to set all parser
configuration settings.n Applies to both SAX and DOM Parsers
![Page 37: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/37.jpg)
Xerces Featuresn Inclusion of general entitiesn Inclusion of parameter entitiesn Dynamic validationn Extra warningsn Allow use of java encoding namesn Lazy evaluating DOMn DOM EntityReference creationn Inclusion of igorable whitespace in DOM
![Page 38: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/38.jpg)
Xerces Propertiesn Name of DOM implementation classn DOM node currently being parsed
![Page 39: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/39.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 40: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/40.jpg)
DOM APIn Object representation of elements &
attsn Development Model
n W3C Recommendations and process
n http:///www.w3.org/DOM
![Page 41: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/41.jpg)
DOM ArchitectureNode
+getNodeName()+getNodeValue()+getChildNodes()+getFirstChild()+getLastChild()+getNextSibling()+getPreviousSibling()+insertBefore()+appendChild()+replaceChild()+removeChild()
Document
Element
+getAttribute()+setAttribute()+removeAttribute()+getTagName()+getAttributeNode()+setAttributeNode()+removeAttributeNode()+normalize()
Attr
Text CDataSection
EntityReference Entity
ProcessingInstruction
Comment
DocumentType
NotationCharacterData
NodeList NamedNodeMap
![Page 42: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/42.jpg)
DOM Instance (XML)
<?xml version="1.0" encoding="US-ASCII"?>
<a>
<b>in b</b>
<c>
text in c
<d/>
</c>
</a>
![Page 43: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/43.jpg)
DOM Tree
Document
Elementa
Elementb
Elementc
Elementd
Text[lf]
Text[lf]text in c[lf]
Text[lf]
Text[lf]
Text[lf]
Textin b
![Page 44: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/44.jpg)
DOM RSSpublic static void main(String args[]) {
DOMParser p = new DOMParser();
try {p.parse("file:///Work/ApacheCon/fm0.91_full.rdf");System.out.println(dom2Channel(p.getDocument()).toString());
} catch (SAXException se) {System.out.println("Error during parsing "+se.getMessage());se.printStackTrace();
} catch (IOException e) {System.out.println("I/O Error during parsing “+e.getMessage());e.printStackTrace();
}}
![Page 45: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/45.jpg)
DOM2Channelpublic static RSSChannel dom2Channel(Document d) {
RSSChannel c = new RSSChannel();Element channel = null;NodeList nl = d.getElementsByTagName("channel");channel = (Element) nl.item(0);for (Node n = channel.getFirstChild(); n != null;
n = n.getNextSibling()) {if (n.getNodeType() != Node.ELEMENT_NODE)continue;
Element e = (Element) n;e.normalize();String text = e.getFirstChild().getNodeValue();if (e.getTagName().equals("title")) {c.setTitle(text);
} else if (e.getTagName().equals("description")) {c.setDescription(text);
} else if (e.getTagName().equals("link")) {c.setLink(text);
} else if (e.getTagName().equals("language")) {
...
![Page 46: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/46.jpg)
DOM2Channel 2…
}nl = channel.getElementsByTagName("item");c.setItems(dom2Items(nl));nl = channel.getElementsByTagName("image");c.setImage(dom2Image(nl.item(0)));nl = channel.getElementsByTagName("textinput");c.setTextInput(dom2TextInput(nl.item(0)));nl = channel.getElementsByTagName("skipHours");c.setSkipHours(dom2SkipHours(nl.item(0)));nl = channel.getElementsByTagName("skipDays");c.setSkipDays(dom2SkipDays(nl.item(0)));return c;
}
![Page 47: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/47.jpg)
DOM Level 1n Attr
n API for result as objectsn API for result as strings
n No DTD supportn Need Import - copying between doms is
bad
![Page 48: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/48.jpg)
DOM Level 2n Core / Namespaces
n Import
n Traversaln Eventsn Rangen Stylesheets
![Page 49: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/49.jpg)
Xerces DOM extensionsn Feature controls insertion of ignorable
whitespacen Feature controls creation of
EntityReference Nodesn User data object provided on
implementation class org.apache.xml.dom.NodeImpl
![Page 50: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/50.jpg)
Xerces Lazy DOMn Delays object creation
n a node is accessed - its object is createdn all siblings are also created.
n Reduces time to complete parsing n Reduces memory usagen Good if you only access part of the
DOM
![Page 51: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/51.jpg)
Tricky DOM stuffn item() based access
n use getNextSibling()
n Entity and EntityReference nodes –depends on parser settings
n Ignorable whitespacen Subclassing the DOM
n Use the Document factory class
![Page 52: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/52.jpg)
DOM vs SAXn DOM
n creates a document object treen tree must the be reparsed & converted to
BO’sn Could subclass BO’s from DOM classes
n SAXn event handler based APIn stack managementn no intermediate data structures generated
![Page 53: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/53.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 54: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/54.jpg)
Validationn When to use validationn non-validating != wf checking
n Not required to read external entities
n validation dial settings
![Page 55: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/55.jpg)
Namespacesn Purpose
n Syntactic discrimination onlyn They don’t point to anythingn They don’t imply schema composition/combination
n Universal Namesn URI + Local Name
{http://www.mydomain.com}Elementn Declarations
n xmlns “attribute”
n Prefixesn Scopingn Validation and DTDs
![Page 56: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/56.jpg)
Namespace Example<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:html="http://www.w3.org/TR/xhtml1/strict">
<xsl:template match="/">
<html:html>
<html:head>
<html:title>Title</html:title>
</html:head>
<html:body>
<xsl:apply-templates/>
</html:body>
</xsl:template>
...
</xsl:stylesheet>
![Page 57: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/57.jpg)
Default Namespace Example<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/xhtml1/strict">
<xsl:template match="/">
<html>
<head>
<title>Title</title>
</head>
<body>
<xsl:apply-templates/>
</body>
</xsl:template>
...
</xsl:stylesheet>
![Page 58: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/58.jpg)
SAX2 ContentHandlerpublic class NamespaceDriver extends DefaultHandler implements ContentHandler {
public void startPrefixMapping(String prefix,String uri) throws SAXException {
System.out.println("startMapping "+prefix+", uri = "+uri);
}
public void endPrefixMapping(String prefix) throws SAXException {
System.out.println("endMapping "+prefix);
}
public void startElement(String namespaceURI,String localName,String
rawName,Attributes atts) throws SAXException {
System.out.println("Start: nsURI="+namespaceURI+" localName= "+localName+"rawName = "+rawName);
printAttributes(atts);
}
public void endElement(String namespaceURI,String localName,String rawName) throws SAXException {
System.out.println("End: nsURI="+namespaceURI+" localName= "+localName+"rawName = "+rawName);
}
![Page 59: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/59.jpg)
SAX2 ContentHandler 2public void printAttributes(Attributes atts) {
for (int i = 0; i < atts.getLength(); i++) {
System.out.println("\tAttribute "+i+" URI="+atts.getURI(i)+
" LocalName="+atts.getLocalName(i)+
" Type="+atts.getType(i)+" Value="+atts.getValue(i));
}
}
![Page 60: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/60.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 61: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/61.jpg)
XML Scheman Richer grammar specificationn W3C Working Draftn Structures
n XML Instance Syntaxn Namespacesn Weird content models
n Datatypesn Lots
n Prototype Support in Xerces
![Page 62: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/62.jpg)
RSS Schema<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE schema PUBLIC "-//W3C//DTD XMLSCHEMA 20000225//EN"
"http://www.w3.org/TR/2000/WD-xmlschema-1-20000225/structures.dtd"><xsd:schema
targetNamespace="http://my.netscape.com/publish/formats/rss-0.91"><xsd:element name="rss"><xsd:complexType>
<xsd:element ref="channel"/><xsd:attribute name="version" fixed="0.91" minOccurs="1"/>
</xsd:complexType></xsd:element>
![Page 63: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/63.jpg)
RSS Schema 2<xsd:element name="channel"><xsd:complexType>
<xsd:element ref="title"/><xsd:element ref="description"/><xsd:element ref="link"/><xsd:element ref="language"/><xsd:element ref="item" minOccurs="1" maxOccurs="*"/><xsd:element ref="rating" minOccurs="0"/><xsd:element ref="image" minOccurs="0"/><xsd:element ref="textinput" minOccurs="0"/><xsd:element ref="copyright" minOccurs="0"/><xsd:element ref="pubDate" minOccurs="0"/><xsd:element ref="lastBuildDate" minOccurs="0"/><xsd:element ref="docs" minOccurs="0"/><xsd:element ref="managingEditor" minOccurs="0"/><xsd:element ref="webMaster" minOccurs="0"/><xsd:element ref="skipHours" minOccurs="0"/><xsd:element ref="skipDays" minOccurs="0"/>
</xsd:complexType></xsd:element>
![Page 64: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/64.jpg)
RSS Schema 3...
<xsd:element name="name" type="xsd:string"/><xsd:element name="rating" type="xsd:string"/><xsd:element name="language" type="xsd:string"/><xsd:element name="width" type="xsd:integer"/><xsd:element name="height" type="xsd:integer"/><xsd:element name="copyright" type="xsd:string"/><xsd:element name="pubDate" type="xsd:date"/><xsd:element name="lastBuildDate" type="xsd:date"/><xsd:element name="docs" type="xsd:string"/><xsd:element name="managingEditor" type="xsd:string"/><xsd:element name="webMaster" type="xsd:string"/><xsd:element name="hour" type="xsd:time"/><xsd:element name="day" type="xsd:string"/>
…
</xsd:schema>
![Page 65: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/65.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 66: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/66.jpg)
Grammar Accessn No DOM standardn SAX 2n Common Applications
n Editorsn Stub generators
n Also experimental API in Xerces
![Page 67: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/67.jpg)
DTD Access Examplepublic class DTDDumper implements DeclHandler {
public void elementDecl (String name, String model) throws SAXException {
System.out.println("Element "+name+" has this content model: "+model);
}
public void attributeDecl (String eName, String aName, String type,
String valueDefault, String value) throwsSAXException {
System.out.print("Attribute "+aName+" on Element "+eName+" has type "+type);
if (valueDefault != null)
System.out.print(", it has a default type of "+valueDefault);
if (value != null)
System.out.print(", and a default value of "+value);
System.out.println();
}
![Page 68: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/68.jpg)
DTD Access Example 2public void internalEntityDecl (String name, String value) throwsSAXException {
String peText = name.startsWith("%") ? "Parameter " : "";
System.out.println("Internal "+peText+"Entity "+name+" has value: "+value);
}
public void externalEntityDecl (String name, String publicId, StringsystemId) throws SAXException {
String peText = name.startsWith("%") ? "Parameter " : "";
System.out.print("External "+peText+"Entity "+name);
System.out.print("is available at "+systemId);
if (publicId != null)
System.out.print(" which is known as "+publicId);
System.out.println();
}
![Page 69: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/69.jpg)
DTD Access Driverpublic static void main(String args[]) {
try {
XMLReader r = new SAXParser();
r.setProperty("http://xml.org/sax/properties/declaration-handler",
new DTDDumper());
r.parse(args[0]);
} catch (Exception e) {
e.printStackTrace();
}
}
![Page 70: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/70.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 71: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/71.jpg)
Round Trippingn Use SAX2 - DOM can’t do itn XML Decln Comments (SAX2)n PE Decls
![Page 72: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/72.jpg)
“Serializing”n Several choices
n XMLSerializern TextSerializern HTMLSerializer
n XHTMLSerializer
n Work as either:n SAX DocumentHandlersn DOM serializers
![Page 73: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/73.jpg)
Serializer Example
DOMParser p = new DOMParser();
p.parse("file:///twl/Work/ApacheCon/fm0.91_full.rdf");
Document d = p.getDocument();
// XML
OutputFormat format = new OutputFormat("xml", "UTF-8", true);
XMLSerializer serializer = new XMLSerializer(System.out, format);
serializer.serialize(d);
// XHTML
format = new OutputFormat("xhtml", "UTF-8", true);
serializer = new XHTMLSerializer(System.out, format);
serializer.serialize(d);
![Page 74: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/74.jpg)
Entity Handlingn Entity / Catalog handling
n Catalog handling is mostly for document processing (SGML) applications
n Pluggable entity handling is important for server applications.n Provides access point to DBMS, LDAP, etc.
n predefined character entitiesn amp, lt, gt, apos, quot
![Page 75: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/75.jpg)
Concurrencyn “Thread safety”
n Xerces allows multiple parser instances, one per thread
n Xerces does not allow multiple threads to enter a single parser instance.
n org.apache.xerces.framework.XMLParser#parseSome()
![Page 76: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/76.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 77: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/77.jpg)
Grammar Designn Attributes vs Elements
n performance tradeoff vs programming tradeoff
n At most 1 attribute, many subelements
n ID & IDREFn NMTOKEN vs CDATAn CDATASECTIONsn Ignorable Whitespace
![Page 78: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/78.jpg)
Grammar Design 2n Everything is Stringsn Data modelling
n Relationships as elementsn modelling inheritance
n Plan to use schema datatypesn avoid weird types
![Page 79: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/79.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 80: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/80.jpg)
Performance Tipsn skip the DOMn reuse parser instances (code)
n reset() method
n defaulting is slown external anythings are slower, entities
are slowern only validate if you have to, validate
once for all timen use fast encoding (US-ASCII)
![Page 81: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/81.jpg)
Outlinen Overviewn xml.apache.org projectn Basic XML Conceptsn SAX 1 / 2 n DOM Parsingn Namespacesn XML Scheman Grammar Accessn Round Trippingn Grammar Designn Performance tipsn Xerces Architecture
![Page 82: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/82.jpg)
Xerces Architecturen Use SAX based infrastructure where
possible to avoid reinventing the wheeln Modular / Framework designn Prefer composition for system
components
![Page 83: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/83.jpg)
Xerces Architecture Diagram
SAXParser DOMParser DOM Implementations
Error Handling XMLParser XMLDTDScanner DTDValidator
XMLDocumentScanner XSchemaValidator
Entity Handling
Readers
![Page 84: ApacheCon 2000 Everything you ever wanted to know about XML Parsing](https://reader035.vdocuments.us/reader035/viewer/2022081715/54bdd7154a7959f5368b45f6/html5/thumbnails/84.jpg)
Things we don’t don HTML parsing (yet)
n java Tidy http://www3.sympatico.ca/ac.quick/jtidy.html
n Grammar caching (yet)n Parse several documents in a stream