
Copyright 2001, ActiveState

Python and XML

About me

• Paul Prescod, ([email protected])

• ActiveState Senior Developer

• Co-Author, XML Handbook

Preview

• About Python

• Python SAX/DOM

• PyXML Package

• Python XSLT/XPath

• Python SOAP/XML-RPC

• XML and Zope

What is Python?

• Python is an easy to learn, powerful programming language.

– Efficient high-level data structures

– Simple approach to object-oriented programming.

– Elegant syntax and dynamic typing

Brief History of Python

• CWI, early 90s.

• Dynamic Object Oriented High Level Language.

• More than a text processing language.

• More than a scripting language.

• Scalable and object oriented from the beginning.

• Dynamically type checked.

Python's business case

• Python can displace many other languages in the organization.

• The Python interpreter is free.

• Python is legally unencumbered.

• Professional programmers find Python more flexible than most languages.

• Amateur programmers are (often) more comfortable than with Perl or Java.

Usability features

• Exceptionally clear syntax.

• Provides an obvious way to do most things.

• Small set of features combine in powerful ways.

• Only innovative where innovation is really necessary.

More Usability features

• Huge amount of free code and libraries.

• Interactive.

• Designed to talk to the world.

• Runs with Unix, Mac and Windows.

• Integrates with JVM (Jython) and .NET Framework (Python.NET)

• Talks MS COM, XPCOM, CORBA, SOAP, XML-RPC, …

Scalability features

• Simple but powerful module system.

• Simple but powerful class system.

• Structured, standardized exceptions.

Environments

• Unix (almost all)

• Windows (3.1, 95, NT, CE)

• Mac

• JVM

• Various legacy systems...

Extendable

• New data types -- in Python or C

• Modules -- in Python or C

• Functions -- in Python or C

Python isn't picky!

• COM/CORBA

• HTML/XML/SGML

• Win API/POSIX

• You can write code that is portable or platform-specific.

Compared to Perl

• Simpler syntactically.

• More object oriented.

• Easier to extend.

• But slower regular expressions...

Compared to Java

• Java is more difficult for amateur programmers.

• Static type checking can be inconvenient in text processing.

• Puritanical OO can be inconvenient.

• Bottom line: Java can make simple projects harder.

Why not Java: political

• "100% pure Java" gets in the way.

• The Java environment punishes interoperability. (e.g. getenv is deprecated)

• Java is designed to have interoperability limitations.

• Embedding Java is relatively painful.

Jython (nee JPython)

• Compiles Python classes to Java classes

• Embedded interpreter allows interactive coding.

• Access to all Java classes.

• For better or worse: maintains Java's security/platform-independence bubble.

Jython can use Java tools

• RDF

• XPointer

• Various parsers

• Swing GUI

• Unicode

Python Limitations

• "Ordinary Python" has 8-bit and Unicode string types.

– Handling explicit conversions can be annoying.

• Not as fast as C++.

• Raw text searching is not as fast as Perl.

• Dynamic type checking requires more care in testing.

Python "Hello world"

print "Hello, World"

Python interpreter

• Just type:

C:\> python
Python 1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)] on win32
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "Hello, World"
Hello, World
>>> print "Goodbye, World"
Goodbye, World
>>> ^Z

C:\>

Byte-compiling

• Python automatically bytecompiles modules.

• Next execution does not require compilation.

• .py files get a .pyc in the same directory

• When the .py is updated, the .pyc is updated

Interpreters

• DOS/Win32 (last slide)

• Unix (use ^D to exit)

• Graphical: “IDLE”, “PythonWin”

Python variables

• Any Python variable can hold any value.

>>> width = 20
>>> height = 5 * 9
>>> width * height
900
>>> width = "really wide"
>>> width
'really wide'

Numeric types

• int: 32 bit, e.g. "x=5"

• long: arbitrary sized, e.g. "x=2L**128"

• float: accuracy depends on platform, e.g. "x=3.14"

• complex: real+imag., "x=5.3+3.2j"

Sequence types:

• Strings: "abcd"

• Tuples: (1,2,"b")

• Lists: [1,"a",3]

Sequence operations

• Iteration:
for i in myList:
    print i

• Numeric indexing:
k = myList[3]

• Slicing:
k = myList[2:5]

Sequence types: string

myStr = "abc"               # assignment
myStr = myStr + "def"       # = "abcdef"
for char in myStr:          # iterate
    print char
otherstr = myStr[1:4]       # = "bcd"

Sequence types: lists

myList = ["a", 5, 3.25, 2L, 4+3j]
anotherList = ["a", myList, ["3", "2"]]
anotherList2 = myList + myList     # = ["a",5,...,"a",5,...]
yetAnotherList = myList[1:3]       # = [5,3.25]

Iterating over sequences

strlist = ["abc", "def", "ghi"]
for item in strlist:
    for char in item:
        print char

Sequence Concatenation

>>> word = 'Help' + 'A'
>>> word
'HelpA'
>>> list = ["Hello"] + ["World"]
>>> print list
['Hello', 'World']

Sequence Indexing

>>> str="abc"
>>> str[0]
'a'
>>> str[1]
'b'

Negative indexes

>>> word[-1]     # The last character
'A'
>>> word[-2]     # The last-but-one character
'p'
>>> word[-2:]    # The last two characters
'pA'
>>> word[:-2]    # All but the last two characters
'Hel'

Getting the length

• The len() function gets a sequence's length

>>> len( "abc" )
3
>>> len( ["abc","def"] )
2

Tuples

• Immutable list-like objects are called "tuples"

>>> a=(1,2)
>>> a[0]=3
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment

Dictionaries

• Serve as a lookup table

• Maps "keys" to "values".

• Keys can be of any immutable type

• Assignment adds or changes members

• keys() method returns keys

Dictionaries

>>> dict={"a":"alpha", "b":"bravo","c":"charlie"}
>>> dict["abc"]=10
>>> dict[5]="def"
>>> dict[2.52]=6.71
>>> print dict
{2.52: 6.71, 5: 'def', 'abc': 10, 'b': 'bravo', 'c': 'charlie', 'a': 'alpha'}

Dictionary Methods

>>> dict.keys()
[2.52, 5, 'abc', 'b', 'c', 'a']
>>> dict.values()
[6.71, 'def', 10, 'bravo', 'charlie', 'alpha']
>>> dict.items()
[(2.52, 6.71), (5, 'def'), ('abc', 10), ('b', 'bravo'), …]
>>> dict.clear()
>>> print dict
{}

File Objects

• Represent opened files:

myFile = open( "catalog.txt", "r" )
data = myFile.read()
myFile.close()                          # release the input file
myFile = open( "catalog2.txt", "w" )
data = data + "more data"
myFile.write( data )
myFile.close()                          # flush the output to disk

Function definitions

• Encapsulate bits of code.

• Can take a fixed or variable number of arguments.

• Arguments can have default values.
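
The three bullets above can be sketched in a few lines. This is a minimal illustration written for Python 3 (the slides themselves use Python 1.5/2.x syntax); the function name `describe` and its parameters are hypothetical, chosen only for the example.

```python
def describe(name, greeting="Hello", *extras):
    """`name` is required, `greeting` has a default, `extras` is variable-length."""
    parts = [greeting + ", " + name]
    for extra in extras:
        parts.append(str(extra))
    return " ".join(parts)

fixed = describe("World")                    # default greeting is used
custom = describe("World", "Goodbye")        # default is overridden
variable = describe("World", "Hi", 1, 2, 3)  # extra positional arguments are collected

print(fixed)
print(custom)
print(variable)
```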

Functions are objects

>>> def myClickFunction():
...     print "I was clicked"
...
>>> # assume button is a GUI button
>>> button.OnClick = myClickFunction
>>> print button.OnClick.__name__
myClickFunction
>>>

Flow Control Statements

• if/then/else

• while

• for

• try

Exception handling

• Python exception handling works like that of Java/C++.

• Errors are reported in tracebacks.

• Exceptions propagate up.
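
A small sketch of propagation, assuming Python 3 syntax (`except ... as err`; the Python of the slides would write `except ZeroDivisionError, err`). The function names `a`, `b` and `c` are illustrative only.

```python
def c():
    return 1 / 0          # raises ZeroDivisionError

def b():
    return c()            # no handler here, so the exception passes straight through

def a():
    try:
        return b()
    except ZeroDivisionError as err:
        # The exception raised in c() travelled up through b() to this handler.
        return "caught: " + type(err).__name__

result = a()
print(result)
```

If `a()` had no `try` block either, the exception would reach the top level and the interpreter would print a traceback.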

Exception traceback

Traceback (innermost last):
  File "test.py", line 10, in ?
    a()
  File "test.py", line 2, in a
    b( )
  File "test.py", line 5, in b
    c( )
  File "test.py", line 8, in c
    1/0
ZeroDivisionError: integer division or modulo

Classes

• Classes combine code and data.

• They represent real world objects.

• We create "instance objects" from classes.

• Closest languages in terms of object model are Smalltalk or Ruby.

• Much more flexible than Java or C++

• More central to the language than in Perl/Tcl/PHP.
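
As a rough illustration of "code and data" and of instance objects, here is a minimal sketch in Python 3. The class `Bookmark` and its attributes are hypothetical names invented for this example.

```python
class Bookmark:
    """Combines data (title, href) with code (describe)."""

    def __init__(self, title, href):
        self.title = title
        self.href = href

    def describe(self):
        return "%s <%s>" % (self.title, self.href)

# Create an instance object from the class.
mark = Bookmark("SIG for XML Processing in Python",
                "http://www.python.org/sigs/xml-sig/")
summary = mark.describe()

# One aspect of the flexibility: attributes can be added to an instance at run time.
mark.visited = True
print(summary)
```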

Inheritance

• Classes can specify a base class.

• The new class "inherits" methods and data.

• The new class can

– "override" methods.

– add data and methods.

• Multiple Inheritance is okay

• All methods are virtual.
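
The points above might look like this in practice. It is a sketch in Python 3; the classes `Node`, `Printable` and `Element` are made up for the example and are not the DOM classes discussed later.

```python
class Node:
    def __init__(self, name):
        self.name = name

    def kind(self):
        return "node"

    def label(self):
        # self.kind() resolves to the subclass override: every method is virtual.
        return "%s:%s" % (self.kind(), self.name)

class Printable:
    def show(self):
        return "<" + self.label() + ">"

class Element(Node, Printable):        # multiple inheritance
    def __init__(self, name, attrs=None):
        Node.__init__(self, name)      # reuse the base class initializer
        self.attrs = attrs or {}       # added data

    def kind(self):                    # overridden method
        return "element"

el = Element("folder")
text = el.show()
print(text)
```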

Modules and Packages

• A module is a set of code in a single file.

• A package is a collection of related modules.

XML and Python

• Accessing XML with Python

• Parsing XML with Python

– Non validating Parsers

– Validating Parsers

Reading XML

• XML as a character data stream

– the RE module

• XML as a tree structure

– lists of node objects

• XML as an event source

– event dispatching to methods

Parsers in Python

• C extension modules

– PyExpat

– sgmlop

• Written in Python code:

– xmllib

– xmlproc

Parsers for Jython

• Apache

• Sun XML

• XP

• Oracle

• ...

Manipulating XML

• Flat file processing with RE's (briefly!)

• PySAX - Simple API for XML

• PyDOM - W3C Document Object Model

• …

Flat File Processing

• XML documents are text.

• Ordinary textual tools continue to work.

• E.g. search for emph elements:

import re
# findall returns every match; .*? stops at the first closing tag
for i in re.findall( r"<emph>(.*?)</emph>", input ):
    print i

Flat File Recipe

• Unless your needs are very simple, let me help you!

• I’ve already converted the ultimate XML parsing regular expression to Python:

http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/65125

Events

• Think of an XML document as a series of events

• "Start tag", "End tag", "Characters", etc.

• We can handle hierarchy by tracking start/end tags.

• We can deal with the document a little at a time.
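
To make the event view concrete, the following sketch logs the events for a tiny document and tracks hierarchy with a stack of open tags. It assumes Python 3 and the standard library's `xml.sax`; the class name `EventLogger` and the sample document are illustrative.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class EventLogger(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []    # names of the elements currently open
        self.events = []   # (kind, detail) tuples in document order

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append(("start", "/".join(self.stack)))

    def characters(self, content):
        if content.strip():
            self.events.append(("chars", content.strip()))

    def endElement(self, name):
        self.events.append(("end", "/".join(self.stack)))
        self.stack.pop()

doc = b"<folder><title>XML bookmarks</title></folder>"
logger = EventLogger()
xml.sax.parseString(doc, logger)
for event in logger.events:
    print(event)
```

A parser is free to deliver character data in more than one `characters` call, so real handlers usually accumulate text until the matching end tag.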

PySAX

• "Simple API for XML"

• Common API for parsers.

• Based on Java API.

• Parser implements certain interfaces.

• Application implements callback interfaces.

SAX Model

• The application hands the parser an event handler object.

• The parser sends events to the handler.

• The handler can

– store them somehow,

– build something,

– re-route them to other parts of the app.

Application side

• Applications must provide:

– ContentHandler

– ErrorHandler

– DTDHandler

– EntityResolver

• Parser developer implements:

– XMLReader

– A few more (out of scope)

ContentHandler

• Captures document instance events.

• App can:

– Build app. objects.

– Output something.

– Build a GUI

– ...

ContentHandler callbacks

• Main ones:

startElement(name, attrs)

endElement(name)

characters(content)

ignorableWhitespace(whitespace)

processingInstruction(target, data)

(cont’d)

ContentHandler eg

from xml.sax.handler import ContentHandler

class countHandler(ContentHandler):
    def __init__(self):
        self.tags={}

    def startElement(self, name, attr):
        if not self.tags.has_key(name):
            self.tags[name] = 0
        self.tags[name] += 1

ContentHandler eg

import xml.sax

parser = xml.sax.make_parser()

handler = countHandler()

parser.setContentHandler(handler)

parser.parse("test.xml")

print handler.tags

PySax Distribution

• Default content handler implementation is provided.

• Subclass can override only what it needs.

• Function to get parser is also provided.
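
A sketch of these three points in Python 3: the handler subclasses the provided `ContentHandler` and overrides only `startElement`, and `xml.sax.make_parser()` supplies the parser. Feeding the document in chunks with `feed()` and `close()` assumes the default expat-based reader, which supports incremental parsing; the class name `CountHandler` and the chunk boundaries are arbitrary choices for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class CountHandler(ContentHandler):
    """Overrides only startElement; all other callbacks keep the default no-op."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = {}

    def startElement(self, name, attrs):
        self.tags[name] = self.tags.get(name, 0) + 1

parser = xml.sax.make_parser()       # the provided factory function
handler = CountHandler()
parser.setContentHandler(handler)

# The document arrives a little at a time; no file on disk is needed.
chunks = [b"<folder><title>XML", b" bookmarks</title>", b"<bookmark/></folder>"]
for chunk in chunks:
    parser.feed(chunk)
parser.close()

print(handler.tags)
```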

ErrorHandling

• In addition to the content handler, we should assign an error handler.

class MyErrorHandler:
    def warning(self, exception):
        print "Whoa, nelly!"
        print exception

    def error(self, exception):
        print "Whoa, nelly!"
        raise exception

    def fatalError(self, exception):
        print "Whoa, nelly!"
        raise exception

ErrorHandling (cont'd)

...
errHandler = MyErrorHandler()
parser.setErrorHandler( errHandler )
parser.parse("\\temp\\test.xml")
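
To see an error handler fire without needing a file, here is a self-contained sketch in Python 3 that parses a deliberately malformed string. `CollectingErrorHandler` and the sample document are hypothetical; `xml.sax.parseString` and `SAXParseException` come from the standard library.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

class CollectingErrorHandler(ErrorHandler):
    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", exception.getMessage()))

    def error(self, exception):
        self.problems.append(("error", exception.getMessage()))
        raise exception

    def fatalError(self, exception):
        # Well-formedness violations are reported here.
        self.problems.append(("fatal", exception.getMessage()))
        raise exception

err_handler = CollectingErrorHandler()
bad_doc = b"<folder><title>unclosed</folder>"   # <title> is never closed

try:
    xml.sax.parseString(bad_doc, ContentHandler(), err_handler)
    outcome = "parsed"
except xml.sax.SAXParseException as exc:
    outcome = "rejected"
    print("line", exc.getLineNumber(), "-", exc.getMessage())
```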

Character handling

# print out characters in document
from xml.sax.handler import ContentHandler
import xml.sax, sys

class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
parser.setContentHandler(textHandler())
parser.parse("test.xml")

Document Object Model

• Document Object Model

• The DOM is a W3C standard.

• Extended version of "Dynamic HTML"

• Defined in CORBA IDL.

• Implemented in various languages.

• Implemented in IE5.0 and eventually Netscape

The DOM

• The DOM is a tree-based API.

• This implies a certain amount of overhead.

• But also a lot of convenience and flexibility.

• XPath implementation essentially requires tree-based APIs.

DOM Nodes

• Elements, attributes, comments, etc. called "nodes".

• Classes represent node types.

• All node types subclass the "node" base class.

Node Objects

• Example methods include:

– getNodeType

– getParentNode

– getChildNodes

– getAttributes

– insertBefore

– cloneNode
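
The slide lists accessor-style names such as getNodeType. In the standard library's `xml.dom.minidom` the same information is exposed as plain attributes (`nodeType`, `parentNode`, `childNodes`, `attributes`), while `insertBefore` and `cloneNode` remain methods. A small sketch in Python 3, with a sample document chosen for illustration:

```python
from xml.dom import minidom, Node

dom = minidom.parseString("<folder><title>XML bookmarks</title></folder>")
folder = dom.documentElement
title = folder.firstChild

is_element = title.nodeType == Node.ELEMENT_NODE
parent_name = title.parentNode.nodeName
child_count = len(title.childNodes)

deep_copy = title.cloneNode(True)        # True copies the subtree as well
marker = dom.createComment("inserted")
folder.insertBefore(marker, title)       # place the comment ahead of <title>
xml_after = folder.toxml()
print(xml_after)
```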

Element Objects

• Elements are a representative subclass:

• getTagName

• getAttribute

• setAttribute

• getElementsByTagName
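
A short sketch of these Element operations using `xml.dom.minidom` under Python 3. Note that minidom spells the tag name as the attribute `tagName` rather than a getTagName method; the `visited` attribute is invented for the example.

```python
from xml.dom import minidom

dom = minidom.parseString(
    '<folder><bookmark href="http://www.python.org/sigs/xml-sig/">'
    '<title>SIG for XML Processing in Python</title></bookmark></folder>')

bookmark = dom.getElementsByTagName("bookmark")[0]

tag = bookmark.tagName
href = bookmark.getAttribute("href")
bookmark.setAttribute("visited", "yes")      # adds a new attribute

# getElementsByTagName searches all descendants, not only direct children.
titles = dom.documentElement.getElementsByTagName("title")
title_count = len(titles)
print(tag, href, title_count)
```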

DOM node types

ATTRIBUTE
CDATA_SECTION
COMMENT
DOCUMENT
DOCUMENT_FRAGMENT
DOCUMENT_TYPE

More DOM node types

ELEMENT
ENTITY
ENTITY_REFERENCE
NOTATION
PROCESSING_INSTRUCTION
TEXT

Navigation properties

• parentNode - Parent of this node

• firstChild - First child of this node

• lastChild - Last child of this node

• previousSibling - Node immediately preceding this node

• nextSibling - Node immediately following this node

• childNodes - List containing all the children of this node

Example

<folder>
  <title>XML bookmarks</title>
  <bookmark href="http://www.python.org/sigs/xml-sig/">
    <title>SIG for XML Processing in Python</title>
  </bookmark>
</folder>

Tree

folder
  title
    "XML Book…."
  bookmark
    href = "http://…"
    title
      "Sig for …"

First "title" node

Properties:

• parentNode: folder element

• firstChild: Text node 'XML bookmarks'

• lastChild: Text node 'XML bookmarks'

• previousSibling: None

• nextSibling: bookmark element

• childNodes: A 1-element list: [ Text node 'XML bookmarks' ]
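
These properties can be checked with `xml.dom.minidom` in Python 3. The sketch below writes the bookmark document without whitespace between tags, which is an assumption made for the example: minidom keeps whitespace-only text as Text nodes, so with the indented layout the neighbours of the first title would be whitespace Text nodes instead of None and the bookmark element.

```python
from xml.dom import minidom

dom = minidom.parseString(
    '<folder><title>XML bookmarks</title>'
    '<bookmark href="http://www.python.org/sigs/xml-sig/">'
    '<title>SIG for XML Processing in Python</title>'
    '</bookmark></folder>')

title = dom.documentElement.firstChild    # the first "title" element

props = {
    "parentNode": title.parentNode.nodeName,
    "firstChild": title.firstChild.data,
    "lastChild": title.lastChild.data,
    "previousSibling": title.previousSibling,
    "nextSibling": title.nextSibling.nodeName,
    "childNodes": len(title.childNodes),
}
for key, value in props.items():
    print(key, "=", value)
```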

DOM

• The DOM API is very large and beyond the scope of the tutorial.

• A few short examples will illustrate the basic model.

Building a DOM

from xml.dom import minidom

dom = minidom.parse("test.xml")
rootel = dom.documentElement
print rootel.nodeName
topnodes = rootel.childNodes

for toplevel in topnodes:
    print toplevel.nodeName

Searching a DOM

# print the last point element
# in the tree
print h.document.documentElement.\
      getElementsByTagName('point')[-1]

Modifying a DOM

appendChild(newChild)

insertBefore(newChild, refChild)

replaceChild(newChild, oldChild)

removeChild(oldChild)
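
The four calls above, exercised in order on a tiny tree. This is a sketch using `xml.dom.minidom` under Python 3; the element names and text are placeholders.

```python
from xml.dom import minidom

dom = minidom.parseString("<folder><title>old</title></folder>")
folder = dom.documentElement
title = folder.firstChild

bookmark = dom.createElement("bookmark")
folder.appendChild(bookmark)               # children: title, bookmark

note = dom.createComment("note")
folder.insertBefore(note, bookmark)        # children: title, comment, bookmark

new_title = dom.createElement("title")
new_title.appendChild(dom.createTextNode("new"))
folder.replaceChild(new_title, title)      # swap the old title for the new one

folder.removeChild(note)                   # drop the comment again

result = folder.toxml()
print(result)
```

New nodes are created through the Document object (`createElement`, `createTextNode`, `createComment`) and only become part of the tree once one of the calls above attaches them.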

The Document Node

• One Document node per document.

• The base of the entire tree

• documentElement attribute contains a single Element node

• childNodes may have additional children, such as ProcessingInstruction nodes.
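
A brief sketch of the difference between `documentElement` and `childNodes` on the Document node, using `xml.dom.minidom` under Python 3. The stylesheet processing instruction and its `style.css` target are made-up sample data.

```python
from xml.dom import minidom, Node

dom = minidom.parseString(
    '<?xml-stylesheet href="style.css" type="text/css"?><folder/>')

# The Document node holds everything at the top level, in document order.
top_level = [child.nodeType for child in dom.childNodes]

# documentElement always points at the single root Element.
root_name = dom.documentElement.nodeName
print(top_level, root_name)
```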

PyXML Package

• http://pyxml.sourceforge.net

• Collection of lots of useful Python XML stuff.

• Collectively maintained.

PyDOM

• A richer, more robust DOM than minidom.

• More classes, support for DOM 2+

• Integration with XPath and XSLT

PyXML Marshalling

• Convert Python types into XML

• xml.marshal.generic – generic base class

• xml.marshal.wddx – marshal Python types as WDDX

• xml.marshal.xmlrpc – marshal Python types as XML-RPC elements
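
To give a feel for what marshalling Python types as XML-RPC means, the sketch below uses the Python 3 standard library module `xmlrpc.client` as a stand-in. It is not PyXML's `xml.marshal` API, whose interface is not shown on the slide; the method name `echo` and the sample values are arbitrary.

```python
import xmlrpc.client

# A tuple of parameters covering several Python types.
params = ("Hello world", 42, [1.5, True], {"lang": "Python"})

# Marshal: Python values -> XML-RPC request document (a string).
payload = xmlrpc.client.dumps(params, methodname="echo")

# Unmarshal: XML-RPC document -> Python values and the method name.
decoded, method = xmlrpc.client.loads(payload)

print(payload)
print(decoded, method)
```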

PyXML Parsers

• xml.parsers.xmlproc

• qp_xml

• xml.sax.drivers

PyTrex

• PyTrex is a schema processor for the TREX schema language

• http://sourceforge.net/projects/pytrex/

• http://www.thaiopensource.com/trex/

Python SOAP/XML-RPC

• PythonWare distributes the XML-RPC client: www.pythonware.com

• There are various SOAP implementations:

– SOAP.py : http://www.actzero.com

– soaplib.py : http://www.pythonware.com

– 4Suite: http://4suite.org/

– …

Python SOAP Example

• SOAP.py:

import SOAP

server = SOAP.SOAPProxy( "http://localhost:8000/")

print server.echo("Hello world")

XML and Zope

• Zope is an Open Source application server that publishes objects on the Internet.

• ParsedXML: Breaks up an XML document into bits.

• XML-RPC: You can plumb the depths of Zope with XML-RPC.

• Zcatalog: Index based on element-type names, attribute names, etc.

ParsedXML

• A free Zope “product” (extension)

• Every element is a first-class Zope object.

• You can add “behavior” to XML documents

• RSS Channel Product

Zope XML-RPC

import xmlrpclib

d = xmlrpclib.Server('http://localhost:8080/Zope')
content = d.document_src()
content = content.replace('test', 'CHANGED')
d.manage_upload(content)

Redfoot

• Redfoot is a framework for distributed RDF-based applications, written in Python.

– an RDF database

– a query API for RDF

– an RDF parser and serializer

– a simple HTTP server providing a web interface for viewing and editing RDF

– a fully customizable UI

– the beginnings of a peer-to-peer architecture for communication between different RDF databases

More Information

• XML Topic Guide

– http://www.python.org/topics/xml/

• SIG

– http://www.python.org/sigs/

• ActiveState Programmers Network

– http://www.activestate.com/ASPN

• XML-DEV: subscribe at:

– [email protected]

General XML

• Definitive Spec.

– http://www.w3c.org/TR/xml-spec.html

• Annotated Spec.

– http://www.xml.com/xml/pub/axml/axmlintro.html

• FAQ:

– http://www.ucc.ie/xml

• Definitive Reference to all things XML

– http://www.oasis-org.org/sgml