10 mistakes you can made with xml and perl10 mistakes you can made with xml and perl ... xml...

Post on 27-Jun-2020

5 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

YAPC::CA, May 2003 Adler - 10 Mistakes with XML and Perl �

10 mistakes you can made with XML and Perl

“more bugs, more quickly”

Andy Adler

Remember: It’s all fun and games …until someone laughs their head off.

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

What is XML?

� Andy's Answer: �ASCII syntax for data structures

� XML looks like HTML�However, conceptually very different

� Fairly simple spec�“Tons” of associated specs

� Standards mostly controlled by fairly sane people�Broad support from open source to MS

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Simple XML example

���������������� ���������������������������� �����������

������������������������ ��� ������������ ������������������������������������������� ��� ������������ ��� ��

����������

������������������

Tag Attribute

PCDATA

End Tag

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

More compilated example

� File format for StarOffice 6.0 (ie openoffice.org) is XML based

� Text sections represented in XML Images are referenced. File is zipped.

� Used the following document:�������������������������� ������!�������

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

�"���#���������#���������#���������#������ �� $� ����%�������%�������%�������%��� �&'()*�"��+,�-'./0,�-'./0,�-'./0,�-'./0 ������1%�������)��������/&234-/&234-/&234-/&234-�)������������ �����,',�������,��������� $��05�������� %�%��

�������1%�������%�������%�������%�������))))��������������������������������������������1������������������������ �����1������������ �����$$$������������������������1������������ �����1������������ �����$$$���������������������1�������� �����1��666 67 �����888�9:3�(���������������������1���!���!���!���! �����1��666 67 �����888����!�����������������1������������ �����1��666 67 �����88*�;���;��;3�������������������������1���������������� �����������������������������1#������#������#������#������ �� $���������1��������������������������

More complicated exampleDocument

TypeDeclaration

XMLNamespaces

usage

Definition ofNamespace

“office”

ProcessingInstruction

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

������1����������������))))%���%���%���%��� ��������������������1������������ �<�=�����������1����������������))))�������������������� �<�=�����������������������1����������������))))��������������������))))���������������������������� ��6�������������������������1����������������))))�������������������� �#��=�����

��������1����)%������������1����������������������������))))���������������������������������1=�%��

����1��>�������>�������>�������>�����)%���������1��>�������>�������>�������>�����))))%���%���%���%���

������������1%�����%�����%�����%�����))))����������������������������))))��#����#����#����#�� �$�������������1������������ �'�����

�����1��>�����)%���������1���� ������������1��������������������))))������������ ���%������

������������1��#����#����#����#�� �������������������������� ������!��

�����1����������1=�%��

��������1%�������)��������

More compilated example

XMLcolouringcourtesy

of vim

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

This is too easy …

� I promised that I’d help you write buggy XML code

� But this looks way too easy.� How can we write

� Hard-to-debug� Job-security-enhancing� Happen-only-occasionallybugs with this stuff?

Just wait, my friends …

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Bug #1:Confusing Attributes and Data � Some people get worried when they need

to create a new XML format to encode their data:

� Should I put the information in�attributes or� text (PCData)?

Adler - 10 Mistakes with XML and Perl YAPC::CA, May 2003

Bug #1:Confusing Attributes and Data � Should I say:

���%6����������%����������������� ��������

�������������������� �����%�������%6����

Or���%6����

������%���������������������������������%��������

�������%���������%6�����

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Bug #1:Confusing Attributes and Data

� Answer:Both

� In fact, alternate between them.� After all, there’s more than one way to do it

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #2: Abusing bloated and buggy standards � Example #1: SOAP:

�"���#������ ?� $?�����%��� ?&'()*?"��:�</)05@10�#����������1:�</)05@

�����1�������� ����� ����������#�������������1�� �����1��666 67 �����888�9;3:����)���������������1�% �����1��666 67 �����888�9;3:������

�:�</)05@12�%������1�����������1��� ����1������)���#����

:�</)05@1����%���:���� �����1�������� ����� ������������%������

������%)���������1��� �����1�������� ����� ������������%�����

��1���� ����1<��������1��7� �����1&:0A�5<;0�:/<-0����1���'��� ���71-����%B�$C��

��������1���� ���71�&A�,<'<��0A0�&A�,<'<��0A0�&A�,<'<��0A0�&A�,<'<��0A0��

Finally,SomethingWe careabout

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #2: Abusing bloated and buggy standards � Example #2: MathML:

�"���#������ �� $������%��� ����)**D8)��"���������� �����1��666 67 �����88*�;���;��;3��

����(���������6��������E�����

��������������F�����

�����6����� �������������

��������������6�

��������������G��H�����

�����6��������

All this XML gets us to the fraction

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #2: Abusing bloated and buggy standards � Lots more examples:

�XML Schema�RDF�Open office format

� Unfortunately, some aren’t too bad�SVG <= from Adobe no less!

Luckily, these are few

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #2: Abusing bloated and buggy standards � There are 2 ways to parse these bloated

documents�Reach into the data structure and grab the

elements you care aboutOr�Carefully parse the structure and check

everything

� The best part is: They’re both wrong!

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Approach #1: only grab the elements you care about� Consider this document����������� �������I�

������%�������� ��������=���I��

���������

� I parse with “/recipe/ingredient[1]@name”This is XPath syntax � later

� Now, we get the following document instead:

����������� ������I�

������%�������� ������J� =���I��

���������

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Approach #2: Carefully parse the structure and check everything� Our server carefully parses this SOAP message�:�</)05@10�#����� ����1�%

�����1��666 67 �����888�9;3:������

� Then one day, our application breaks,after lots of debugging, the client now sends:

�:�</)05@10�#����� ����1�% �����1��666 67 �����$$��$$��$$��$$��9;3:������

� How the *&^%^& is our application supposed to know that the 2001 and 1999 schema specs are essentially identical?

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #3:Abusing bloated and buggy tools � The basic XML API is the Document

Object Model (DOM)� Here’s a sample

���K��%��� �K%��)����0�������2�'�5�� E�-�,02<:0�FH

���K�� �K��%��)����3�����H����E���K�� �$H�K����K�H�K�LLF���M

���K��%�� �K��%��)������EK�FH���K����� �K��%�)�

���<����=���5�%� E��A0(�FH������K����)����@���� ��N��H

O

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

What’s wrong with DOM

� Simple answer:

It goes against everything Perl stands for

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

What’s wrong with DOM

� Lets compare long and painful XML API’s to string parsing in C

� The fact that its painful, tends to encourage people to take shortcuts for micro-optimization

� Example:� �P�A���#�� ��������������������������P�

� �������B�������E�������F)Q�C� $H

� Correct Solution:� K�������� R��M�N ���KOMOH

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

What’s wrong with DOM

� This kind of shortcut is:�Hard to read�Hard to write�Tends to result in other bugs�Leaks of pointers, variables

� And worst of all�When your concentration is forced onto

painful code, you loose cognitive momentumTM

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #4: How to confuse validity and well-formedness� Naturally, one would expect that a “valid”

document conforms to the specification.Wrong!

� A document that conforms to the specification is called “well-formed”

� A “valid” document conforms to its DTD�or maybe its schema,�or maybe its RELAX spec (or any one of the

many ways to specify the document syntax)

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

How is this a bug?

� Valid XML documents are rare� Although in theory, DTD’s etc. are a good idea,

I’ve rarely seen them in practice.Reasons include:� Industry time pressures� General sloppy thinking about testing� Lack of tool support� Specification overload: DTD’s were painful, but the

W3C replacement is worse� Important specs like SOAP don’t support validation

� Therefore:Almost any XML document is invalid

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

More about validity

� Here’s some interesting information about SOAP�A�����S�����B;:C����6�����������T��������� �����

6������������������6�1 �'����>��������������:�</����%�%�����������������1��<�:�</��������;&:'�5�'���������,���������'����,�������� �<�:�</���������;&:'�5�'��������/����������4�����������

� Therefore SOAP requires invalid XML

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #5:Unreliable parsing with regexps� You may think that there should be some

good packages out there to parse XMLYou’re right

� There are lots of API’s to parse XML:� DOM� SAX� JDOM (Java DOM)

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #5:What about perl API’s?� There are lots and lots of perl API’s to

parse XML:

Meta::Lang::XmlXML::CheckerXML::ValidWriter

PApp::XMLXML::GeneratorXML::Xerces

XML::DOMXML::RAXRPC::XML::Parse

XML::TemplateXML::Grove::SubstXML::FOAF

XML::YYLexXML::XimpleXML::SAX::ExpatXML::SAX::BaseXML::MiniXML::GDOMEDBIx::XMLXML::Filter::DigestXML::TwigXML::LibXMLXML::HandlerXML::ConfigXML::SimpleXML::DumperXML::SAX::PurePerl

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #5:What about Perl API’s? There are lots and lots and lots of perl API’s

to parse XML:

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #5:Unreliable parsing with regexps� Many people have thrown in the towel and

said *&^%$ it.�XML is text�Perl parses text�Therefore: I’ll use a regular expression

� Even Tim Bray (a founder of XML) admits to this � www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #5:Unreliable parsing with regexps� However, parsing balanced text with regexps is

(deliciously) unreliable� Consider:���%6����

������%� ������������ ���������������������������� �����%���

����%6����� Parsing:������%� ��� �BN�CP������� �BN�CP��HK���� �K�H�K������ �K�H� Problem: attribute order is not part of XML infoset

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Bug #6:Unreliable encoding with printf� Example: encode passport data in XML� Code:������ ����������U�����������IVK%�H

� Problem: my passport data isK%� �-<5<,30A��<5,A0S�����H

� So now XML is:���������

-<5<,30A��<5,A0S����

�����������

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Bug #6:Unreliable encoding with printfUnfortunately, there are good ways to get

around it in perl� Quote your <&> tags:

����NG��H��

� Use a module: �XML::Twig�XML::Writer

� Or use CDATA������������B+-,<'<B��

K%�� ��CC������IH

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #7:Parsing to make your code crawl

� You say: use XML::DOMor use XML::Simple

� which says: use XML::Parser� which says: use Expat� which uses: expat.xs� which parses the text

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #7:Parsing to make your code crawl

� Expat is a standard API for lots and lots of XML code: Perl, Apache, Python, Mozilla

� Its fast and well understood� History: Created by James Clark, a British expat living

in Thailand� Expat has some weirdnesses

� It doesn’t validate� API calls don’t necessarily return full data elements� Statically linked with patches to many modules

� Using XML::Parser (or better expat.pm) directly is a great source of code pain

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #8: Parsing to make your memory thrash � Take this presentation, convert to openoffice.org format.K���W������ �)9;3��! ��� ������� ��

K��� )����� �)9;3��! ��� ������� ���8�7D*�������� ��QX��Q���� �)9;3��! ���

� Now, lets parse this with XML::SimpleK����� );9;311:�����

)�?K� Y���������� ��YH?)�?K� 9;3��EK�FH?H

� Size of $r is 4656k (cygwin perl, win2k )� zipped data = 1� XML format = 6.2 times bigger� Perl objects = 100 times bigger

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #8: Parsing to make your memory thrash � A common complaint is that XML is

bloated data format. It’s true

� However, compressing XML is good�Often *.xml.bz2 is smaller than an equivalent

bespoke binary format

� The real bloat happens in the object representation from the parser

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #9: How to abuse broken name spaces support� Name space support is kind of like using a

local variable in perl�����K��� �Z

K���)�M�����%O �Z

� XML Example�:�</)05@10�#����� ����1:�</)05@ �

�����1�������� ����� ����������#�����������:�</)05@12�%���

� The use of “SOAP-ENV”, should be an arbitrary (if usual) choice

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #9: How to abuse broken name spaces support� Fortunately, for very many

implementations (especially of SOAP), the namespace is hardcoded

� This means that you can have:�XML documents that meet the spec that don’t

work with the parser�XML documents that don’t meet the spec (ie.

have the namespace point elsewhere) and are still accepted and processed

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bug #10: How to modify your XML during validation� Here’s an underexploited bug: Did you

know that the “validation” process is allowed to change the XML document?

� The DTD (or schema) is allowed to specify default attiributes that are inserted into the document during validation

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bonus Bug #1: Store information outside the infosetInformation that is not in the infoset should not

make any difference� Character encoding

� Example: é vs. &eacute;� PCDATA vs CDATA� Whitespace in elements

� Example <foo att=“bar”> vs.<foo att = “bar” >

� Order of attributes� Example <foo att1=“bar1” att2=“bar2”>

<foo att2=“bar2” att1=“bar1”>

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Bonus Bug #1: Store information outside the infosetNon infoset information, cont’d� Entities vs PCDATA� Namespace names

� SOAP Example:�:�</)05@10�#����� ����1:�</)05@ �

�����1�������� ����� ����������#�����������:�</)05@12�%���

versus�A�/0)05@10�#����� ����1A�/0)05@ �

�����1�������� ����� ����������#�����������A�/0)05@12�%���

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Bonus Bug #2: Calling. MethodsWithNamesThatAreWayTooLongAndMakeCodeHardToRead();

I call this effect JAVA itis

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bonus Bug #3: Backwards thinking with callbacks

Warning: code guru’s, plug ears now.

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bonus Bug #3: Backwards thinking with callbacks� Most of the sophisticated XML API’s require you

to use callbacks� Example (with XML::Parser, others are worse)����9;311/����H

K� ��6�9;311/����E��%���� ��M:���� ��NG��%�������V0�%� ����NG��%�����%V-��� ���NG��%������OFH

K�)�����E?���� �% ������S������?FH�

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bonus Bug #3: Backwards thinking with callbacks� Now we just need to write callback functions��=���%������� M[�:�����1������������������[�:�����1�����������6�����6����

O

� Keeping track of state in the callback function is a real pain.� We don’t have a proper object to store it� Tend to use hacks like static variables� Makes us do unnecessary mental gymnastics

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bonus Bug #4: Broken tools

� MSXML didn't work at all well until version 3. It still doesn't handle external entities properly

� No Debuggers for XML parsers. �Now activestate has a visual XSLT

� No good tools for writing DTDs and schemas. So most people don't bother

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Bonus Bug #4: Broken tools in perl

� Lets try to install SOAP::Lite from CPAN� Just to be clear SOAP::Lite is a great module

� Look at prerequisites� :�</113���:�</113���:�</113���:�</113���

� ;4;0113���;4;0113���;4;0113���;4;0113���� ;4;011/����;4;011/����;4;011/����;4;011/����

� ;4;011'����;4;011'����;4;011'����;4;011'����� IO::Stringy� Mail::Field� Mail::Header� Mail::Internet

� 9;311/����9;311/����9;311/����9;311/����� HTTP::Daemon� LWP::UserAgent� URI

ceteraet

��

��

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Actually, there’s lots good about XML. slashdot.org quote:

Re:But XML is great for computers... (Score:5, Insightful) by Ed Avis (5917) <ed@membled.com> on Tuesday March 18, @08:48AM (#5535893)You mean like most other non-xml config files in /etc, like say hosts, DNS zone files, named.conf, passwd/shadow, hosts.allow/deny, sendmail.mc or resolv.conf (etc. etc.)? These have standard layouts, text-based, can be edited by hand and can be easily parsed.You just gave the best argument for adopting XML as widely as possible. Yes, all these can be parsed (with the possible exception of sendmail'sconfig files which may be Turing-complete) but they all require *different*code for each config file. If they were in XML you'd still need different semantic code, of course, but a whole wodge of syntax issues (how do I quote strings, how do I escape newlines, how do I mark nested scopes, what happens when the string delimiter character occurs inside a string, how do I deal with comments, what is the character set, is there a formal grammar for the document, etc etc) would be dealt with. … But they would be dealt with *once*. No need to learn a new or almost-the-same-but-slightly-different set of syntactic conventions for every single config file.

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Non Bug #1: XML is text

� Unfortunately, XML doesn’t have a bizarre binary encoding like ASN-1 (or msword).

� Even worse, XML has a consistent, clean implementation of Unicode UTF-8.� Unlike what JAVA or MS say, Unicode is not a 2

byte character, that’s the BMP-1 subset of UTF-16� UTF-8 is compatible with ASCII, and allows C style

strings (with null termination)� Your best bet for bugs is to badly integrate

broken unicode support in JAVA� Unfortunately, this talk is about Perl

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Non Bug #2: clean separation of syntax and semantics� Unfortunately, XML is now quite well

understood to be a data structure�Some think it’s a programming language�Some think it’s HTML

� Rigid definition of well-formedness�Unlike pdf, where Adobe writes a spec, but

actually parses documents differently�Style sheets allow conversion of XML data to

different presentation formats (and are slow, slow, slow)

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

XML Advice

� Don't hand parse XML

� Be careful hand creating XML �need to “quote”, < , > , &

� Don't use XML::DOMDon't use XML::Parser

Adler - 10 Mistakes with XML and Perl �YAPC::CA, May 2003

Advice: use XML::Simple

� Use XML::SimpleK���������� �9;3��E

������������������F[���#����������������%�����������K���������� �9;3����E�K�������������F

� Good for small XML documents.� Lots of options to control how structure is

created. I don't like the default options

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Advice: use XML::XPath::Simple

� Xpath is like a regexp language for XML

����9;3119/��11:�����H

K� ���6�9;3119/��11:�����E��� ��K��������V������� ��?�?�FH

K�������� �K�)�#�����E

?�%����B�C�%B�CT�%?FH

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Future of XML and Perl

� XML will not be well used until there is good in-language support (à la Perl regexps)

� It must be easier to use XML than to invent an *ini style file format

� Require “under the covers” caching� Perl XML support is too scattered -we need

less and better modules.� The Perl way is/should be to make it easy to

right thing, the Java way is to forbid the wrong thing. We must perlize XML.

Adler - 10 Mistakes with XML and Perl ��YAPC::CA, May 2003

Ten Mistakes with XML and Perl, Andy Adler, YAPC::CA, Ottawa, May 15, 2003

Abstract: XML will enable universal data interoperability by providing a vendor independent, OSI level 7 presentation layer, semantic capable, interoperable data format. Since XML is so easy to use, it is unclear why there is a need for talks on how to use it. Fortunately, with the power of Perl, and a few helpful tips, anyone can add hard to track bugs to their XML handling code. We will focus on the following techniques:

� Confusing Attributes and Data � Abusing bloated and buggy standards � Abusing bloated and buggy tools � How to confuse validity and well-formedness� Unreliable parsing with regexps� Unreliable encoding with printf� Parsing to make your code crawl � Parsing to make your memory thrash � How to abuse broken name spaces support� How to modify your XML during validationNow, new and improved, including 4 bonus bugs.

top related