SDPL 2002 Notes 8: XML Wrapping 1
8 Translating Data to XML
How to translate existing data formats to XML?
– (and why?)
XW (XML Wrapper)
– an "XML wrapper description language"
– developed in XRAKE project, Univ. of Kuopio, 2001–02
– Ek, Hakkarainen, Kilpeläinen, Kuikka, Penttonen: Describing XML Wrappers for Information Integration. In Proc. of XML Finland 2001, Tampere, Finland, Nov. 2001, 38–51.
SDPL 2002 Notes 8: XML Wrapping 2
XRAKE Project
"XML-rajapintojen kehittäminen" (Developing XML-based interfaces)
Studies definition and implementation of XML-based interfaces, and their application in
– integration of heterogeneous data sources
– management of mass printing
– assembly and manipulation of electronic patient records
SDPL 2002 Notes 8: XML Wrapping 3
XRAKE - Support
National Technology Agency of Finland (TEKES) and seven local IT companies/organizations
– DEIO IS
– Enfo Group
– JSOP Interactive
– Kuopio University Hospital
– Medigroup
– SysOpen
– TietoEnator
SDPL 2002 Notes 8: XML Wrapping 4
XW: Motivation
XML-based protocols developed for e-business, medical messages, …
Legacy data formats need to be converted to XML
– How?
SDPL 2002 Notes 8: XML Wrapping 5
XML-wrapping
Need "XML-wrappers" (aka extractors)
– interface/conversion program to produce an XML representation for source data
[Figure: source1, source2 and source3 are each read by a wrapper (wrapper1, wrapper2, wrapper3); the wrappers produce XML-form-1 and XML-form-2]
SDPL 2002 Notes 8: XML Wrapping 6
How to wrap?
1. With an interface integrated to source
– E.g. XML-interfaces of database systems
– OK, if available
2. With an ad-hoc written translator
– E.g. JDBC+Java or separator-encoded text form + Perl
– OK; conversion possibly efficient
– Development and maintenance tedious :-(
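As a rough illustration of option 2, an ad-hoc translator for a separator-encoded text form might look like the sketch below (in Python rather than Perl). The `name;phone` row layout, the element names and the function `rows_to_xml` are all hypothetical. The layout is hard-coded, which is what makes such translators tedious to maintain.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(text, sep=";"):
    """Convert rows of the assumed form 'name;phone' into a <rows> document."""
    root = ET.Element("rows")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank rows
        name, phone = line.split(sep)  # hard-coded, hypothetical two-field layout
        row = ET.SubElement(root, "row")
        ET.SubElement(row, "name").text = name
        ET.SubElement(row, "phone").text = phone
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml("Smith;555-0100\nJones;555-0101\n")
print(xml_text)
```

Any change in the source layout (a new field, another separator) means editing the program itself.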
SDPL 2002 Notes 8: XML Wrapping 7
How to wrap? (2)
3. Generic source-independent wrapping
– requires a file/message/report produced by the system
» normally should be available
– with a proper methodology development and maintenance should become easier
=> Wrapper description language XW
SDPL 2002 Notes 8: XML Wrapping 8
XW (XML Wrapper)
XML-based, declarative wrapper description language
To convert from a
– textual or binary source
to XML form
SDPL 2002 Notes 8: XML Wrapping 9
XW: Design principles
A concise and natural XML syntax
– description of simple and typical conversion tasks should be simple
Solving the key problem: Initial conversion of a legacy data format to XML
– more general post-processing with XSLT/SAX/DOM
– necessary for being able to apply XML techniques
SDPL 2002 Notes 8: XML Wrapping 10
XW: Influences
XML Namespaces
– for separating XW commands and result elements
XML Schema
– description of alternative and repetitive structures (CHOICE, minoccurs, maxoccurs)
– data types of binary source data (string, byte, int, …)
XSLT
– template-based description of result documents
xmlns:xw="http://www.cs.uku.fi/XW/2001"
What does XW look like?
<xw:wrapper xw:name="phone-invoice" xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <invoice xw:starter="\^INVOICE" xw:maxoccurs="unbounded">
    <identifierdata ...>
      ...
    </identifierdata>
    <specification xw:starter="\^PHONE SPECIFICATION" ...>
      ...
    </specification>
    <invoicedata xw:starter="\^----------" ...>
      ...
    </invoicedata>
  </invoice>
</xw:wrapper>
SDPL 2002 Notes 8: XML Wrapping 12
XW-architecture (1)
[Figure: the XW-engine reads the source data and the wrapper description (<xw:wrapper … > … </xw:wrapper>) and produces the result document, which can be post-processed with XSLT, SAX or DOM]
Source data:
AA x1 x2
BB y1 y2
   z1 z2
Result document:
<part-a>
  <e1>x1</e1>
  <e2>x2</e2>
</part-a>
<part-b>
  <line-1>
    <d1>y1</d1>
    <d2>y2</d2>
  </line-1>
  <d3>z2</d3>
</part-b>
SDPL 2002 Notes 8: XML Wrapping 13
XW-architecture (2)
[Figure: the XW-engine reads the source data (AA x1 x2 / BB y1 y2 / z1 z2) and the wrapper description (<xw:wrapper … > … </xw:wrapper>) and reports the result as SAX events]
SAX events:
startElement(part-a, …)
startElement(e1, …)
characters("x1")
…
– to use as a program component
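The event stream above can be pictured with a small sketch. It uses Python's standard SAX classes rather than the Java SAX API of the prototype; the function `emit_part_a` is a hypothetical stand-in for the engine and simply reports the events for the `part-a` fragment to whatever handler the application supplies.

```python
import io
from xml.sax.saxutils import XMLGenerator


def emit_part_a(handler, x1, x2):
    """Report the part-a fragment as SAX events (hypothetical stand-in for the engine)."""
    handler.startDocument()
    handler.startElement("part-a", {})
    for name, value in (("e1", x1), ("e2", x2)):
        handler.startElement(name, {})
        handler.characters(value)
        handler.endElement(name)
    handler.endElement("part-a")
    handler.endDocument()


# Here the "application" is just a serializer that turns events back into text.
out = io.StringIO()
emit_part_a(XMLGenerator(out, encoding="utf-8"), "x1", "x2")
result = out.getvalue()
print(result)
```

Because the consumer only sees a handler interface, the same event source could feed a serializer, a DOM builder or application code.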
SDPL 2002 Notes 8: XML Wrapping 14
XW-architecture (3)
[Figure: the application and the XW-engine are connected through SAX; the XW-engine reads the source data (AA x1 x2 / BB y1 y2 / z1 z2) and the wrapper description (<xw:wrapper … > … </xw:wrapper>)]
Result document:
<part-a>
  <e1>x1</e1>
  <e2>x2</e2>
</part-a>
<part-b>
  <line-1>
    <d1>y1</d1>
    <d2>y2</d2>
  </line-1>
  <d3>z2</d3>
</part-b>
SDPL 2002 Notes 8: XML Wrapping 15
XW: Basic Ideas
Wrapper description ~ a grammar for source
Wrapping ~ parsing the source data
– split data into parts according to the description
– Result document = XML for the parse tree of the source
[Figure: parse tree of the source, with nodes the-whole …, part-X …, part-Y …, subpart-1 …, subpart-2 …]
SDPL 2002 Notes 8: XML Wrapping 16
XW Syntax
<xw:wrapper xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <invoice … >
    <identifierdata ...>
      ...
    </identifierdata>
    <specification ...>
      ...
    </specification>
  </invoice>
</xw:wrapper>
Splitting of source content into parts (-> elements)
SDPL 2002 Notes 8: XML Wrapping 17
Recognition of content parts (1)
by separators; For example:
<invoice xw:starter="\^INVOICE" …
<identifierdata xw:childterminator="\n" ...   (for sub-parts)
by position (within surrounding part):
<invoicenumber xw:position="53 64"/>
(Invoice number is in positions 53..64 of the first row of an identifierdata-part)
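The two recognition methods can be sketched as follows. This is not the XW engine; it assumes that positions are 1-based and inclusive and that a trailing child terminator does not start another sub-part. The helper names and the sample row are made up for illustration.

```python
def split_children(part, childterminator="\n"):
    """Split a part into sub-parts at each child terminator."""
    pieces = part.split(childterminator)
    if pieces and pieces[-1] == "":
        pieces.pop()  # nothing follows the last terminator
    return pieces


def by_position(row, first, last):
    """Return positions first..last of a row (assumed 1-based, inclusive)."""
    return row[first - 1:last]


# Hypothetical identifierdata part: the invoice number occupies columns 53..64.
identifierdata = "Customer 4711".ljust(52) + "INV-00012345" + "\n" + "second row\n"
rows = split_children(identifierdata)
invoicenumber = by_position(rows[0], 53, 64)
print(invoicenumber)
```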
SDPL 2002 Notes 8: XML Wrapping 18
Recognition of content parts (2)
In binary data by content data types; For example:
<xw:wrapper xw:sourcetype="binary" ...>
  <A xw:type="byte"/>
  <B xw:type="string" xw:stringLength="20"/>
  <C xw:type="int"/>
</xw:wrapper>
– Split input to a byte, a string of 20 characters, and an integer (-> elements A, B and C)
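A sketch of the same split using Python's `struct` module. The slides do not state byte order, integer width or character encoding, so big-endian order, a 4-byte signed int, ASCII text and stripping of padding are all assumptions here, as is the function name `wrap_record`.

```python
import struct
from xml.sax.saxutils import escape

# Assumed layout: 1 unsigned byte, 20 bytes of text, 4-byte signed int, big-endian.
RECORD = struct.Struct(">B20si")


def wrap_record(data):
    """Split one binary record into elements A, B and C."""
    a, b, c = RECORD.unpack(data)
    text = b.decode("ascii").rstrip(" \x00")  # assumed padding with spaces or NULs
    return f"<A>{a}</A><B>{escape(text)}</B><C>{c}</C>"


sample = RECORD.pack(7, b"hello".ljust(20), 1234)
wrapped = wrap_record(sample)
print(wrapped)
```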
SDPL 2002 Notes 8: XML Wrapping 19
Recognition of content parts (3)
Repetition:
<line xw:terminator="\n" xw:minoccurs="2" xw:maxoccurs="2"/>
– 2 rows -> 2 line elements
Alternative parts:
<xw:CHOICE xw:maxoccurs="unbounded">
  <A xw:starter="\^aa" xw:terminator="\n" />
  <B xw:starter="\^bb" xw:terminator="\n" />
</xw:CHOICE>
– arbitrary number (at least 1) of lines starting with "aa" or "bb" -> elements A or B
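The CHOICE example can be read as the loop sketched below. It assumes that `\^aa` means "aa at the beginning of a line", that the starter stays part of the element content, and that matching stops at the first line fitting neither alternative; `wrap_choice` is a hypothetical name.

```python
ALTERNATIVES = (("A", "aa"), ("B", "bb"))  # (element name, line starter)


def wrap_choice(text):
    """Map leading lines that start with 'aa' or 'bb' to A or B elements."""
    elements = []
    for line in text.splitlines():
        name = next((n for n, s in ALTERNATIVES if line.startswith(s)), None)
        if name is None:
            break  # no alternative can begin here, so the repetition ends
        elements.append((name, line))
    if not elements:
        # minoccurs defaults to 1 in this reading: at least one line is required
        raise ValueError("expected at least one line starting with 'aa' or 'bb'")
    return elements


parts = wrap_choice("aa first\nbb second\naa third\nzz other\n")
print(parts)
```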
SDPL 2002 Notes 8: XML Wrapping 20
XW: Modifying the structure of data
Limited modification possible:
– discarding parts of data
– collapsing levels of hierarchy
– adding levels of hierarchy
Not supported (yet):
– generating new data
– re-arranging existing data
SDPL 2002 Notes 8: XML Wrapping 21
Discarding parts of data
<spec xw:starter="SPEC" xw:childterminator="\n">
  <!-- Split the "SPEC" into rows: -->
  <!-- Ignore the first three rows: -->
  <xw:ignore xw:minoccurs="3" xw:maxoccurs="3" />
  . . .
</spec>
SDPL 2002 Notes 8: XML Wrapping 22
Collapsing hierarchy
<data xw:starter="START" xw:terminator="END" xw:childterminator="\n">
  <!-- 'data' is made of rows -->
  <xw:collapse>
    <date xw:position="5 14"/>
    <sum xw:position="16 21"/>
  </xw:collapse>
  . . .
</data>
SDPL 2002 Notes 8: XML Wrapping 23
Collapsing hierarchy (2)
– Split source data into parts according to specified separators
Source data:
START
17.8.1996 95.50
END
Structure so far:
<data>
  . . .
</data>
SDPL 2002 Notes 8: XML Wrapping 24
Collapsing hierarchy (3)
– split parts into sub-parts, according to sub-elements
<data>
  <xw:collapse>
    17.8.1996 95.50
  </xw:collapse>
  . . .
</data>
SDPL 2002 Notes 8: XML Wrapping 25
Collapsing hierarchy (4)
Intermediate structure:
<data>
  <xw:collapse>
    <date>17.8.1996</date>
    <sum>95.50</sum>
  </xw:collapse>
  . . .
</data>
Result document:
<data>
  <date>17.8.1996</date>
  <sum>95.50</sum>
  . . .
</data>
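The effect of collapsing can be sketched by building the result with and without a per-row element. The element name `row`, the 1-based inclusive reading of positions, the exact column layout of the sample row and the trimming of surrounding blanks are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def wrap_row(parent, row, collapse):
    """Add date and sum from one row, with or without a level for the row itself."""
    # "row" is a hypothetical element name for the non-collapsed variant.
    target = parent if collapse else ET.SubElement(parent, "row")
    ET.SubElement(target, "date").text = row[4:14].strip()  # positions 5..14
    ET.SubElement(target, "sum").text = row[15:21].strip()  # positions 16..21


# Assumed column layout: 4 blanks, the date, 2 blanks, the sum.
row = " " * 4 + "17.8.1996" + " " * 2 + "95.50"

collapsed = ET.Element("data")
wrap_row(collapsed, row, collapse=True)
nested = ET.Element("data")
wrap_row(nested, row, collapse=False)

collapsed_xml = ET.tostring(collapsed, encoding="unicode")
nested_xml = ET.tostring(nested, encoding="unicode")
print(collapsed_xml)
print(nested_xml)
```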
SDPL 2002 Notes 8: XML Wrapping 26
Adding levels of hierarchy
Example: Recognizing IP addresses in binary data
<xw:ELEMENT xw:name="IP-address">
  <a xw:type="byte"/>
  <b xw:type="byte"/>
  <c xw:type="byte"/>
  <d xw:type="byte"/>
</xw:ELEMENT>
SDPL 2002 Notes 8: XML Wrapping 27
Adding levels of hierarchy (2)
– Binary data = string of bytes
Source bytes:
193 167 232 253
Parts:
<a>193</a> <b>167</b> <c>232</c> <d>253</d>
Result:
<IP-address>
  <a>193</a>
  <b>167</b>
  <c>232</c>
  <d>253</d>
</IP-address>
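A minimal sketch of the same grouping, assuming each `byte` is rendered as an unsigned decimal number (as the values 193 and 253 suggest); the function `wrap_ip` is hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_ip(data):
    """Wrap four consecutive bytes as a, b, c, d inside an added IP-address element."""
    if len(data) != 4:
        raise ValueError("expected exactly four bytes")
    ip = ET.Element("IP-address")  # the extra level has no counterpart in the input
    for name, value in zip("abcd", data):
        ET.SubElement(ip, name).text = str(value)  # bytes iterate as ints 0..255
    return ET.tostring(ip, encoding="unicode")


ip_xml = wrap_ip(bytes([193, 167, 232, 253]))
print(ip_xml)
```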
SDPL 2002 Notes 8: XML Wrapping 28
Adding levels of hierarchy (3)
NB: an xw:ELEMENT does not correspond to parts of input data (like ordinary result elements do):
<!-- Wrap first two lines as INTRO: -->
<data xw:childterminator="\n">
  <xw:ELEMENT xw:name="INTRO">
    <!--parts recognised by childterminators:-->
    <xw:collapse />
    <xw:collapse />
  </xw:ELEMENT>
  …
</data>
SDPL 2002 Notes 8: XML Wrapping 29
XW: Implementation
Prototype implemented with Java
Apache Xerces 2.0.1 used as a SAX parser
– to read the wrapper description, which is represented internally as ..
a wrapper tree
– guides the parsing of source data
SDPL 2002 Notes 8: XML Wrapping 30
Wrapper Tree
Wrapper tree node
– corresponds to an element of wrapper description
– used for matching parts of source data
– includes sets S, B, E and F of strings
» computed from wrapper description
» S: element's own starter strings
» B: strings that can begin part of element
» E: strings that can end part of element
» F: strings that can follow the part of element
SDPL 2002 Notes 8: XML Wrapping 31
<xw:wrapper xw:name="Wrapper tree example"
            xw:sourcetype="text"
            xmlns:xw="http://www.cs.uku.fi/XW/2001">
  <doku xw:childterminator="\n" xw:terminator="$">
    <a xw:starter="\^A" xw:minoccurs="0"/>
    <b xw:starter="\^B" xw:minoccurs="0"/>
    <c xw:starter="\^C"/>
    <xw:CHOICE xw:minoccurs="0" xw:maxoccurs="unbounded">
      <d xw:starter="\^D"/>
      <e xw:starter="\^E"/>
    </xw:CHOICE>
  </doku>
</xw:wrapper>
Wrapper tree nodes:
doku       S: (empty)  B: \^A, \^B  E: $        F: (empty)
a          S: \^A      B: \^A       E: \n       F: \^B, \^C
b          S: \^B      B: \^B       E: \n       F: \^C
c          S: \^C      B: \^C       E: \n       F: \^D, \^E, $
xw:CHOICE  S: (empty)  B: \^D, \^E  E: (empty)  F: \^D, \^E, $
d          S: \^D      B: \^D       E: \n       F: \^D, \^E, $
e          S: \^E      B: \^E       E: \n       F: \^D, \^E, $
Source data:
Aaaa
Bbbb
Cccc
Eeee
Dddd
Dddd
SDPL 2002 Notes 8: XML Wrapping 32
Executing a wrapper (simplified)
Traverse the wrapper tree; In each node:
– scan input until the start of corresponding part found (member of set B)
– report startElement(…)
– Either
» process child nodes recursively, or
» report characters(…) for a leaf-level element
– scan input until the end of the part (using sets E and F)
– report endElement(…)
– if node iterative, and a string in B found, reprocess node
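The traversal can be approximated by a much-reduced sketch, which is not the prototype's algorithm. It assumes line-oriented input (child terminator `\n`), a single level of children, starters anchored at the beginning of a line, and a plain prefix test in place of the B, E and F sets; it does not skip unmatched input. The names `Node`, `pick` and `run` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

UNBOUNDED = float("inf")  # stands in for xw:maxoccurs="unbounded"


@dataclass
class Node:
    name: str                      # result element name, or "CHOICE"
    starter: Optional[str] = None  # text that begins the part's line
    minoccurs: int = 1
    maxoccurs: float = 1
    children: List["Node"] = field(default_factory=list)


def pick(node: Node, line: str) -> Optional[Node]:
    """Return the element that this line begins, or None (a stand-in for set B)."""
    if node.name == "CHOICE":
        for child in node.children:
            if line.startswith(child.starter):
                return child
        return None
    return node if line.startswith(node.starter) else None


def run(root: Node, lines: List[str]) -> List[Tuple[str, str]]:
    """Walk the children of root over the input lines and collect SAX-like events."""
    events = [("startElement", root.name)]
    pos = 0
    for node in root.children:
        count = 0
        # An iterative node is reprocessed while one of its parts can begin here.
        while count < node.maxoccurs and pos < len(lines):
            element = pick(node, lines[pos])
            if element is None:
                break
            events.append(("startElement", element.name))
            events.append(("characters", lines[pos]))  # leaf-level content
            events.append(("endElement", element.name))
            pos += 1
            count += 1
        if count < node.minoccurs:
            raise ValueError(f"missing part for {node.name}")
    events.append(("endElement", root.name))
    return events


doku = Node("doku", children=[
    Node("a", "A", minoccurs=0),
    Node("b", "B", minoccurs=0),
    Node("c", "C"),
    Node("CHOICE", minoccurs=0, maxoccurs=UNBOUNDED,
         children=[Node("d", "D"), Node("e", "E")]),
])
events = run(doku, ["Aaaa", "Bbbb", "Cccc", "Eeee", "Dddd", "Dddd"])
for event in events:
    print(event)
```

The tree and input mirror the wrapper tree example, so the reported elements follow the order of the source lines.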
SDPL 2002 Notes 8: XML Wrapping 33
Development status
Fall 2001: language designed from concrete examples
Spring 2002: Design of implementation principles, implementation
– wrapping of separator-based and positional text data implemented
– wrapping of binary data (and a few other details) unimplemented
SDPL 2002 Notes 8: XML Wrapping 34
XW: Possible extensions
Generation of attributes and data content
Re-arrangement of content
Describing recursive (unlimited nesting) source structures
=> recognizing LL(k) languages
(Usefulness for wrapping data formats?)
SDPL 2002 Notes 8: XML Wrapping 35
Summary
XW: a convenient "XML wrapper description language"
– for translating legacy data to XML
– declarative wrapper description
– easier than developing and maintaining ad-hoc conversion programs
– running prototype implementation