SDPL 2002 Notes 8: XML Wrapping 1

8 Translating Data to XML

How to translate existing data formats to XML?
– (and why?)

XW (XML Wrapper)
– an "XML wrapper description language"
– developed in the XRAKE project, Univ. of Kuopio, 2001–02
– Ek, Hakkarainen, Kilpeläinen, Kuikka, Penttonen: Describing XML Wrappers for Information Integration. In Proc. of XML Finland 2001, Tampere, Finland, Nov. 2001, 38–51.

SDPL 2002 Notes 8: XML Wrapping 2

XRAKE Project

"XML-rajapintojen kehittäminen" (Developing XML-based interfaces)

Studies the definition and implementation of XML-based interfaces, and their application in
– integration of heterogeneous data sources
– management of mass printing
– assembly and manipulation of electronic patient records

SDPL 2002 Notes 8: XML Wrapping 3

XRAKE - Support

National Technology Agency of Finland (TEKES) and seven local IT companies/organizations
– DEIO IS
– Enfo Group
– JSOP Interactive
– Kuopio University Hospital
– Medigroup
– SysOpen
– TietoEnator

SDPL 2002 Notes 8: XML Wrapping 4

XW: Motivation

XML-based protocols developed for e-business, medical messages, …

Legacy data formats need to be converted to XML
– How?

SDPL 2002 Notes 8: XML Wrapping 5

XML-wrapping

Need "XML-wrappers" (aka extractors)
– interface/conversion program to produce an XML representation for source data

[Figure: source1, source2 and source3 are read by wrapper1, wrapper2 and wrapper3, which produce XML-form-1 and XML-form-2]

SDPL 2002 Notes 8: XML Wrapping 6

How to wrap?

1. With an interface integrated to source
– E.g. XML-interfaces of database systems
– OK, if available

2. With an ad-hoc written translator
– E.g. JDBC+Java or separator-encoded text form + Perl
– OK; conversion possibly efficient
– Development and maintenance tedious :-(
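To make option 2 concrete, here is a minimal sketch of an ad-hoc translator for separator-encoded text. It is written in Python rather than the Perl or Java named on the slide, and the record layout (FIELDS), the ";" separator and the element names are hypothetical, chosen only for illustration.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical record layout; a real translator hard-codes its own source format.
FIELDS = ("name", "phone", "amount")

def translate(text, separator=";"):
    """Turn separator-encoded rows into a flat XML document (as a string)."""
    rows = csv.reader(io.StringIO(text), delimiter=separator)
    out = ["<records>"]
    for row in rows:
        # One element per field; escape() protects &, < and > in the data.
        cells = "".join(f"<{f}>{escape(v)}</{f}>" for f, v in zip(FIELDS, row))
        out.append(f"<record>{cells}</record>")
    out.append("</records>")
    return "".join(out)

xml_out = translate("Smith & Co;555-0100;12.50\n")
```

Every change in the source layout means editing the program, which is the maintenance burden the slide points at.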

SDPL 2002 Notes 8: XML Wrapping 7

How to wrap? (2)

3. Generic source-independent wrapping
– requires a file/message/report produced by the system
» normally should be available
– with a proper methodology, development and maintenance should become easier

=> Wrapper description language XW

SDPL 2002 Notes 8: XML Wrapping 8

XW (XML Wrapper)

XML-based, declarative wrapper description language

To convert from a
– textual or binary source
to XML form

SDPL 2002 Notes 8: XML Wrapping 9

XW: Design principles

A concise and natural XML syntax
– description of simple and typical conversion tasks should be simple

Solving the key problem: Initial conversion of a legacy data format to XML
– more general post-processing with XSLT/SAX/DOM
– necessary for being able to apply XML techniques

SDPL 2002 Notes 8: XML Wrapping 10

XW: Influences

XML Namespaces
– for separating XW commands and result elements

XML Schema
– description of alternative and repetitive structures (CHOICE, minoccurs, maxoccurs)
– data types of binary source data (string, byte, int, …)

XSLT
– template-based description of result documents

xmlns:xw="http://www.cs.uku.fi/XW/2001"

SDPL 2002 Notes 8: XML Wrapping 11

What does XW look like?

  <xw:wrapper xw:name="phone-invoice" xw:sourcetype="text"
      xmlns:xw="http://www.cs.uku.fi/XW/2001">
    <invoice xw:starter="\^INVOICE" xw:maxoccurs="unbounded">
      <identifierdata ...>
        ...
      </identifierdata>
      <specification xw:starter="\^PHONE SPECIFICATION" ...>
        ...
      </specification>
      <invoicedata xw:starter="\^--------------------" ...>
        ...
      </invoicedata>
    </invoice>
  </xw:wrapper>

SDPL 2002 Notes 8: XML Wrapping 12

XW-architecture (1)

[Figure: the XW-engine reads the source data and the wrapper description and produces a result document; post-processing with XSLT, SAX or DOM]

source data:
  AA x1 x2
  BB y1 y2
  z1 z2

wrapper description:
  <xw:wrapper … > … </xw:wrapper>

result document:
  <part-a>
    <e1>x1</e1>
    <e2>x2</e2>
  </part-a>
  <part-b>
    <line-1>
      <d1>y1</d1>
      <d2>y2</d2>
    </line-1>
    <d3>z2</d3>
  </part-b>

SDPL 2002 Notes 8: XML Wrapping 13

XW-architecture (2)
– to use as a program component

[Figure: the XW-engine reads the source data and the wrapper description and reports the result as SAX events]

source data:
  AA x1 x2
  BB y1 y2
  z1 z2

wrapper description:
  <xw:wrapper … > … </xw:wrapper>

SAX events:
  startElement(part-a, …)
  startElement(e1, …)
  characters("x1")
  …
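As an illustration of the SAX-event interface, the sketch below reports the part-a subtree to a standard SAX content handler. It uses Python's xml.sax classes rather than the Java prototype's API, and the function emit_part_a is a hypothetical stand-in for the engine, not part of XW.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

def emit_part_a(handler, x1, x2):
    """Report the part-a subtree as SAX events to any ContentHandler."""
    no_attrs = AttributesImpl({})
    handler.startDocument()
    handler.startElement("part-a", no_attrs)
    for name, text in (("e1", x1), ("e2", x2)):
        handler.startElement(name, no_attrs)
        handler.characters(text)
        handler.endElement(name)
    handler.endElement("part-a")
    handler.endDocument()

# XMLGenerator is a ContentHandler that serializes the events it receives,
# so the same event stream can feed an application or be written out as XML.
out = io.StringIO()
emit_part_a(XMLGenerator(out, encoding="utf-8"), "x1", "x2")
xml_text = out.getvalue()
```

Any other ContentHandler can be passed in place of XMLGenerator, which is what makes the engine usable as a program component.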

SDPL 2002 Notes 8: XML Wrapping 14

XW-architecture (3)

[Figure: an application connected to the XW-engine through SAX; the engine reads the source data and the wrapper description, and the result document is shown next to the application]

source data:
  AA x1 x2
  BB y1 y2
  z1 z2

wrapper description:
  <xw:wrapper … > … </xw:wrapper>

result document:
  <part-a>
    <e1>x1</e1>
    <e2>x2</e2>
  </part-a>
  <part-b>
    <line-1>
      <d1>y1</d1>
      <d2>y2</d2>
    </line-1>
    <d3>z2</d3>
  </part-b>

SDPL 2002 Notes 8: XML Wrapping 15

XW: Basic Ideas

Wrapper description ~ a grammar for source
Wrapping ~ parsing the source data
– split data into parts according to the description
– Result document = XML for the parse tree of the source

[Figure: parse tree with nodes the-whole …, part-X …, part-Y …, subpart-1 …, subpart-2 …, subpart-1 …]

SDPL 2002 Notes 8: XML Wrapping 16

XW Syntax

  <xw:wrapper xw:sourcetype="text"
      xmlns:xw="http://www.cs.uku.fi/XW/2001">
    <invoice … >
      <identifierdata ...>
        ...
      </identifierdata>
      <specification ...>
        ...
      </specification>
    </invoice>
  </xw:wrapper>

Splitting of source content into parts (-> elements)

SDPL 2002 Notes 8: XML Wrapping 17

Recognition of content parts (1)

by separators; For example:
  <invoice xw:starter="\^INVOICE" …
  <identifierdata xw:childterminator="\n" ...   (for sub-parts)

by position (within surrounding part):
  <invoicenumber xw:position="53 64"/>
  (Invoice number is in positions 53..64 of the first row of an identifierdata-part)
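A small sketch of the two recognition methods follows. It assumes that positions are 1-based and inclusive (the slide does not say), and the sample identifierdata rows are invented for illustration.

```python
def split_rows(part, child_terminator="\n"):
    """Split a part into sub-parts (rows) at each child terminator."""
    rows = part.split(child_terminator)
    if rows and rows[-1] == "":
        rows.pop()  # the last terminator closes a row, it does not open a new one
    return rows

def by_position(row, first, last):
    """Return columns first..last of a row (assumed 1-based, inclusive)."""
    return row[first - 1:last]

# Hypothetical identifierdata part: the invoice number occupies columns 53..64
# of the first row.
row1 = "CUSTOMER 4711".ljust(52) + "INV-00012345"
part = row1 + "\n" + "second row\n"
invoice_number = by_position(split_rows(part)[0], 53, 64)
```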

SDPL 2002 Notes 8: XML Wrapping 18

Recognition of content parts (2)

In binary data by content data types; For example:

  <xw:wrapper xw:sourcetype="binary" ...>
    <A xw:type="byte"/>
    <B xw:type="string" xw:stringLength="20"/>
    <C xw:type="int"/>
  </xw:wrapper>

– Split input to a byte, a string of 20 characters, and an integer (-> elements A, B and C)
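The binary split above can be sketched with a fixed record layout. The slide does not state the width or byte order of int or the character encoding of string; the sketch assumes a 4-byte big-endian signed integer and single-byte ASCII characters.

```python
import struct
from xml.sax.saxutils import escape

# Assumed layout: unsigned byte, 20-byte string, 4-byte big-endian signed int.
RECORD = struct.Struct(">B20si")

def wrap_record(data):
    """Split one binary record into the elements A, B and C."""
    a, b, c = RECORD.unpack(data[:RECORD.size])
    text = b.decode("ascii").rstrip("\x00 ")  # drop padding (an assumption)
    return f"<A>{a}</A><B>{escape(text)}</B><C>{c}</C>"

sample = RECORD.pack(7, b"hello".ljust(20), 1234)
result = wrap_record(sample)
```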

SDPL 2002 Notes 8: XML Wrapping 19

Recognition of content parts (3)

Repetition:
  <line xw:terminator="\n" xw:minoccurs="2" xw:maxoccurs="2"/>
– 2 rows -> 2 line elements

Alternative parts:
  <xw:CHOICE xw:maxoccurs="unbounded">
    <A xw:starter="\^aa" xw:terminator="\n" />
    <B xw:starter="\^bb" xw:terminator="\n" />
  </xw:CHOICE>
– arbitrary number (at least 1) of lines starting with "aa" or "bb" -> elements A or B
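The alternative-parts rule can be sketched as a loop that tries each alternative on every line. The sketch reads "\^" as "at the start of a line" and keeps the starter text inside the element content; both are assumptions about XW's behaviour.

```python
from xml.sax.saxutils import escape

# (element name, starter expected at the start of a line)
CHOICES = [("A", "aa"), ("B", "bb")]

def wrap_choice(text):
    """Wrap consecutive lines starting with 'aa' or 'bb' as A or B elements."""
    out = []
    for line in text.splitlines():
        for name, starter in CHOICES:
            if line.startswith(starter):
                out.append(f"<{name}>{escape(line)}</{name}>")
                break
        else:
            break  # no alternative matches: the repetition ends here
    if not out:
        # the default minimum of one occurrence is not met
        raise ValueError("xw:CHOICE needs at least one matching line")
    return "".join(out)

wrapped = wrap_choice("aa 1\nbb 2\naa 3\nzz\n")
```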

SDPL 2002 Notes 8: XML Wrapping 20

XW: Modifying the structure of data

Limited modification possible:
– discarding parts of data
– collapsing levels of hierarchy
– adding levels of hierarchy

Not supported (yet):
– generating new data
– re-arranging existing data

SDPL 2002 Notes 8: XML Wrapping 21

Discarding parts of data

  <spec xw:starter="SPEC" xw:childterminator="\n">
    <!-- Split the "SPEC" into rows: -->
    <!-- Ignore the first three rows: -->
    <xw:ignore xw:minoccurs="3" xw:maxoccurs="3" />
    . . .
  </spec>

SDPL 2002 Notes 8: XML Wrapping 22

Collapsing hierarchy

  <data xw:starter="START" xw:terminator="END" xw:childterminator="\n">
    <!-- 'data' is made of rows -->
    <xw:collapse>
      <date xw:position="5 14"/>
      <sum xw:position="16 21"/>
    </xw:collapse>
    . . .
  </data>

SDPL 2002 Notes 8: XML Wrapping 23

Collapsing hierarchy (2)

– Split source data into parts according to specified separators

source data:
  START
  17.8.1996 95.50
  END

result:
  <data>
    . . .
  </data>

SDPL 2002 Notes 8: XML Wrapping 24

Collapsing hierarchy (3)

– split parts into sub-parts, according to sub-elements

  <data>
    <xw:collapse>
      17.8.1996 95.50
    </xw:collapse>
    . . .
  </data>

SDPL 2002 Notes 8: XML Wrapping 25

Collapsing hierarchy (4)

  <data>
    <xw:collapse>
      <date>17.8.1996</date>
      <sum>95.50</sum>
    </xw:collapse>
    . . .
  </data>

->

  <data>
    <date>17.8.1996</date>
    <sum>95.50</sum>
    . . .
  </data>
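The collapse steps can be traced with a short sketch. It handles only the first row between START and END, assumes 1-based inclusive positions, and uses a sample row laid out so that the date falls in columns 5..14 and the sum in columns 16..21; the exact spacing of the slide's data is not recoverable.

```python
def wrap_data(source):
    """Wrap the first row between START and END, collapsing the row level."""
    rows = source.split("\n")
    # Step 1: the separators START and END delimit the 'data' part.
    body = rows[rows.index("START") + 1:rows.index("END")]
    row = body[0]
    # Step 2: split the row into sub-parts by position.
    date = row[4:14].strip()    # positions 5..14
    total = row[15:21].strip()  # positions 16..21
    # Step 3: xw:collapse means the row itself gets no element of its own;
    # its sub-parts become direct children of <data>.
    return f"<data><date>{date}</date><sum>{total}</sum></data>"

sample_row = " " * 4 + "17.8.1996".ljust(10) + " " + "95.50".rjust(6)
collapsed = wrap_data("START\n" + sample_row + "\nEND\n")
```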

SDPL 2002 Notes 8: XML Wrapping 26

Adding levels of hierarchy

Example: Recognizing IP addresses in binary data

  <xw:ELEMENT xw:name="IP-address">
    <a xw:type="byte"/>
    <b xw:type="byte"/>
    <c xw:type="byte"/>
    <d xw:type="byte"/>
  </xw:ELEMENT>

SDPL 2002 Notes 8: XML Wrapping 27

Adding levels of hierarchy (2)

– Binary data = string of bytes

bytes:
  193 167 232 253

elements:
  <a>193</a> <b>167</b> <c>232</c> <d>253</d>

result:
  <IP-address>
    <a>193</a>
    <b>167</b>
    <c>232</c>
    <d>253</d>
  </IP-address>
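The same steps in a few lines of code, as a sketch: four bytes become the elements a to d, and the extra IP-address level is added around them even though no separate piece of input corresponds to it. The function name wrap_ip is hypothetical.

```python
def wrap_ip(data):
    """Wrap the first four bytes as <a>..<d> inside an added <IP-address> level."""
    names = ("a", "b", "c", "d")
    # Iterating over a bytes object yields integers 0..255.
    children = "".join(f"<{n}>{byte}</{n}>" for n, byte in zip(names, data[:4]))
    return f"<IP-address>{children}</IP-address>"

ip_xml = wrap_ip(bytes([193, 167, 232, 253]))
```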

SDPL 2002 Notes 8: XML Wrapping 28

Adding levels of hierarchy (3)

NB: an xw:ELEMENT does not correspond to parts of input data (like ordinary result elements do):

  <!-- Wrap first two lines as INTRO: -->
  <data xw:childterminator="\n">
    <xw:ELEMENT xw:name="INTRO">
      <!-- parts recognised by childterminators: -->
      <xw:collapse />
      <xw:collapse />
    </xw:ELEMENT>
    …
  </data>

SDPL 2002 Notes 8: XML Wrapping 29

XW: Implementation

Prototype implemented with Java
Apache Xerces 2.0.1 used as a SAX parser
– to read the wrapper description, which is represented internally as a wrapper tree
– the wrapper tree guides the parsing of source data

SDPL 2002 Notes 8: XML Wrapping 30

Wrapper Tree

Wrapper tree node
– corresponds to an element of wrapper description
– used for matching parts of source data
– includes sets S, B, E and F of strings
» computed from wrapper description
» S: element's own starter strings
» B: strings that can begin part of element
» E: strings that can end part of element
» F: strings that can follow the part of element
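One possible in-memory shape for such a node is sketched below. The class and function names are hypothetical (the prototype is in Java and its classes are not shown in these notes), the sample sets are illustrative values, and "\^X" is read as "X at the start of a line".

```python
from dataclasses import dataclass, field

@dataclass
class WrapperNode:
    name: str
    S: frozenset = frozenset()  # element's own starter strings
    B: frozenset = frozenset()  # strings that can begin the part
    E: frozenset = frozenset()  # strings that can end the part
    F: frozenset = frozenset()  # strings that can follow the part
    children: list = field(default_factory=list)

def matches_line_start(strings, line):
    # A string written as \^X matches when the line begins with X.
    return any(s.startswith("\\^") and line.startswith(s[2:]) for s in strings)

# Illustrative node for an optional element started by "A" at a line start.
node = WrapperNode(
    "a",
    S=frozenset({r"\^A"}),
    B=frozenset({r"\^A"}),
    E=frozenset({"\n"}),
    F=frozenset({r"\^B", r"\^C"}),
)
```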

SDPL 2002 Notes 8: XML Wrapping 31

  <xw:wrapper xw:name="Wrapper tree example"
      xw:sourcetype="text"
      xmlns:xw="http://www.cs.uku.fi/XW/2001">
    <doku xw:childterminator="\n" xw:terminator="$">
      <a xw:starter="\^A" xw:minoccurs="0"/>
      <b xw:starter="\^B" xw:minoccurs="0"/>
      <c xw:starter="\^C"/>
      <xw:CHOICE xw:minoccurs="0" xw:maxoccurs="unbounded">
        <d xw:starter="\^D"/>
        <e xw:starter="\^E"/>
      </xw:CHOICE>
    </doku>
  </xw:wrapper>

Wrapper tree nodes:
  doku       S:      B: \^A, \^B   E: $    F:
  a          S: \^A  B: \^A        E: \n   F: \^B, \^C
  b          S: \^B  B: \^B        E: \n   F: \^C
  c          S: \^C  B: \^C        E: \n   F: \^D, \^E, $
  xw:CHOICE  S:      B: \^D, \^E   E:      F: \^D, \^E, $
  d          S: \^D  B: \^D        E: \n   F: \^D, \^E, $
  e          S: \^E  B: \^E        E: \n   F: \^D, \^E, $

source data:
  Aaaa
  Bbbb
  Cccc
  Eeee
  Dddd
  Dddd

SDPL 2002 Notes 8: XML Wrapping 32

Executing a wrapper (simplified)

Traverse the wrapper tree; In each node:
– scan input until the start of corresponding part found (member of set B)
– report startElement(…)
– Either
» process child nodes recursively, or
» report characters(…) for a leaf-level element
– scan input until the end of the part (using sets E and F)
– report endElement(…)
– if node iterative, and a string in B found, reprocess node
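The traversal can be tried out with a much-reduced, line-oriented sketch. It is not the prototype's algorithm: each leaf consumes exactly one line, a line is matched by a plain prefix test instead of the B, E and F sets, nothing is skipped while scanning, and the end of input stands in for the "$" terminator. The Node class and the events list are hypothetical stand-ins for the wrapper tree and the SAX handler.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    starter: str = ""        # text expected at the start of a line ("" = any)
    min_occurs: int = 1
    unbounded: bool = False  # True for maxoccurs="unbounded"
    choice: bool = False     # True for an xw:CHOICE-like node
    children: list = field(default_factory=list)

def begins(node, line):
    """Can `line` begin a part of `node`?"""
    if node.choice:
        return any(begins(c, line) for c in node.children)
    return line.startswith(node.starter)

def run_once(node, lines, pos, events):
    """Report one occurrence of `node`; return the position after it."""
    events.append(("startElement", node.name))
    if node.children:
        for child in node.children:        # process child nodes recursively
            pos = run(child, lines, pos, events)
    else:
        events.append(("characters", lines[pos]))  # leaf: one line of content
        pos += 1
    events.append(("endElement", node.name))
    return pos

def run(node, lines, pos, events):
    """Process all occurrences of `node` starting at lines[pos]."""
    count = 0
    while pos < len(lines) and begins(node, lines[pos]):
        if node.choice:                    # pick the matching alternative
            alt = next(c for c in node.children if begins(c, lines[pos]))
            pos = run_once(alt, lines, pos, events)
        else:
            pos = run_once(node, lines, pos, events)
        count += 1
        if not node.unbounded:             # only iterative nodes are reprocessed
            break
    if count < node.min_occurs:
        raise ValueError(f"missing part for <{node.name}>")
    return pos

doku = Node("doku", children=[
    Node("a", "A", min_occurs=0),
    Node("b", "B", min_occurs=0),
    Node("c", "C"),
    Node("xw:CHOICE", min_occurs=0, unbounded=True, choice=True,
         children=[Node("d", "D"), Node("e", "E")]),
])
events = []
end = run(doku, "Aaaa\nBbbb\nCccc\nEeee\nDddd\nDddd".split("\n"), 0, events)
```

The collected events list has the same shape as the SAX stream a handler would receive: a startElement and endElement pair per part, with characters events for leaf content.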

SDPL 2002 Notes 8: XML Wrapping 33

Development status

Fall 2001: language designed from concrete examples

Spring 2002: Design of implementation principles, implementation
– wrapping of separator-based and positional text data implemented
– wrapping of binary data (and a few other details) unimplemented

SDPL 2002 Notes 8: XML Wrapping 34

XW: Possible extensions

Generation of attributes and data content
Re-arrangement of content
Describing recursive (unlimited nesting) source structures
=> recognizing LL(k) languages
(Usefulness for wrapping data formats?)

SDPL 2002 Notes 8: XML Wrapping 35

Summary

XW: a convenient "XML wrapper description language"
– for translating legacy data to XML
– declarative wrapper description
– easier than developing and maintaining ad-hoc conversion programs
– running prototype implementation