xml and rdbmss (storage and publishing)

1

1

1

ICS-FORTH & Univ. of Crete Spring 2010

XML and RDBMSs(Storage and Publishing)

2


Where’s the XML Data?

ExportLegacy databases

ImportWarehouse XML data

?

WrapSOAP/CORBA/Java RMI

Business

Application

Logic

?

ViewMinimal result

2

2

3


Two XML - DBMS Problems

� XML documents need to be stored and retrieved in databases

�Exploit capabilities of RDBMS for efficient XML data processing

�Scalability, Availability, Performance, ...

�Need to support mapping schemes for shredding XML into relations

�Need to propagate constraints from XML to relations

� Data stored in databases need to be published in XML

�Most (structured) data stored in RDBMS

�Seamless integration of XML and Relational data

�Specification schemes for publishing needed

�Efficient publishing algorithms needed

� When not to use RDBMS

�Streaming data (RSS, SOAP), Information Retrieval (Google)

4


Storing XML Data

� Storage foundation of efficient XML query processing

�Flat streams: store XML data as is in text files

�fast for storing and retrieving whole documents

−query support: limited; concurrency control: no

�Native XML Databases: designed specifically for XML

�XML document stored as is

�Efficient support for XML queries

�Many techniques need to be re-developed

�Colonial Strategies: Re-use existing storage systems

�Leverage mature systems (RDBMS)

�Simple integration with legacy data

�Map XML document into underlying structures

E.g., shred document into flat tables

3

3

5


� A new XML type (like varchar, date, numeric)

� SQL / XML 2003, the first edition of the SQL/XML standard published by the ISO as part 14 of the SQL standard: ISO/IEC 9075-14:2003

�XML type restricted to

�XML document or

�XML element or

�Sequence of XML elements

� SQL / XML, 2nd edition (ISO/IEC 9075-14:2006)

�Full support of the XQuery Data Model

�XML(SEQUENCE), XML(ANY CONTENT), ...

� N.B. XML in different rows is incomparable

Commercial Systems -- XML Type in SQL

6


Example (SQL:2003)

create table books(

title varchar(20),

authors XML);

„D. Kossmann“Dual Buffering

<author>D. Chamberlin </author>

<author>D. Florescu</author>

<author>et al.</author>

XQuery 1.0

AuthorsAuthorsAuthorsAuthorsTitleTitleTitleTitle

No schema validation, no typing!

4

4

7


From XML to Relations

� Sophisticated query processing, storage and concurrency control

techniques have been developed for relational DBMS

�Thus, why not take advantage of available DBMS techniques?

� Store and query XML data using relational DBMS

�Derive a relational schema from an XML DTD (schema)

�Shred XML data into relational tuples – store XML data

�Translate XML queries to SQL queries

�Convert query results back to XML

RDB

query answer

store query translation

DBMS

XML

8


Architecture: XML Shredding

RDB

query answer

store

DBMS

XML

XML query parsing query rewriting

XML tagger Tagging results

query translation

Relational schemagenerator

XML documentshredder

5

5

9


Nontrivial Issues

� XML storage peculiarities

�Characteristics of XML data: optional elements & values, repetition, choice, inherent order, large text fragments, mixed content

�Mapping DTDs to relational schemas: data model mismatch

• DTD: recursive, regular expressions

• relational schema: tables

�Characteristics of XML queries: document order & structure, full-text search, transformation

�Query translation: query language mismatch

• XQuery, XSLT: Turing-complete

• SQL: first-order

� Information preservation:

�lossless: there should be an effective method to reconstruct the original XML document from its relational storage – the inverse of the storage mapping should exist

�propagation/preservation of integrity constraints – will be seen shortly

10


Derivation of a Relational Schema

� Lossless: the original document can be effectively reconstructed from its relational representation

� Query support: XML queries should be able to be rewritten to efficientrelational queries

� Normal form: BCNF, 3NF have proved extremely useful for integrity maintenance of updates -- will be addressed later

Relational schema generator

XML document shredder

6

6

11


Running Example: Book Document

DTD:

<!ELEMENT db (book*)>

<!ELEMENT book (title,authors*,chapter*,ref*)>

<!ELEMENT chapter (text | section)*>

<!ELEMENT ref book>

<!ELEMENT title #PCDATA>

<!ELEMENT author #PCDATA>

<!ELEMENT section #PCDATA>

<!ELEMENT text #PCDATA>

� Recursive

� Complex regular expressions

12


Graph Representation of the DTD

� Each element type/attribute is represented by a unique node

� Edges represent the subelement (and attribute) relations

� *: 0 or more occurrences of subelements

� Cycles indicate recursione.g., book

� Simplification: e.g., (text|section)*� text* | section* -- ignore order

� XML document conforming to the DTD are those trees that unfold the graph (special treatment: * � empty at leaf)

db

title

book

author chapter ref

****

**** ********

**** ****

sectiontext

7

7

13


Canonical Representation

� Store an XML document as a graph (tree)

�Node relation: node(nodeId,tag,type)

e.g., node(02,book,element), node(03,author,element)

�Edge relation: edge(sid,did)

sid, did: source and destination nodes; e.g., edge(02,03)

� Pros and cons

�Lossless: the original document can be reconstructed; order preserving:

one can add “order” information to the edge relation

�Flexible schema: easy schema evolution but ignores regularity in structure

�Supports heterogeneous elements with optional values

�Querying: requires multi-table joins or self joins for element reconstruction

�A simple query /db/book[author=“Bush”]/title requires 3

joins of the edge relation!

14


Schema-driven: the Shred Inlining Algorithm

� Require the availability of the document DTD

� Represent the DTD as a graph (simplifying

regular expressions)

� Traverse the DTD graph depth-first

�and create relations for the nodes

�the root

�each * node

�each recursive node

�each node of in-degree > 1

� Inlining: nodes with in-degree of 1

are inlined – no relation is created

db

title

book

author chapter ref

****

**** ********

**** ****

sectiontext

8

8

15


Relational Schema Example

� To preserve the semantics

�ID: each relation has an artificial ID (key)

�parentID: foreign key coding edge relation

�Column naming: path in the DTD graph

db(dbID)

book(bookID,parentID,code,title:string)

author(authorID,bookID,author:string)

chapter(chapterID,bookID)

ref(refID,bookID)

text(textID,chapterID,text:string)

section(sectionID,chapterID,section:string)

Note: title is inlined

db

title

book

author chapter ref

****

**** ********

**** ****

sectiontext

16


Keys and Foreign Keys

db(dbID)

book(bookID,parentID,code,title:string)

author(authorID,bookID,author:string)

chapter(chapterID,bookID)

ref(refID,bookID)

text(textID,chapterID,text:string)

section(sectionID,chapterID,section:string)

� Keys: book.bookID, author.authorID, …

book.parentID ⊆ db.dbID if code = 1

book.parentID ⊆ ref.refID if code = 0

� Foreign keys: referring to parent node

author.bookID ⊆ book.bookID, similarly for chapter, ref

text.chapterID ⊆ chapter.chapterID, similarly for section

db

title

book

author chapter ref

****

**** ********

**** ****

sectiontext

9

9

17


� Use associated DTD/XML Schema to decompose document

� Shred inlining:

� Rule of thumb: Inline as much as possible to minimize number ofjoins

� Shared: do not inline if shared, set-valued, recursive

� Hybrid: also inline if shared but not set-valued or recursive

� Querying: It supports a large class of XML queries commonly encountered

+ Fast lookup & reconstruction of inlined elements

- Reconstruction may require multi-table joins and unions

Schema-Driven: Summary

18


Schema-Driven: Summary

� Instance mapping can be easily derived from schema mapping

� Is it lossless?

�The order information is lost (simplification of regular expressions

defining element types)

� Is there anything missing?

�“core dumping” the entire document to a new database

In practice one often wants to

�select relevant data from the document

�store the selected data in an existing database of a predefined

schema

�XML Schema: type + constraints

What happens to XML constraints? Can we achieve normal forms

(BNCF, 3NF) for the relational storage?

10

10

19


Selective Mapping from XML to Relations: Example

� Existing relational database R :

book(id,title)

ref(id1,id2)

� Select from XML and store in R

�books with title containing “WMD”, and

�books cited, directly or indirectly

� Difference:

�select only part of the data from an input document

�store the data in an existing database with a fixed schema

RXML

existing DBbook documentSQL inserts

db

title

book

author chapter ref

****

**** ********

**** ****

sectiontext

20


Mapping Specification: XML2DB Mappings

�XML2DB Mapping:

�Input: an XML document T of a DTD D, and an existing database schema R

�Output: a list of SQL inserts ∆R, incrementing the database of R

�Rely on an extension of Attribute Grammars:

�treat the DTD D as an ECFG (extended context-free)

�associate semantic attributes and actions with each production rule

�attributes: passing data top-down

�actions: generate SQL inserts ∆R

�evaluation: generate SQL inserts in parallel with XML parsing

11

11

21


XML2DB Mappings

� DTD: normalized; element type definitions e → α

α ::= PCDATA | ε | e1, …, en | e1 + … + en | e*

� Relation variables: for each relation schema Ri, define a variable ∆Ri, which holds tuples to be inserted into Ri

� Attributes: $e associated with each element type e

$e: tuple-valued, to pass data value top-down

� Rules: associated with each e → α; conditional statements�for each e’ in α, define $e’ using the parent attribute $e�Increment relation variables: ∆Ri := ∆Ri U {tuple}

22


XML2DB Mapping: Attributes Example

� db → book*

$book := top:= top:= top:= top /* indicating the children of the root

db

book ...book book book

$book $book$book$book

db

title

book

author chapter ref

****

**** ********

**** ****

sectiontext

12

12

23


XML2DB Mapping: Semantic Actions Example

� book → title, author*, chapter*, ref*if text(title) contains “WMD” or ($book<>toptoptoptop and $book<>bot)then id:=gen_id( ); /* generate a new id */

∆book:=∆book U {(id,text(title))};/*increment book*/if $book <> top /* cited by another book */

then ∆ref:=∆ref U {($book,id)}; /*increment ref*/$ref := id; /* passing information downward */

else $ref := botrecall relation schema: book(id,title), ref(id1,id2)

� gen_id(): a function generating a fresh unique id

� conditional: either has “WMD” or is referenced by a

book of WMDdb

reftitle

“WMD”

book

chapterchapter ref

$book

$refrefrefref $refrefrefref

ref ...

24


Implementing XML2DB Mappings

XML

XML2DB

RDBparsing

SAX parsing: SQL insertsgeneration

SAX actions

SQL insertsexecution

SAX parsing extended with corresponding semantic actions

� startDocument( ), endDocument( );

� startElement(A, eventNo), endElement(A);

� text(s)

SQL updates:

insert into book

select *

from ∆∆∆∆bookbookbookbook

13

13

25


XML Constraints

� Recall: an XML schema consists of both types and constraints

�(//book,{isbn}) -- isbn is an (absolute) key of book

�(//book,(chapter,{number}) -- number is a key of chapter

relative to book

�(//book,(title,{ })) -- each book has a unique titledb

book bookbook book

title chapter

“XMLXMLXMLXML”

chapter

title section

“6”

number section

number text DTD

number

“1” number XPath

title

“XMLXMLXMLXML” number

“1”

number

“10”

isbn isbn

title

chapter chapter

26


Mapping from XML to a Predefined Relation

� One wants to store certain information from the XML document in:

RDB: chapter(bookTitle, chapterNum, chapterTitle)

�Mapping: for each book, extract its title, and the numbers and titles of all its chapters

�Predefined relational key: (bookTitle, chapterNum)

� Can the XML document be mapped to the relation without violating the key?

chapter chapter

db

book bookbook book

title chapter

“XMLXMLXMLXML”

chapter

titletitletitletitle section

“6”

numbernumbernumbernumber section

number text DTDDTDDTDDTD

number

“1111” number XPathXPathXPathXPath

title

“XMLXMLXMLXML” numbernumbernumbernumber

“1111”

number

“10”

isbn isbn

titletitletitletitle

14

14

27


A Safe Mapping

� Now change the relational schema to

RDB: chapter(isbn, chapterNum, chapterTitle)

�The relation can be populated without any violation. Why?

�The relational key (isbn, chapterNum) for chapter is implied (entailed) by the keys on the original XML data:

(//book,{isbn})

(//book,(chapter,{number})

(//book,(title,{ }))

chapter chapter

db

book bookbook book

title chapter

“XMLXMLXMLXML”

chapter


“6”



number


title


“1111”

number

“10”

isbn isbn


28


Why Do We Care About Constraints?

� Constraints are a fundamental part of the semantics of the data –mapping from XML to relations should not lose the information

� Relational constraints are important for query optimization, data cleaning, and consistency/integrity maintenance, . . .

� Constraints help us determine whether a relational schema for storing XML data makes sense or not

� Problem statement: Constraint Propagation

�Input: a set K of XML keys, a predefined relational schema S, a mapping f from XML to S, and a functional dependency FD over S

�Output: is FD implied by K via f? I.e., does FD hold over f(T) for any XML document T that satisfies K?

Note: XML schema/DTD is not required – K is the only semantics

15

15

29


Constraints Can Do Even Better

� One wants to find a “good” relational schema to store:

chapter(isbn,bookTitle,author,chapterNum,chapterTitle)

� What is a good schema? In normal form: BCNF, 3NF, …

�Prevent update anomaly (the relational theory)

�Efficient storage, …

� But how to find a normalized design?

chapter chapter

db

book bookbook book

title chapter

“XMLXMLXMLXML”

chapter


“6”



number


title


“1111”

number

“10”

isbn isbn


30


Constraint Propagation and Normalization

From given XML keys:

(//book,{isbn}), (//book,(chapter,{number}),

(//book,(title,{ }))

one can derive functional dependencies:

isbn → bookTitle,

isbn, chapterNum → chapterTitle

Normalize the relation by using these functional dependencies:

chapter(isbn,bookTitle,author,chapterNum,chapterTitle)

book(isbn, bookTitle),

chapter(isbn, chapterNum, chapterTitle),

author(isbn, author)

The new schema is in BCNF!

16

16

31


Computing Minimum Cover of Propagated FDs

� Problem statement: Minimum Cover of Propagated FDs

�Input: a set K of XML keys, and a mapping f from XML to S

�Output: a minimum cover F of all the functional dependencies (FDs) implied by the XML keys K via f

�F is a cover (a set of FDs): any FD implied by K via f is also implied by F

�F is minimum: F contains no redundant FDs, i.e., any FD in F is not entailed by other FDs in F

� Why not compute the complete set of FDs implied by K via f?

�Exponentially large!

� Why care about minimum cover?

�Normalize relational schema for storing/querying XML data!

Remark: again XML schema/DTD is not required – K is the only semantic specification

32


Research Issues

� What we know: for XML keys and a restricted transformation language

(from XML to relations), there exist efficient algorithms

�Constraint propagation: polynomial

�Minimum cover of propagated FDs: polynomial

� For general constraints/transformation languages: beyond reach

�Undecidable if the transformation language is relationally complete

(selection, projection, join, union, difference), even for XML keys

�Undecidable if both XML keys and foreign keys are considered,

even for the identity “transformation”

� Open: identify

�practical transformation languages

�practical XML constraints

that allow efficient constraint propagation

17

17

33


� Store XML as a BLOB in relational database�index by materializing indexed expressions in separate columns�Plus: store XML in parsed and validated form�Minus: proprietary solution (blob is a black box)�Minus: replicate data for indexing

� Model-driven Shredding (schema-agnostic or generic) (Florescu, Kossmann 99)�Edge, binary approach + alternatives �Corresponds to generic APIs for Java �Plus: very general; integrates well with relational data�Minus: poor performance

� Schema-based Shredding (or schema-specific) (Shanmugasundaram et al. 99)�Map XML Schema / DTD to SQL DDL�Plus: integrates well with relational data�Minus: missing tools, complicated

� SQL/XML (latest product developments after 2005)�Extend SQL with XML data type�Plus: integrates well with relational data�Minus: not clear how it integrates with application, odd „marriage“

XML-RDB Summary of Storage Approaches

34


XML Publishing

� An XML view definition language: specifying desired mapping

� Efficient implementation of the view language

DB DB

XML

Web

XML

Q: XML view

18

18

35


From Relations to XML Views

ActorActorActorActor(aidaidaidaid, lname, fname, lname, fname, lname, fname, lname, fname) :

{ <001, 001, 001, 001, ““““ViterelliViterelliViterelliViterelli””””, , , , ““““JoeJoeJoeJoe””””>,

…}

MovieMovieMovieMovie(midmidmidmid, title, year, title, year, title, year, title, year) :

{ <011, 011, 011, 011, ““““Analyze ThisAnalyze ThisAnalyze ThisAnalyze This””””, 1999, 1999, 1999, 1999>,

<032, 032, 032, 032, ““““See Spot RunSee Spot RunSee Spot RunSee Spot Run””””, 2001, 2001, 2001, 2001>

…}

AppearanceAppearanceAppearanceAppearance(midmidmidmid, , , , aidaidaidaid) :

{ <001, 011001, 011001, 011001, 011>, <001, 032001, 032001, 032001, 032>

…}

<Actor aid=001Actor aid=001Actor aid=001Actor aid=001><LnameLnameLnameLname>ViterelliViterelliViterelliViterelli</LnameLnameLnameLname><FnameFnameFnameFname>JoeJoeJoeJoe</FnameFnameFnameFname><Movie mid=011 year=Movie mid=011 year=Movie mid=011 year=Movie mid=011 year=““““1999199919991999””””> > > >

Analyze This Analyze This Analyze This Analyze This </MovieMovieMovieMovie><Movie mid=032 year=Movie mid=032 year=Movie mid=032 year=Movie mid=032 year=““““2001200120012001””””> > > >

See Spot Run See Spot Run See Spot Run See Spot Run </MovieMovieMovieMovie>

</ActorActorActorActor><Actor aid=002Actor aid=002Actor aid=002Actor aid=002>

<LnameLnameLnameLname>WinterWinterWinterWinter</LnameLnameLnameLname><FnameFnameFnameFname>AlexAlexAlexAlex</FnameFnameFnameFname><Movie year=Movie year=Movie year=Movie year=““““1988198819881988””””> Bill and Ted> Bill and Ted> Bill and Ted> Bill and Ted’’’’ssssExcellent Adventure Excellent Adventure Excellent Adventure Excellent Adventure </MovieMovieMovieMovie>

36


Commercial Systems -- Canonical Publishing

� Canonical publishing: the universal-relation approach

�Embedding single SQL query in XSL stylesheet

�Result: canonical XML representation (1:1) of published relations

� Systems:

�Oracle 10g XML SQL facilities

�DB2

�SQL Server 2005

� Incapable of expressing practical XML publishing: default fixed XML

document template

�Map SQL Types to XML Schema Types

�Encode special characters: e.g., „<“ -> „<“

�Provide functions to construct XML Data

19

19

37


<Actor>

<Lname>Viterelli</Lname>

<Fname>Joe</Fname>

<Movie year=“1999”> Analyze This

</Movie>

<Movie year=“2001”> See Spot Run

</Movie>

</Actor>

<Actor>

<Lname>Winter</Lname>

<Fname>Alex</Fname>…

<SQL_stmt>

SELECT A.lname,A.fname, M.year, M.title

FROM Movie M, Actor A, Appearance Ap

WHERE M.mid=Ap.mid AND M.aid=A.Aid

ORDER BY aid

</SQL_stmt>

A.lname A.fname M.year M.title

Viterelli Joe 1999 See Spot Run

Viterelli Joe 2001 Analyze This

<element_node Actor>

<element_node Lname>

<text_node>

<Column name=“A.lname”/>

</text_node>

…

IBM DB2 SQL Statement giving

big relation

+ Formatting Template annotated withcolumns of the universal relation

Canonical Publishing

38


Map SQL Types to XML Schema Types

Xsd:patternXsd:durationINTERVAL

Xsd:patternXsd:dateTimeTIMESTAMP

Xsd:patternXsd:timeTIME

Xsd:patternXsd:dateDATE

Xsd:doubleDOUBLE

Xsd:floatFLOAT

precision, scaleXsd:decimalNUMERIC, DEC.

min-maxInclusiveXsd:integerSMALLINT, ...

Xsd:maxLengthXsd:base64binaryBLOB

Xsd:maxLengthXsd:stringVARCHAR

Xsd:lengthXsd:stringCHAR

Xsd:booleanBOOLEAN

RestrictionsRestrictionsRestrictionsRestrictionsXML SchemaXML SchemaXML SchemaXML SchemaSQLSQLSQLSQL

20

20

39


Architecture: XML Publishing Middleware

Query

GeneratorRequest

SQL

Queries

???

View

Definition

Query-cost Estimates

Source Capabilities

Tagger

40


XPERANTO

� Commercial system: IBM DB2 XML extender, SQL/XML

� Middleware (vendor-independent): XPERANTO

� Extending SQL with XML constructors:

select XML-aggregation

from R1, . . ., Rn

where conditions

� XML constructors (XML-aggregation): functions

�Input: tables and XML trees (forest)

�Output: XML tree

21

21

41


XML Publishing with XPERANTO

Relational schema:

Extended SQL:

select XMLAGG(ACTOR(lname,fname,

select XMLAGG(MOVIE(title,year)

from Appearance Ap,Movie M

where Ap.aid = A.aid and Ap.mid = M.mid

group order by A.lname, A. fname ))

from Actor A

Actor (aid, lname, fname)

Appearance (mid, aid)

Movie (mid, title, year)

<<<<ActorActorActorActor>

<LnameLnameLnameLname>ViterelliViterelliViterelliViterelli</LnameLnameLnameLname>

<FnameFnameFnameFname>JoeJoeJoeJoe</FnameFnameFnameFname>

<Movie year=Movie year=Movie year=Movie year=““““1999199919991999””””> . . . > . . . > . . . > . . . </MovieMovieMovieMovie>

<Movie year=Movie year=Movie year=Movie year=““““2001200120012001””””> . . . > . . . > . . . > . . . </MovieMovieMovieMovie>

</ActorActorActorActor>

<ActorActorActorActor>

<LnameLnameLnameLname>WinterWinterWinterWinter</LnameLnameLnameLname>

<FnameFnameFnameFname>AlexAlexAlexAlex</FnameFnameFnameFname>

…………

42


XML Constructors

� Actor constructor:

create function ACTOR(lname:str, fname:str, mlist:XML)<Actor>

<Lname> {lname} </Lname> <Fname> {fname} </Fname>mlist

</Actor>

� Movie constructor (mlist):

create function MOVIE(title:str, year:int)

<Movie year=“{year}”> {title} </Movie>

� Verbose and cumbersome

�small document: tedious

�large documents: unthinkable

22

22

43


SilkRoute

� Annotated template: embedding SQL in a fixed XML tree

�Middleware: SilkRoute

�Commercial: SQL Server 2005, IBM DB2 DAD, XML/SQL

� Advantages:

�More `modular’ comparing to the universal relation approach

�Limited schema-driven: conforming to a fixed doc template

root

Q1 Q2Q3

Q4 Q5

tag1 tag2 tag3

tag5 tag6

44


� Data transformations based on inter-schema constraints

�Given two schemas S1 and S2, want to translate instances of S1 to instances of S2. Constraint-driven approach:

�Start with a set of referential integrity constraints (foreign keys) from S1 to S2

�Derive schema (logical) mapping from S1 to S2, by reasoning about the given integrity constraints (chasing technique)

�Derive mapping queries for translating instances of S1 to instances of S2

� Pros and cons:

�Generic system, and can be used for XML publishing

�Semi-automated

�Chasing may not terminates for XML DTD/Schema

�Recursive target schema (S2)? unclear

Clio

23

23

45


Getting Real: Data Exchange on the Web

� All members of a community (industry) agree on a DTD and

then exchange data w.r.t. it: e-commerce, health-care, ...

� XML Publishing:

�mapping relational data to XML

�conforming to the predefined DTD

DTD

DB DB

XML

Web

XML

Q: XML view

46


Data Exchange: Insurance Company and Hospital

Daily report:

Relational database R at the hospital:

Patient (SSN, name, tname, policy#, date)

inTreatment (tname, cost)

outTreatment (tname, referral#)

Procedure (tname1, tname2)

treatment

� in hospital: composition hierarchy in Procedure

� outside of the hospital: referral#

R XML

hospital insurance companyXML view

24

24

47


Data Exchange: Insurance Company and Hospital

� DTD D predefined by the insurance company:

report → patient*

patient → SSN, pname, treatment, policy#

treatment → tname, (inTreatment + outTreatment)

inTreatment → treatment*

outTreatment → referral#

� How to define a mapping σ such that for any instance DB of R,

�σ (DB) is an XML document containing all the patients and their

treatments (hierarchy, referral#) from DB,

�σ (DB) conforms to D?

48


Challenge: Recursive Type

� XML data: unbounded depth -- cannot be decided statically


inTreatment → treatment* --- recursive

SSN

“123”

treatment

inTreatment

...treatment

...

...report

patient patient patientpatient

treatmentname

tname

“Joe”

...

Policy#Policy#Policy#Policy#

<-- unbounded

25

25

49


Challenge: Non-Determinism

� The choice of a production (element type definition)

treatment → tname, (inTreatment ++++ outTreatment)-- depends on the underlining relational data

report

...patient patient patientpatient

...treatment treatment referral#

inTreatment

tnameoutTreatment

treatment <-- data-dependent

50


Existing Systems

� Fixed XML tree template or ignoring DTD-conformance

�middleware: SilkRoute (AT&T), XPERANTO (IBM),…

�systems: SQL Server 2005, IBM DB2 XML extender, …

incapable of coping with a predefined DTD (e.g. recursion)

� Type checking: define a view and then check its conformance

�undecidable in general, co-NEXPTIME for extremely restricted view

definitions

�no guidance on how to define XML views that typecheck

one gets an XML view that typechecks only after repeated failures and

with luck

26

26

51


� DTD: normalized; element type definitions e → α

α ::= PCDATA | ε | e1, …, en | e1 + … + en | e*

� Attributes: $e associated with each element type e

$e: tuple-valued, to pass data value as well as control

� Rules: associated with each e → α; for e’ in α, $e’ = Q($e)

�SQL query Q extracts data from DB

�parent attribute $e as a constant parameter in Q

ATG DTD

Semantic rulesattributes

=+

Schema-directed XML Publishing: Attribute Translation Grammar (ATG)

52


Semantics: Conceptual Evaluation

� Top-down

report → patient*

$patient ←←←← select SSN, name, tname, policy

from Patient --- SQL query

recall Patient (SSN, name, tname, policy#)

� Data-driven: a patient element for each tuple in Patient relation

report

patient ...patient patient patient

$patient $patient$patient$patient

27

27

53


Inherited attributes

Inherited: $child is computed using $parent

� patient → SSN, name, treatment, policy#

$SSN=$patient.SSN, $name=$patient.name

$treatment=$patient.tname $policy#=$patient.policy

recall $patient = (SSN, name, tname, policy)

� SSN → PCDATA

$PCDATA = $SSN report

SSN

“123”

patient ...patient patient patient

treatmentname

“Joe”

policy#

“LU23”

$patient

$SSNSSNSSNSSN

$PCDATAPCDATAPCDATAPCDATA

54


Coping with Non-Determinism


$tname = $treatment

($inTreatment, $outTreatment) =

case Qc($treatment).tag --- conditional query

1: ($treatment, null)

else: (null, $treatment)

Qc: selectselectselectselect 1 as tag fromfromfromfrom inTreatment wherewherewherewhere tname= $treatment

� conditional query: the choice of production

� $parent as constant parameter in SQL query

inTreatmenttname

treatment

outTreatment

treatment

tname

...Qc($treatment)=nullQc($treatment)=1

28

28

55


Coping with Recursion

inTreatment → treatment*

$treatment ← select tname2

from Procedure

where tname1 = $inTreatment

recall Procedure (tname1, tname2)

� $parent as constant parameter in SQL query Q

� inTreatment is further expanded as long as Q(DB) is nonempty

inTreatment

treatment treatment

treatment

tname

...$treatment $treatment

56


DTD-Directed Publishing with ATGs

� DTD-directed: the XML tree is constructed strictly following the

productions of a DTD --- DTD conformance

� Data-driven: the choice of productions and expansion of the XML tree

(recursion) depends on relational data

treatment

inTreatment

...treatment

...

...report

patient patient patientpatient

treatmentname

tname“Joe”

...

policy#policy#policy#policy#SSN

“123”

<-- unbounded

data-dependent

29

29

57


ATGs vs. existing systems

� DTD-conformance:

�ATGs: provide guidance for how to define DTD-directed publishing

�Other systems: based on a fixed tree template

� Expressive power: strictly more expressive than others

�ATGs: capable of expressing XML views supported by other

systems

�Other systems: cannot handle recursion/nondeterminism

ATG: the first systematic method to ensure DTD-conformance

58


ATGs vs. Attribute Grammars (AGs)

�AGs:

�Definition: w.r.t. a CFG

�Evaluation: parse a string with the CFG, then evaluate attributes

given the parse tree

�ATGs: combining DTD and database operations

�Definition: w.r.t. an ECFG (DTD) and SQL queries

�Evaluation: given DB, extract relevant data from DB with SQL

queries to build an XML tree of the DTD

It does not make sense to “parse” a database w.r.t. a DTD

ATGs are not a mild variation of AGs

30

30

59


PRATA: Middleware based on ATGs

XMLATG

R

parsing

query plan generation, evaluation

relationstagging

ATG graph

cost estimate

query

statistics

query results

� ATG graph: representing the ATG

� relations: representing root-leaf paths --->

� tagging: one pass

DCBA

60


Evaluation of ATGs

� Conceptual evaluation: defining semantics, but not efficient

� Techniques proved useful

�partitioning: reduce db visits, make use of DBMS optimizer

(handling annotated templates – complete tree)

�Relations representing XML trees: root-leaf paths

�Separate tagging

root

Q1 Q2Q3

tag1 tag3

Q4 Q5

tag2

tag5 tag6

partitioning

31

31

61


ATG graph

inTreatment outTreatment

referral#

report

patient

nameSSN treatment

tname

policy#

Q1

Q3

Q2 Q5

Q4

*

Q7Qc

procedure

Q6 Q9

*

Q8

� edge labels: SQL queries

� cyclic: introduced by recursion -- not in other systems

� iterative unfolding: partial ATG tree to a depth d

62


Partitioning partial ATG trees

� query composition (outer join)

� challenge: how to partition into clusters?

�finer ⇒ fewer nulls, smaller queries sent to DBMS

�coarser ⇒ leverage DBMS optimizer, less DB visits

Finding an optimal strategy: NP-hard

...

report

name

treatment

patient

SSN policy#

Q1

Q3Q2 Q5

Q4

*

*

inTreatment outTreatment

referral#

tnameQ7

Qc

procedureQ6 Q9

Q8

treatment

composed Q’

composed Q

32

32

63


Materialization of Intermediate Results

Partitioning: root-leaf relation R⇒ intermediate R1, R2, R3, R4

� repeated computation: each Ri carrying a key from the root

�materialization: storing keys in temporary tables

�cons: overhead with temporary tables

�pros: reducing recomputation in descendant queries

What keys to materialize? Not supported by other systems

R1

R2 R3

P1

P2P3

P4

output relation R

R4

sort merge

sort merge key

key

materializing keys in R1

materializing keys in R3

64


Query Plan Generation

� New challenge: find “best” partitioning and materialization

�mutually dependent: partitioning and materialization must be taken

together

�expensive: exponentially many choices – NP-hard

� Efficient cost-based heuristic:

�DBMS statistics: arity/cardinality of a table, cost to evaluate a

query/create temporary table

�cost estimate: simple formulae for computing the benefit of

materialization/partitioning -- effective in practice

�“optimal strategy” -- dynamic programming: decides partitioning

w.r.t. the benefit of materialization

33

33

65


ATG Summary

� ATGs:

�ensure DTD-conformance automatically

�support recursion/non-determinism naturally

�simple: no need for a new language – knowing DTD + SQL is all

that one needs for writing an ATG

� PRATA: middleware based on ATGs

�novel technique of combining partitioning and materialization

�a practical solution for DTD-directed publishing

ATG: the first systematic way for DTD-directed publishing

66


Extension: Capturing Integrity Constraints

� XML schema:

�Type (DTD)

�Integrity constraints: keys, foreign keys

� Schema-directed XML publishing: automatically guarantee that the

target document both conforms to the type and satisfies the constraints

-- in a single framework

Challenge: consistency analyses undecidable

DB

XML: ATG+DTD constraints

34

34

67


Schema-directed XML to XML transformations

� Mapping between XML documents

� XTG: XML Transformation Grammar

�DTD augmented with XQuery

�Constraint-compilation

XML XML

DTD

constraints

transformation

XTG

68


� Why publish XML data? What is the major difficulty?

� One can publish relational data via an XQuery view. How to do it?

What are the pros and cons of this approach?

� What is schema-directed publishing? Why do we need it?

� Why does ATG automatically ensure DTD-conformance?

Publishing Summary

35

35

69


Commercial Systems: MS SQL Server 2005

� CLOB (character large objects)

�Annotated schema (XSD): fixed tree templates�Non-recursive schema�associate elements and attributes with table and column names�core-dumping entire elements or documents�Given a relational database, XSD populates an XML elements/attributes with corresponding tuples/columns

� Data loading: �Combine INSERT and node( ), value( ) (OPENXML), XPath�OPENXML: access to XML data as a relational rowset�selective shredding, limited recursion, can’t store the entire document in a single pass

� Query support:

�SQL and SQL extensions for publishing results as XML (For-Xml clause)

�An extension of SQL with an FOR-XML construct

�Nested FOR-XML to construct XML documents

70


Commercial Systems: MS SQL Server 2005

� Summary:

�Support both entire-document and selective shredding

�Able to increment existing tables

�Nonrecursive schema

�Does not support context-dependent tuple construction: to

construct (A, B), one cannot extract attribute B based on the

extracted value of A (inf. passing)

�Does not support entire document shredding and selective

shredding in a uniform framework

�incapable of supporting schema-directed publishing

�can’t define recursive XML views (bounded recursion depth)

36

36

71


Microsoft SQL Server 2005

XML ParserXML ParserXMLXML

ValidationValidation

XML XML datatypedatatype

(binary XML)(binary XML)

SchemaSchema

CollectionCollection

XMLXML RelationalRelational

XML SchemataXML Schemata

OpenXMLOpenXML/nodes()/nodes()

FOR XML with FOR XML with

TYPE directiveTYPE directive

RowsetsRowsets

query()query()

modify()modify()Node Node

TableTable

PATH PATH

Index Index

PROP PROP

Index Index

VALUE VALUE

Index Index

PRIMARYPRIMARY

XML INDEXXML INDEX

query()query()

72


Microsoft SQL Server 2005: Indexing

� Create XML index on XML column

CREATE PRIMARY XML INDEX idx_1 ON docs (xDoc)

� Creates secondary indexes on tags, values, paths

� Speeds up queries

�Results can be served directly from index

�Entire query is optimized

�Same award winning cost based optimizer

�Indexes are used as available

� Create XML index on XML column

CREATE PRIMARY XML INDEX idx_1 ON docs (xDoc)

� Creates secondary indexes on tags, values, paths

� Speeds up queries

�Results can be served directly from index

�Entire query is optimized

�Same award winning cost based optimizer

�Indexes are used as available

37

37

73


Commercial System: IBM DB2 XML Extender

� XML Columns: CLOBs + side tables for indexing individual elements

� User-defined mapping through DAD (Document Access Definition): a fixed XML tree template (non-recursive)

�SQL mapping: template-based language to publish relational data as XML (to be discussed in the next lecture)

�RDB node mapping: for both publishing and storing XML (associating element and attributes with tables and columns)

� XML Collections: Declarative decomposition of XML into multiple tables

�Data loading: follows DAD mapping

� SQL/XML: an extension of SQL with XML constructors (XMLAGG, XMLELEMENT, etc) as discussed earlier

74


Commercial System: IBM DB2 XML Extender

� Summary:

�Support both entire-document and selective shredding

�Support recursive schema

�cannot increment existing tables (DB2)

�does not support context-dependent tuple construction

�incapable of supporting schema-directed publishing

�can’t define recursive XML views

38

38

75


IBM DB2 XQuery Support

Hybrid SQL/XQueryCompiler

XML Navigation

Query Evaluation Run-time

SQL/XML Parser XQuery Parser

Table Storage XML Indexes XML Storage

Native XML storage

High-performanceXPath processing

Unified internal

representation

XML node-level,text indexes

SQL extendedfor XML

Dynamic function dispatch

XML data typein SQL

XQuery top-levelparser

LanguagesCompose

76


Commercial System: Oracle 10g XML DB

� Store XML data in CLOB (character large objects) or tables

�Canonical mapping into object-relational tables

�tag names are mapped to column names

�elements with text-only map to scalar columns

�elements with sub-elements map to object types

�list of elements maps to collections

�Indexing: standard relational

�Core-dumping entire elements

�Annotated schema: recursive

� Publish relational data via DBMS_XMLGEN, a PL/SQL package

�Supports recursive XML view definition (via linear recursion of SQL’99)

� does not support schema-directed XML publishing

39

39

77


Commercial System: XQuery Support

� One of the first relational database to ship an XQuery implementation

�Supports XMLQuery and XMLTable construct

�Native compilation into SQL /XML structures

�Returns XMLType(Content)

�Can query over relational, O-R, XMLType data

�fn:doc - Maps to XDB Repository on server

�SQLPlus provides xquery command to execute XQuery

�XSL-T will also get compiled to XQuery

� Other query support:

�CLOBs: SQL + Oracle Text; XPath

�Canonical: SQL

78


Oracle Database 10gr2 XQuery Support

XQuery XSL-T Parser

Compiler Rewrite to SQLX

Normalization

XQuery Type checkSQLX rewrite

Execution engine

XQueryX

Compiled XQuery Tree

SQL MetadataXMLSchemaRepository

Statically Type checked Tree

Normalized Tree (casts, treat )

SQL/XML Operand Tree

Relational Optimizer

SQL Operand Tree

XML Indexes, Text Indexes

XQuery F&O

Execution Structures

SQL

XQUERY

40

40

79


Commercial Systems: Summary

� Data design:

�CLOBs

�Fixed canonical mappings

�Mappings in terms of annotated schema

� Querying:

�SQL as the main access method to XML documents “XML-aware”extensions to SQL

�E.g., Limited XPath navigation syntax ongoing Xquery support

� No support for

�context-dependent tuple construction

�recursive schema (Microsoft, IBM DB2)

�incrementing existing tables (Oracle)

�constraint propagation

80


Native Techniques

� Built from scratch

� NatiX (University of Mannheim, Germany)

� Xyleme (France)

� Xindice (Apache – open source)

� Re-tool existing systems to handle XML

� Tamino: hierarchical database (ADABAS)

� Excelon: OODB

� Design efficient data structures for compact storage and fast access; data partitioning; indexing on both values and structure

41

41

81


NatiX

� Unit of storage = element

� Elements clustered to minimize page hits

� Inter-element pointers capture document structure

� Low-level algorithmic support for read/write/insert/delete operations

� No use of DTDs or XML Schema

82


Xyleme

� Data layout: based on NatiX

� Indexing: sophisticated indexing of text and elements

� Query support: XPATH, XQuery, updates

� A data warehouse for XML content: store, classify, index, integrate, query and monitor massive volumes of XML content

� Semantic services: extensible thesauri and schema mappers that enable the system to go beyond simple indexing

42

42

83


Xyleme/Natix

84


eXcelon XIS

� Extends Object Store – an object-oriented database

� Data Layout: stores parsed nodes (accessible through DOM interface)

� Indexing: value indexes for strings and numbers; text indexes; structural indexes

� Query Support: DOM, XSLT, XPath, XQuery, updates

� Other features:

� Node-level management

� Data is stored in a pre-parsed format: only data objects needed for an operation are loaded into memory

� Create, add, delete, update elements directly

� Handles arbitrary XML documents - without Schema or DTD

� But can enforce schemas if necessary

� Triggers; transactions; distributed caching mechanism

43

43

85


Software A/G Tamino

� Extends Adabas – nested relations

� Indexing: value and structure

� Query support:

� Full-text search operators

� Queries return entire document or some projection of document

� No construction of new XML values (Ex: XQL)

� Access control at the node level, transactions; multi-media; triggers; backup/restore; compression; support for multi-media documents, e.g., video

86


XQuery Java implementation

� XQuery or XQueryX input

� Extensible function implementation

� Compiles into rowsource like structures

� Optimization – push XQuery to XMLDB

� XQJ API driver – for accessing mid tier/backend

� Shared data model with XSL/XPath

� Shared F&O – pre-defined & external

�Standard Function implementation interfaces

�Write Java func once – use it in XQuery/XSLT

44

44

87


DB Adapter

Java Execution engine

Java Layer Architecture

XQJ

API

Dr i ver

XQueryXQueryXXPath

XQJ API (Java)

conn = datasrc.getConnection();expr = conn.prepareExpression(

‘for $i in doc(“xxx”) return $i’);res = expr.executeQuery();while (res.next()) { … }

Push down query

select *from XMLTable(‘for $i in doc(“xx”) return $i);

User

XMLDB

SQL + XQuery orXQueryX

XQuery aware SQL engine

Rewrite to SQLX

Normalization

XQuery Type check

SQL

Compiler

XMLQuery, XMLTable

SQL/XML Rel optimizer, Execution engine

Compiler

XQuery Parser

SQLX rewrite

Query Pushdown

Normalization

Type check

Compiler

Parser

XDS data sources, files

Push down

XQuery Java Engine

88


References

� Serge Abiteboul,Sophie Cluet,Tova Milo: Querying and Updating the File. VLDB 1993

� D. Barbosa,A. Barta,A. Mendelzon,G. Mihaila, F. Rizzolo, P. Rodriguez-Gianolli: ToX – The Toronto XML Engine, International Workshop on Information Integration on the Web, Rio de Janeiro, 2001.

� Phil Bohannon, Juliana Freire, Prasan Roy, Jérôme Siméon: From XML Schema to Relations: A cost-based Approach to XML Storage. ICDE 2002

� Michael J. Carey,Jerry Kiernan, Jayavel Shanmugasundaram, Eugene J. Shekita, Subbu N. Subramanian: XPERANTO: Middleware for Publishing Object-Relational Data as XML Documents.VLDB 2000

� Qiming Chen, Yahiko Kambayashi: Nested Relation Based Database Knowledge Representation. SIGMOD Conference 1991

� Vassilis Christophides, Sophie Cluet, Jérôme Siméon: On Wrapping Query Languages and Efficient XML Integration. SIGMOD Conference 2000: 141-152

� Alin Deutsch, Mary F. Fernandez, Dan Suciu: Storing Semistructured Data with STORED. SIGMOD Conference 1999

� A. Eisenberg, J. Melton: Advancements in SQL /XML. ACM SIGMOD Record, 2004.

45

45

89


References

� Daniela Florescu, Donald Kossman: A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. IEEE Data Eng. Bulletin 1999

� Minos N. Garofalakis, Aristides Gionis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: XTRACT: A System for Extracting Document Type Descriptors from XML Documents. SIGMOD Conference 2000

� Roy Goldman, Jennifer Widom: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997

� P.J. Marron, G. Lausen: On Processing XML in LDAP, VLDB 2001� Carl-Christian Kanne, Guido Moerkotte: Efficient Storage of XML Data.Technical Report 8/99, University of Mannheim, 1999

� Feng Tian, David J. DeWitt, Jianjun Chen, and Chun Zhang: The Design and Performance Evaluation of Various XML Storage Strategies, Technical report, University of Wisconsin

� Masatoshi Yoshikawa, Takeyuki Shimura, Shunsuke Uemura: XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases

� Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, Guy M. Lohman: On Supporting Containment Queries in Relational Database Management Systems. SIGMOD 2001

� Justin Zobel, James A. Thom,Ron Sacks-Davis: Efficiency of Nested Relational Document Database Systems. VLDB 1991