Download - XML in Relational DBcstyng/webdb.07/lectures/lesson5.pdfXML Data Storage - Relational DB 5 Structure-mapping approach • Design database schema based on DTD • Different schemas

XML Data Storage - Relational DB 1

XML in Relational DB


A Sample File<SigmodRecord>

<issue><volume>11</volume><number>1</number><articles>

<article><title>Annotated Bibliography on Data Design.</title><initPage>45</initPage><endPage>77</endPage><authors>

<author position="00">Anthony I. Wasserman</author><author position="01">Karen Botnich</author>

</authors></article>

</articles></issue></SigmodRecord>


Corresponding Document TreeRoot

1

2

7 8 9

42

208207 209 210

10481047

5232 5237Anthony I

Wasserman

11 1

Annotated Bibliographyon Data Design. 45 77

KarenBotnich

SigmodRecord

issue

number articlesvolume

article

endPage authorsinitPagetitle

author author

@position@position

00 01

element node

text node

attribute node


Two Mapping Approaches

• Structure-mapping approach– static DTD– well-defined semantic constraints

• Model-mapping approach– dynamic DTD– ambiguous semantic constraints


Structure-mapping approach

• Design database schema based on DTD

• Different schemas for different DTDs

• Slides partially from a VLDB 2002 Tutorial

Relational DB

DTD

XML Doc

XMLQuery

XML


A Sample DTD<!ELEMENT book (booktitle, author)<!ELEMENT booktitle (#PCDATA)><!ELEMENT author (name, address)>

<!ATTLIST author id ID #REQUIRED>

<!ELEMENT name (firstname?, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address ANY><!ELEMENT article (title, author*, contactauthor?)><!ELEMENT title (#PCDATA)><!ELEMENT contactauthor EMPTY>

<!ATTLIST contactauthor authorID IDREF IMPLIED><!ELEMENT monograph (title, author, editor)><!ELEMENT editor (monograph*)>

<!ATTLIST editor name CDATA #REQUIRED>


A DTD Graphbook

booktitle

author

monograph

title

contactauthor

authorID

editor

*

nameaddress

?

firstname lastname

?

authorid

article

*

name

12 elements


Basic Inlining• An element graph is created for each element in the DTD

graph• Relations created for:

– Root of an element graph– Set value children (under a *)– Nodes with a back pointer (recursion)

• All other descendants inlined• Every element is made into as comprehensive a relation as

possible• Set-valued attributes are handled with separate relations

and foreign keys


Basic Inlining• An element graph is constructed by

– Do a DFS traversal of DTD graph, starting at element node for which we are constructing relations

– Each node is marked as “visited” the first time it is reached and is unmarked once all its children have been traversed

– If an unmarked node in DTD graph is reach during DFS, a new node bearing the same name is created in the element graph

– A regular edge is created from the most recently created node in the element graph with the same names as the DFS parent of the current DTD node to newly created node

– If an attempt is made to traverse an already marked DTD, then a backpointer edge is added from the most recently created node in the element graph to the most recently created node in the element graph of the same name as the marked DTD node

editor

name


Basic Inlining

• A relation is created for root of element of graph

• All element’s descendents are inlined into that relation except– Children below a “*” node

are made into separate relations – this corresponds to creating a new relation for a set-valued child

– Each node having a backpointer edge pointing to it is made into a separate relation


Basic: Relational Schemabook (bookID: integer, book.booktitle : string, book.author.name.firstname: string, book.author.name.lastname: string,

book.author.address: string, author.authorid: string)booktitle (booktitleID: integer, booktitle: string)article (articleID: integer, article.contactauthor.authorid: string, article.title: string)article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string,

article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)contactauthor (contactauthorID: integer, contactauthor.authorid: string)title (titleID: integer, title: string)monograph (monographID: integer, monograph.parentID: integer, monograph.title: string,

monograph.editor.name: string, monograph.author.name.firstname: string, monograph.author.name.lastname: string, monograph.author.address: string, monograph.author.authorid: string)

editor (editorID: integer, editor.parentID: integer, editor.name: string)editor.monograph (editor.monographID: integer, editor.monograph.parentID: integer, editor.monograph.title: string,

editor.monograph.author.name.firstname: string, editor.monograph.author.name.lastname: string, editor.monograph.author.address: string, editor.monograph.author.authorid: string)

author (authorID: integer, author.name.firstname: string, author.name.lastname: string, author.address: string,aauthor.authorid: string)

name (nameID: integer, name.firstname: string, name.lastname: string)firstname (firstnameID: integer, firstname: string)lastname (lastnameID: integer, lastname: string)address (addressID: integer, address: string)


Fragmentation Problem<!ELEMENT author (name, address)><!ATTLIST author id ID #REQUIRED><!ELEMENT name (firstname?, lastname)>

<!ELEMENT firstname (#PCDATA)>

<!ELEMENT lastname (#PCDATA)>

<!ELEMENT address ANY>

author (authorID: integer, id: string)

name (nameID: integer, authorID: integer)

firstname (firstnameID: integer, nameID: integer, value: string)

lastname (lastnameID: integer, nameID: integer, value: string)

address (addressID: integer, authorID: integer, value: string)

Each element is a table


Problems in Basic In-lining• Creating a large amount of relations

– Separate relational schema for each element as root (minor)

– Unrolling recursive strongly connected components (major)

• Consuming great storage space when where are many null able(optional) attributes

• Too many joins– E.g. author becomes scattered (i.e., it appears

once for each possible instantiation from a root). Query: list all the books authored by Louis


Complexity of DTDs• DTD element specifications can be of

arbitrary complexity• <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!• Fortunately, can simplify DTD for

translation purposes:– Key observations: not necessary to regenerate

DTD from relational schema– XML query languages do not query over

complex structures of elements


Simplification of DTDs

(e1, e2)* e1*, e2*(e1, e2)? e1?, e2?(e1|e2) e1?, e2?

e1** e1*e1*? e1*e1?* e1*e1?? e1?

..., a*, ..., a*, ... a*, ...

..., a*, ..., a?, ... a*, ...

..., a?, ..., a*, ... a*, ...

..., a?, ..., a?, ... a*, ……, ...a, …, a, … a*, …

Flattening Transformations Simplification Transformations

Grouping Transformations

<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>

<!ELEMENT a (b*, c?, e*, f*)>


Shared Inlining• Idea

– Inline as many sub-elements as possible– However, to avoid scattering, do not inline an

element if it is shared– I.e. do not inline only if it is a shared, recursive or set

sub-element.

• Approach– Necessary and Sufficient Condition for shared/

recursive element: e.g. authorIn-degree >= 2 in DTD graph


Shared Inlining• Technique:

– Node with in-degree 1 in the DTD graph: inline

• Special cases: * and recursion– Node with in-degree 0: create a separate

relation, because it cannot be inlined– Node with in-degree greater than 1: create

a separate relation, because it is shared


Issues with Sharing Elements

• Parent of elements not fixed at schema level– Isroot: boolean variable

• Need to store type and ids of parents (or if there are no parents)– parentCODE field (type of parent)– parentID field (id of parent)– Not foreign key relationship


Shared: Relational Schemabook (bookID: integer, book.booktitle.isroot: boolean, book.booktitle : string)article (articleID: integer, article.contactauthor.isroot: boolean,

article.contactauthor.authorid: string)monograph (monographID: integer, monograph.parentID: integer,

monograph.parentCODE: integer, monograph.editor.isroot: boolean,monograph.editor.name: string)

title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)author (authorID: integer, author.parentID: integer, author.parentCODE: integer,

author.name.isroot: boolean, author.name.firstname.isroot: :boolean,author.name.firstname: string, author.name.lastname.isroot: boolean,author.name.lastname: string, author.address.isroot: boolean,author.address: string, author.authorid: string)


Shared Inlining

• Advantages– Reduces number of joins for queries like “get

the first and last names of an author”– Efficient for queries such as “list all authors

with name Jack”• Problems

– Sharing whenever possible implies extra joins for path expressions. E.g. “Article with a given title name”


Converting XML Queries to SQL

• Converting queries with simple path expressions– Relation(s) corresponding

to start of root path expression(s) are identified and added to from clause of SQL query

– If necessary, path expressions are translated to joins among relations


Converting Relational Results to XML

• Simple Structuring– Constructing such

results from a RDBMS is natural and efficient

– Requires attaching appropriate tags for each tuple


Model-mapping approach

• Fixed database schema for different XML documents without the support of DTD– Edge (Florescu and Kossmann, 1999)– XRel (YoshiKawa, M. and Amagasa, T., 2001)– XParent (H. Jiang et al., 2002)– iNode (Lau & Ng, 2002)

• Support well-formed but non-DTD XML documents


Edge

• One-table schema– Edge(Source, Ordinal, Target, Label, Flag, Value)

• Each row corresponds to one edge in data graph

• Require table joins for edge connections• Cannot determine the parent-child

relationship directly


The Document TreeRoot

1

2

7 8 9

42

208207 209 210

10481047

5232 5237Anthony I

Wasserman

11 1


KarenBotnich

SigmodRecord

issue


article


author author

@position@position

00 01

element node

text node

attribute node


EdgeSrc Ord Tgt Label Flag Value0 1 1 SigmodRecord ref1 1 2 issue ref2 1 3 volume val 112 2 4 number val 12 3 5 articles ref5 1 6 article ref6 1 7 title val Annotated ….6 2 8 initPage val 456 3 9 endPage val 776 4 10 authors ref10 1 11 author val Anthony I …11 1 12 @position val 0010 2 13 author val Karen Botnich13 1 14 @position val 01


XParent

• XParent– uses LabelPath and DataPath tables to maintain

the structural information of the XML documents.

– The DataPath table is also used to keep the parent-child relationship instead of using region as in XRel.


XParent

• Four-table schema– LabelPath(PathID, Len, Path)– DataPath(Pid, Cid)– Element(PathID, Ordinal, Did)– Data(PathID, Did, Ordinal, Value)

• Determine the parent-child relationship by using equijoin


The Document TreeRoot

1

2

7 8 9

42

208207 209 210

10481047

5232 5237Anthony I

Wasserman

11 1


KarenBotnich

SigmodRecord

issue


article


author author

@position@position

00 01

element node

text node

attribute node


XParentPID PathLen PathExp1 1 ./SigmodRecord2 2 ./SigmodRecord./issue3 3 ./SigmodRecord./issue./volume4 3 ./SigmodRecord./issue./number5 3 ./SigmodRecord./issue./articles6 4 ./SigmodRecord./issue./articles./article7 5 ./SigmodRecord./issue./articles./article./title8 5 ./SigmodRecord./issue./articles./article./initPage9 5 ./SigmodRecord./issue./articles./article./endPage10 5 ./SigmodRecord./issue./articles./article./authors11 6 ./SigmodRecord./issue./articles./article./authors./author12 7 ./SigmodRecord./issue./articles./article./authors./author./@position PID CID

1 22 32 42 55 65 156 76 86 96 1010 1110 1210 1310 14

LabelPath Table

DataPath Table


XParentPathID Ordinal Did1 1 12 1 23 1 34 1 45 1 56 1 67 1 78 1 89 1 910 1 1011 1 1112 1 1211 2 1312 1 14 PathID Did Ordinal Value

3 3 1 114 4 1 17 7 1 Annotated Bibliography on Data Design.8 8 1 459 9 1 7712 12 1 0011 11 1 Anthony I. Wasserman12 14 1 0111 13 2 Karen Botnich

Element Table

Data Table


INode

• Model-mapping approach• Node-oriented approach• Based on a node numbering scheme• Using the virtual node concept to calculate

the id for each node (Lee et al., 1996)


Key Features

1. Does not need to concatenate the edges to form a simple path for query processing

2. Reduces the database size by using a table to store all simple path expressions

3. The parent-child relationship is embedded in the NodeID

4. Retrieve the parent-child relationship by calculation instead of using equijoins orθ–joins


INode Numbering Scheme (1/4)

• Given a node n, denote the path of n asa1, a2, a3, …, akwhere k is the depth of node n

and 1 ≤ ai ≤ nc

1

32

1198 10 12 13

4

765

1,1 1,2 1,3

1,1,1

1( )⎥⎦⎥

⎢⎣⎢ +

−= 12)(

kiiParent

real node virtual node



• Based on the Parent(i) function, we have a path function as follow:

⎪⎪⎩

⎪⎪⎨

⎧

>++−

≤=

∑ ∑

∑

=

−

=−

−−

=

2,1

2,)( 2

1

3

0

22

2

1

kannan

kapathf

i

k

iik

ic

kci

kc

ii

Where nc = maximum number of child nodes for a node

k = path length of the node

ai = value in the path at the position i



• With the path function, we embed the document-id in the node-id with the following function, where i is the number of decimal places for document-id

i

docidpathfnodeid10

)( +=



• Given the node-ids of i and j, where i is the ancestor of j, and the depth difference between two nodes as d, we can confirm the ancestor-descendent relationship with the following function

⎣ ⎦jjdc

d

i

icj

i nodeidnodeidn

nnodeidnodeid −+

⎥⎥⎥⎥

⎦

⎥

⎢⎢⎢⎢

⎣

⎢

+−−

=∑−

= 12

1

1


Table Mapping

• Three-table schema– Path(PathID, PathExp)– Element(NodeID, PathID, Ordinal, Value)– Attribute(NodeID, PathID, Value)

• Determine the parent-child relationship by calculation


Querying - EdgeQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author

select authors.valuefrom edge sigmodrecord, edge issue, edge articles, edge article,

edge authors, edge author, edge endpagewhere sigmodreacord.label = ‘SigmodRecord’

and issue.label = ‘issue’and articles.label = ‘articles’and article.label = ‘article’and authors.label = ‘authors’and author.label = ‘author’and endpage.label = ‘endPage’and sigmodrecord.source = ‘0’and sigmodrecord.target = issue.sourceand issue.target = articles.sourceand articles.target = article.sourceand article.target = authors.sourceand endpages.source = authors.sourceand author.source = authors.targetand endpage.value = ‘77’


Querying - XParentQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author

select d1.valuefrom labelpath lp1, labelpath lp2, datapath dp1, datapath dp2, data d1, data d2

labelpath lp3, datapath dp3where lp1.path = ‘./SigmodRecord./issue./articles./article./authors/author'

and lp2.path = ‘./SigmodRecord./issue./articles./article./endPage’lp3.path = ‘./SigmodRecord./issue./articles./article./authors'and d1.pathid = lp1.pathidand d2.pathid = lp2.pathidand d1.did = dp1.cidand d2.did = dp2.cidand dp3.pid = dp2.pidand dp1.pid = dp3.cidand d2.value = ‘77’


Query Processing - iNode

Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors

select e1.valuefrom path p1, path p2, element e1, element e2where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'

and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'and e1.pathid = p1.pathidand e2.pathid = p2.pathidand mod(e1.nodeid, 1) = mod(e2.nodeid, 1)and floor(((e1.nodeid – 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)and e2.value = '77'


Discussion• Edge

– easy to maintain because it uses a singe table schema– Since it only maintains edges individually, it needs a

large number of equijoins to check the edge-connections.

• XRel– uses a table to store all simple path expressions. This

improves the query performance and performs regular path expressions easily.

– it can identify the containment relationship by using θ–joins. However, θ–join is more costly than an equijoin.


Discussion

• XParent– uses a table to maintain the parent-child relationship

instead of using the concept of region. Therefore, it can use equijoins to test the relationship and the performance can be increased.

– For some complex queries that require checking the ancestor-descendent relationship, XParent requires a large number of equijoins to check for the relationship.

– Although it can use another table, called Ancestor, to keep the ancestor-descendent relationship, it requires more storage space.


End of Lecture

Download - XML in Relational DBcstyng/webdb.07/lectures/lesson5.pdfXML Data Storage - Relational DB 5 Structure-mapping approach • Design database schema based on DTD • Different schemas

Top Related