XML Data Storage - Relational DB 1
XML in Relational DB
XML Data Storage - Relational DB 2
A Sample File<SigmodRecord>
<issue><volume>11</volume><number>1</number><articles>
<article><title>Annotated Bibliography on Data Design.</title><initPage>45</initPage><endPage>77</endPage><authors>
<author position="00">Anthony I. Wasserman</author><author position="01">Karen Botnich</author>
</authors></article>
</articles></issue></SigmodRecord>
XML Data Storage - Relational DB 3
Corresponding Document TreeRoot
1
2
7 8 9
42
208207 209 210
10481047
5232 5237Anthony I
Wasserman
11 1
Annotated Bibliographyon Data Design. 45 77
KarenBotnich
SigmodRecord
issue
number articlesvolume
article
endPage authorsinitPagetitle
author author
@position@position
00 01
element node
text node
attribute node
XML Data Storage - Relational DB 4
Two Mapping Approaches
• Structure-mapping approach– static DTD– well-defined semantic constraints
• Model-mapping approach– dynamic DTD– ambiguous semantic constraints
XML Data Storage - Relational DB 5
Structure-mapping approach
• Design database schema based on DTD
• Different schemas for different DTDs
• Slides partially from a VLDB 2002 Tutorial
Relational DB
DTD
XML Doc
XMLQuery
XML
XML Data Storage - Relational DB 6
A Sample DTD<!ELEMENT book (booktitle, author)<!ELEMENT booktitle (#PCDATA)><!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)><!ELEMENT firstname (#PCDATA)><!ELEMENT lastname (#PCDATA)><!ELEMENT address ANY><!ELEMENT article (title, author*, contactauthor?)><!ELEMENT title (#PCDATA)><!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF IMPLIED><!ELEMENT monograph (title, author, editor)><!ELEMENT editor (monograph*)>
<!ATTLIST editor name CDATA #REQUIRED>
XML Data Storage - Relational DB 7
A DTD Graphbook
booktitle
author
monograph
title
contactauthor
authorID
editor
*
nameaddress
?
firstname lastname
?
authorid
article
*
name
12 elements
XML Data Storage - Relational DB 8
Basic Inlining• An element graph is created for each element in the DTD
graph• Relations created for:
– Root of an element graph– Set value children (under a *)– Nodes with a back pointer (recursion)
• All other descendants inlined• Every element is made into as comprehensive a relation as
possible• Set-valued attributes are handled with separate relations
and foreign keys
XML Data Storage - Relational DB 9
Basic Inlining• An element graph is constructed by
– Do a DFS traversal of DTD graph, starting at element node for which we are constructing relations
– Each node is marked as “visited” the first time it is reached and is unmarked once all its children have been traversed
– If an unmarked node in DTD graph is reach during DFS, a new node bearing the same name is created in the element graph
– A regular edge is created from the most recently created node in the element graph with the same names as the DFS parent of the current DTD node to newly created node
– If an attempt is made to traverse an already marked DTD, then a backpointer edge is added from the most recently created node in the element graph to the most recently created node in the element graph of the same name as the marked DTD node
editor
name
XML Data Storage - Relational DB 10
Basic Inlining
• A relation is created for root of element of graph
• All element’s descendents are inlined into that relation except– Children below a “*” node
are made into separate relations – this corresponds to creating a new relation for a set-valued child
– Each node having a backpointer edge pointing to it is made into a separate relation
XML Data Storage - Relational DB 11
Basic: Relational Schemabook (bookID: integer, book.booktitle : string, book.author.name.firstname: string, book.author.name.lastname: string,
book.author.address: string, author.authorid: string)booktitle (booktitleID: integer, booktitle: string)article (articleID: integer, article.contactauthor.authorid: string, article.title: string)article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string,
article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)contactauthor (contactauthorID: integer, contactauthor.authorid: string)title (titleID: integer, title: string)monograph (monographID: integer, monograph.parentID: integer, monograph.title: string,
monograph.editor.name: string, monograph.author.name.firstname: string, monograph.author.name.lastname: string, monograph.author.address: string, monograph.author.authorid: string)
editor (editorID: integer, editor.parentID: integer, editor.name: string)editor.monograph (editor.monographID: integer, editor.monograph.parentID: integer, editor.monograph.title: string,
editor.monograph.author.name.firstname: string, editor.monograph.author.name.lastname: string, editor.monograph.author.address: string, editor.monograph.author.authorid: string)
author (authorID: integer, author.name.firstname: string, author.name.lastname: string, author.address: string,aauthor.authorid: string)
name (nameID: integer, name.firstname: string, name.lastname: string)firstname (firstnameID: integer, firstname: string)lastname (lastnameID: integer, lastname: string)address (addressID: integer, address: string)
XML Data Storage - Relational DB 12
Fragmentation Problem<!ELEMENT author (name, address)><!ATTLIST author id ID #REQUIRED><!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
author (authorID: integer, id: string)
name (nameID: integer, authorID: integer)
firstname (firstnameID: integer, nameID: integer, value: string)
lastname (lastnameID: integer, nameID: integer, value: string)
address (addressID: integer, authorID: integer, value: string)
Each element is a table
XML Data Storage - Relational DB 13
Problems in Basic In-lining• Creating a large amount of relations
– Separate relational schema for each element as root (minor)
– Unrolling recursive strongly connected components (major)
• Consuming great storage space when where are many null able(optional) attributes
• Too many joins– E.g. author becomes scattered (i.e., it appears
once for each possible instantiation from a root). Query: list all the books authored by Louis
XML Data Storage - Relational DB 14
Complexity of DTDs• DTD element specifications can be of
arbitrary complexity• <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!• Fortunately, can simplify DTD for
translation purposes:– Key observations: not necessary to regenerate
DTD from relational schema– XML query languages do not query over
complex structures of elements
XML Data Storage - Relational DB 15
Simplification of DTDs
(e1, e2)* e1*, e2*(e1, e2)? e1?, e2?(e1|e2) e1?, e2?
e1** e1*e1*? e1*e1?* e1*e1?? e1?
..., a*, ..., a*, ... a*, ...
..., a*, ..., a?, ... a*, ...
..., a?, ..., a*, ... a*, ...
..., a?, ..., a?, ... a*, ……, ...a, …, a, … a*, …
Flattening Transformations Simplification Transformations
Grouping Transformations
<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>
<!ELEMENT a (b*, c?, e*, f*)>
XML Data Storage - Relational DB 16
Shared Inlining• Idea
– Inline as many sub-elements as possible– However, to avoid scattering, do not inline an
element if it is shared– I.e. do not inline only if it is a shared, recursive or set
sub-element.
• Approach– Necessary and Sufficient Condition for shared/
recursive element: e.g. authorIn-degree >= 2 in DTD graph
XML Data Storage - Relational DB 17
Shared Inlining• Technique:
– Node with in-degree 1 in the DTD graph: inline
• Special cases: * and recursion– Node with in-degree 0: create a separate
relation, because it cannot be inlined– Node with in-degree greater than 1: create
a separate relation, because it is shared
XML Data Storage - Relational DB 18
Issues with Sharing Elements
• Parent of elements not fixed at schema level– Isroot: boolean variable
• Need to store type and ids of parents (or if there are no parents)– parentCODE field (type of parent)– parentID field (id of parent)– Not foreign key relationship
XML Data Storage - Relational DB 19
Shared: Relational Schemabook (bookID: integer, book.booktitle.isroot: boolean, book.booktitle : string)article (articleID: integer, article.contactauthor.isroot: boolean,
article.contactauthor.authorid: string)monograph (monographID: integer, monograph.parentID: integer,
monograph.parentCODE: integer, monograph.editor.isroot: boolean,monograph.editor.name: string)
title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)author (authorID: integer, author.parentID: integer, author.parentCODE: integer,
author.name.isroot: boolean, author.name.firstname.isroot: :boolean,author.name.firstname: string, author.name.lastname.isroot: boolean,author.name.lastname: string, author.address.isroot: boolean,author.address: string, author.authorid: string)
XML Data Storage - Relational DB 20
Shared Inlining
• Advantages– Reduces number of joins for queries like “get
the first and last names of an author”– Efficient for queries such as “list all authors
with name Jack”• Problems
– Sharing whenever possible implies extra joins for path expressions. E.g. “Article with a given title name”
XML Data Storage - Relational DB 21
Converting XML Queries to SQL
• Converting queries with simple path expressions– Relation(s) corresponding
to start of root path expression(s) are identified and added to from clause of SQL query
– If necessary, path expressions are translated to joins among relations
XML Data Storage - Relational DB 22
Converting Relational Results to XML
• Simple Structuring– Constructing such
results from a RDBMS is natural and efficient
– Requires attaching appropriate tags for each tuple
XML Data Storage - Relational DB 23
Model-mapping approach
• Fixed database schema for different XML documents without the support of DTD– Edge (Florescu and Kossmann, 1999)– XRel (YoshiKawa, M. and Amagasa, T., 2001)– XParent (H. Jiang et al., 2002)– iNode (Lau & Ng, 2002)
• Support well-formed but non-DTD XML documents
XML Data Storage - Relational DB 24
Edge
• One-table schema– Edge(Source, Ordinal, Target, Label, Flag, Value)
• Each row corresponds to one edge in data graph
• Require table joins for edge connections• Cannot determine the parent-child
relationship directly
XML Data Storage - Relational DB 25
The Document TreeRoot
1
2
7 8 9
42
208207 209 210
10481047
5232 5237Anthony I
Wasserman
11 1
Annotated Bibliographyon Data Design. 45 77
KarenBotnich
SigmodRecord
issue
number articlesvolume
article
endPage authorsinitPagetitle
author author
@position@position
00 01
element node
text node
attribute node
XML Data Storage - Relational DB 26
EdgeSrc Ord Tgt Label Flag Value0 1 1 SigmodRecord ref1 1 2 issue ref2 1 3 volume val 112 2 4 number val 12 3 5 articles ref5 1 6 article ref6 1 7 title val Annotated ….6 2 8 initPage val 456 3 9 endPage val 776 4 10 authors ref10 1 11 author val Anthony I …11 1 12 @position val 0010 2 13 author val Karen Botnich13 1 14 @position val 01
XML Data Storage - Relational DB 27
XParent
• XParent– uses LabelPath and DataPath tables to maintain
the structural information of the XML documents.
– The DataPath table is also used to keep the parent-child relationship instead of using region as in XRel.
XML Data Storage - Relational DB 28
XParent
• Four-table schema– LabelPath(PathID, Len, Path)– DataPath(Pid, Cid)– Element(PathID, Ordinal, Did)– Data(PathID, Did, Ordinal, Value)
• Determine the parent-child relationship by using equijoin
XML Data Storage - Relational DB 29
The Document TreeRoot
1
2
7 8 9
42
208207 209 210
10481047
5232 5237Anthony I
Wasserman
11 1
Annotated Bibliographyon Data Design. 45 77
KarenBotnich
SigmodRecord
issue
number articlesvolume
article
endPage authorsinitPagetitle
author author
@position@position
00 01
element node
text node
attribute node
XML Data Storage - Relational DB 30
XParentPID PathLen PathExp1 1 ./SigmodRecord2 2 ./SigmodRecord./issue3 3 ./SigmodRecord./issue./volume4 3 ./SigmodRecord./issue./number5 3 ./SigmodRecord./issue./articles6 4 ./SigmodRecord./issue./articles./article7 5 ./SigmodRecord./issue./articles./article./title8 5 ./SigmodRecord./issue./articles./article./initPage9 5 ./SigmodRecord./issue./articles./article./endPage10 5 ./SigmodRecord./issue./articles./article./authors11 6 ./SigmodRecord./issue./articles./article./authors./author12 7 ./SigmodRecord./issue./articles./article./authors./author./@position PID CID
1 22 32 42 55 65 156 76 86 96 1010 1110 1210 1310 14
LabelPath Table
DataPath Table
XML Data Storage - Relational DB 31
XParentPathID Ordinal Did1 1 12 1 23 1 34 1 45 1 56 1 67 1 78 1 89 1 910 1 1011 1 1112 1 1211 2 1312 1 14 PathID Did Ordinal Value
3 3 1 114 4 1 17 7 1 Annotated Bibliography on Data Design.8 8 1 459 9 1 7712 12 1 0011 11 1 Anthony I. Wasserman12 14 1 0111 13 2 Karen Botnich
Element Table
Data Table
XML Data Storage - Relational DB 32
INode
• Model-mapping approach• Node-oriented approach• Based on a node numbering scheme• Using the virtual node concept to calculate
the id for each node (Lee et al., 1996)
XML Data Storage - Relational DB 33
Key Features
1. Does not need to concatenate the edges to form a simple path for query processing
2. Reduces the database size by using a table to store all simple path expressions
3. The parent-child relationship is embedded in the NodeID
4. Retrieve the parent-child relationship by calculation instead of using equijoins orθ–joins
XML Data Storage - Relational DB 34
INode Numbering Scheme (1/4)
• Given a node n, denote the path of n asa1, a2, a3, …, akwhere k is the depth of node n
and 1 ≤ ai ≤ nc
1
32
1198 10 12 13
4
765
1,1 1,2 1,3
1,1,1
1( )⎥⎦⎥
⎢⎣⎢ +
−= 12)(
kiiParent
real node virtual node
XML Data Storage - Relational DB 35
INode Numbering Scheme (2/4)
• Based on the Parent(i) function, we have a path function as follow:
⎪⎪⎩
⎪⎪⎨
⎧
>++−
≤=
∑ ∑
∑
=
−
=−
−−
=
2,1
2,)( 2
1
3
0
22
2
1
kannan
kapathf
i
k
iik
ic
kci
kc
ii
Where nc = maximum number of child nodes for a node
k = path length of the node
ai = value in the path at the position i
XML Data Storage - Relational DB 36
INode Numbering Scheme (3/4)
• With the path function, we embed the document-id in the node-id with the following function, where i is the number of decimal places for document-id
i
docidpathfnodeid10
)( +=
XML Data Storage - Relational DB 37
INode Numbering Scheme (4/4)
• Given the node-ids of i and j, where i is the ancestor of j, and the depth difference between two nodes as d, we can confirm the ancestor-descendent relationship with the following function
⎣ ⎦jjdc
d
i
icj
i nodeidnodeidn
nnodeidnodeid −+
⎥⎥⎥⎥
⎦
⎥
⎢⎢⎢⎢
⎣
⎢
+−−
=∑−
= 12
1
1
XML Data Storage - Relational DB 38
Table Mapping
• Three-table schema– Path(PathID, PathExp)– Element(NodeID, PathID, Ordinal, Value)– Attribute(NodeID, PathID, Value)
• Determine the parent-child relationship by calculation
XML Data Storage - Relational DB 39
Querying - EdgeQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author
select authors.valuefrom edge sigmodrecord, edge issue, edge articles, edge article,
edge authors, edge author, edge endpagewhere sigmodreacord.label = ‘SigmodRecord’
and issue.label = ‘issue’and articles.label = ‘articles’and article.label = ‘article’and authors.label = ‘authors’and author.label = ‘author’and endpage.label = ‘endPage’and sigmodrecord.source = ‘0’and sigmodrecord.target = issue.sourceand issue.target = articles.sourceand articles.target = article.sourceand article.target = authors.sourceand endpages.source = authors.sourceand author.source = authors.targetand endpage.value = ‘77’
XML Data Storage - Relational DB 40
Querying - XParentQ1: /SigmodRecord/issue/articles/article[endPage=77]/authors/author
select d1.valuefrom labelpath lp1, labelpath lp2, datapath dp1, datapath dp2, data d1, data d2
labelpath lp3, datapath dp3where lp1.path = ‘./SigmodRecord./issue./articles./article./authors/author'
and lp2.path = ‘./SigmodRecord./issue./articles./article./endPage’lp3.path = ‘./SigmodRecord./issue./articles./article./authors'and d1.pathid = lp1.pathidand d2.pathid = lp2.pathidand d1.did = dp1.cidand d2.did = dp2.cidand dp3.pid = dp2.pidand dp1.pid = dp3.cidand d2.value = ‘77’
XML Data Storage - Relational DB 41
Query Processing - iNode
Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors
select e1.valuefrom path p1, path p2, element e1, element e2where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'and e1.pathid = p1.pathidand e2.pathid = p2.pathidand mod(e1.nodeid, 1) = mod(e2.nodeid, 1)and floor(((e1.nodeid – 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)and e2.value = '77'
XML Data Storage - Relational DB 42
Discussion• Edge
– easy to maintain because it uses a singe table schema– Since it only maintains edges individually, it needs a
large number of equijoins to check the edge-connections.
• XRel– uses a table to store all simple path expressions. This
improves the query performance and performs regular path expressions easily.
– it can identify the containment relationship by using θ–joins. However, θ–join is more costly than an equijoin.
XML Data Storage - Relational DB 43
Discussion
• XParent– uses a table to maintain the parent-child relationship
instead of using the concept of region. Therefore, it can use equijoins to test the relationship and the performance can be increased.
– For some complex queries that require checking the ancestor-descendent relationship, XParent requires a large number of equijoins to check for the relationship.
– Although it can use another table, called Ancestor, to keep the ancestor-descendent relationship, it requires more storage space.
XML Data Storage - Relational DB 44
End of Lecture