xml storage - 1 xml storage and querying. xml storage - 2 today's question 1.relational xml...
TRANSCRIPT
XML Storage - 1
XML Storage and Querying
XML Storage - 2
Today's Question
1.Relational XML Data Storage
2.XML Document Querying
XML Storage - 3
1. XML Data Storage• Problem: persistent storage of (large sets of large) XML documents
• Goal: find a storage system that supports both efficient search and updates
• Alternatives for XML database systems– Reuse of existing technologies for persistent data management
• Relational databases• Object-oriented databases• Text retrieval systems
– Native XML data storage systems
XML Storage - 4
Relational XML Data Storage• Advantage: reuse of existing robust and scalable technology• Disadvantage: Data model mismatch
– Hierarchical vs. relational data– Optional elements in XML schemas
Relational Database System
XML Translation Layer
XMLSchema
Relational Schema
TranslationInformation
XML Documents
Tuples
XMLQuery
SQLQuery
RelationalResult
XMLResult
XML Storage - 5
What Do You Think ?
• How to store XML documents in a relational database ?
• What problems need to be addressed ?
XML Storage - 6
Edge Table Approach
source ordinal name flag target
1 1 booktitle string v1
1 2 author ref 2
2 1 name ref 3
2 2 address ref 4
3 1 fi rstnamestring v2
3 2 lastname string v3
4 1 city string v4
4 2 zip int v5
id value
v5 99999
id value
v1 The Self
v2 Richard
v3 Dawkins
v4 Timbuktu
v6 dawkins
source ordinal attrname flag target
1 2 id string v6
Edge table EDGE
Integer table INTString table STR
Attribute table ATT
Store all edges of the XML document graph in one table
1 <book> <booktitle> The Selfish Gene </booktitle>2 <author id = “dawkins”>3 <name> <firstname> Richard </firstname> <lastname> Dawkins </lastname> </name>4 <address> <city> Timbuktu </city> <zip> 99999 </zip> </address> </author> </book>
v 1
book
book title
a u tho r
name
v 3v 2
firs tnam e la s tname
add re s s
v 5v 4
c ity zip
1
1 2
3 4
1 21 2
1 2
XML Storage - 7
Example: Processing of Path Queries
• The XPath query
//author/name/firstname
• Requires evaluation of the following SQL Query
SELECT S.value
FROM EDGE E1, EDGE E2, EDGE E3, EDGE E4, STR S
WHERE E1.target=E2.source AND E2.name="author" AND E2.target=E3.source AND E3.name="name" AND E3.target=E4.source AND E4.name="firstname"
AND E4.target=S.id
XML Storage - 8
Step 1: Typed Element Tables• Given three element types A, B, C, no DTD
a1
b1 c1
b2
c2 c3
a2
b3
t1 t2 t3 t4 t5
etc.
Id_A Parent_A
Content_A
a1 NULL NULL
a2 NULL NULL
Id_B Parent_B
Content_B
b1 a1 t1
b2 NULL NULL
b3 a2 t5
Id_C Parent_C
Content_C
c1 a1 t2
c2 b2 t3
c3 b2 t4
XML Storage - 9
Step 2: Exploiting a DTD
• Given three element types A, B, C, with DTD
<!ELEMENT A (B, C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>
a1
b1 c1
a2
b2 c3
t1 t2 t3 t4
etc.
Id_A Content_B
Content_C
a1 t1 t2
a2 t3 t4
_1 t5 NULL
_2 NULL t6
CREATE TABLE A( Id_A INTEGER, Content_B CHAR, Content_C CHAR PRIMARY KEY Id_A)
A
B C
b3 c4
t5 t6
XML Storage - 10
Integrity Constraints with Inlining(oct12)• Given three element types A, B, C, with DTD
<!DOCTYPE A [ <!ELEMENT A (B, C)> <!ELEMENT B #PCDATA> <!ELEMENT C #PCDATA>]>
a1
b1 c1
a2
b2 c3
t1 t2 t3 t4
etc.
Id_A Content_B
Content_C
a1 t1 t2
a2 t3 t4
CREATE TABLE A( Id_A INTEGER, Content_B CHAR NOT NULL, Content_C CHAR NOT NULL PRIMARY KEY Id_A)
A
B C
b3 c4
t5 t6
Given a DTD graph, a node is inlinable if and only if it hasexactly one incoming edge and that edge is a ,-edge.
XML Storage - 11
Step 3: Optional Elements• Given three element types A, B, C, with DTD
<!ELEMENT A (B?, C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>
• <!ELEMENT A (B | C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>
CREATE TABLE A( Id_A INTEGER, Content_B CHAR, Content_C CHAR PRIMARY KEY Id_A )
Optional elements have no impact on structure (just on constraints) !
A
B C
A
B C
?
XML Storage - 12
Step 4: Repeated Elements• Given three element types A, B, C, with DTD
<!ELEMENT A (B*,C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>
a1
c1
a2
b3 c2
t3 t4 t5
etc.b2b1
t1 t2
Id_A Content_C
a1 t2
a2 t5
Id_B Parent_B Content_B
b1 a1 t1
b2 a1 t2
b3 a2 t4
foreign key
CREATE TABLE B( Id_B INTEGER, Parent_B INTEGER , Content_B CHAR PRIMARY KEY Id_B FOREIGN KEY Parent_B REFERENCES A(Id_A))
A
B C
*
XML Storage - 13
Step 5: Shared Elements• Given three element types A, B, C, with DTD
<!ELEMENT A (C)><!ELEMENT B (C)><!ELEMENT C #PCDATA>
a1
c1
t1
etc.A B
C
b1
c2
t2
a2
c3
t3
b2
c4
t4
Id_A Content_C
a1 t1
a2 t3
Id_B Content_C
b1 t2
b2 t4
XML Storage - 14
Nested Content Models• The following is required (Completeness)
– Any XML DTD X must be mappable to a relational schema R– Any XML document XD conforming to DTD X must be mappable
to relational tuples in RD conforming to schema R– Any XML query against X must be mappable to an SQL query
against R– The document XD must be recoverable from the database RD
• The following is not required !– the DTD X must be recoverable from the schema R !
• Therefore we may modify the DTD before mapping to a relational schema
XML Storage - 15
Normalization of Content Models• Normalization steps:
– eliminate choice: (A|B) -> (A?, B?)– distribute occurrence operators: (A, B)* -> (A*, B*), (A, B)? -> (A?, B?)– contract occurrence operators: A** -> A*, A?? -> A?, A*? -> A*, A?* -> A* – contract elements: (A, B, A) -> (A*,B) (analogously for A*, A? instead of A)
• Example:
(A?, (B*, A, (B|C)), B) -> (eliminate choice)
(A?, (B*, A, (B?, C?)), B) -> (eliminate paranthesis)
(A?, B*, A, B?, C?, B) -> (contract element A)
(A*, B*, B?, C?, B) -> (contract element B) (A*, B*, C?)
XML Storage - 16
DTD Graph• For a DTD with normalized content models we can generate a DTD
graph– each element type is represented by exactly one node– arrows connect elements with their subelements– arrows are marked with *, + or ? according to the content model– attributes are connected to their element
A
B C
A
B C
?A
B C
*A B
C
XML Storage - 17
Creation of Relations (Top Nodes)
1. not reachable from another node
2. direct child of * or + marked edge
3. recursive node with in-degree >1
4. one node among two (or more) mutually recursive nodes with in-degree = 1
5. nodes with in-degree >1 that are reachable from the same top node by different paths composed exclusively of non-top nodes
A
A
*
B A
A B
XML Storage - 18
The "Hybrid" Inlining Algorithm
1. Normalize all content models
2. Create the DTD graph
3. Mark top nodes
4. Create a table for top nodes with a (system-generated) identifier.
5. Create attributes for leaf nodes; compose names by concatenating the element/attribute names along path from top node to leaf node.
6. Create an attribute root_type to record the root of the document.
7. If a top node is direct descendant of several top nodes, add a parent_type attribute to denote the type of the parent top node
8. If a top node is direct descendant of a top node add a parent_element attribute for keeping the foreign key of the parent top node1
9. If a node is child of multiple nodes but direct descendant of a single top node add a childof_type attribute to denote the type of the parent node
1 here direct descendant means direct descendant in the graph of top nodes only
XML Storage - 19
Example DTD
<!ELEMENT book (booktitle, author, citation)><!ELEMENT article (title, author, citation*)><!ELEMENT author (name)><!ATTLIST author id ID #REQUIRED><!ELEMENT name ((firstname, lastname)| lastname)><!ELEMENT citation (article|book)>
Step 1: Normalizing the DTD
((firstname, lastname)| lastname) -> ((firstname,lastname)?,lastname?)-> (firstname?, lastname?, lastname?) -> (firstname?, lastname*)
XML Storage - 20
Step 2-8: DTD Graph/Top Nodes/Relations
book article
authorbooktitle
name
firstname lastname
citation
authorid
*
*?
title
Book(ID_book, root_type, book_booktitle, book_author_authorid, book_author_name_firstname)Article(ID_article, root_type, article_title, article_author_authorid, article_author_name_firstname, parent_article)citation(ID_citation, root_type, parent_type, parent_citation)lastname(ID_lastname, root_type, parent_type, parent_lastname, lastname)
XML Storage - 21
Example Documents<book> <booktitle>Complexity</booktitle> <author authorid=1> <name><lastname>Karp</lastname></name> </author></book>
<article> <title>NP=P</title> <author authorid=2> <name><firstname>Li</firstname><lastname>Yu</lastname></name> </author> <citation> <book> <booktitle>Complexity Theory</booktitle> <author authorid=3> <name><lastname>Shamir</lastname></name> </author> </book> </citation></article>
XML Storage - 22
Relational TablesID_book root_type book_booktitle book_author_authorid book_author_name_f
irstname
b1 book Complexity 1 NULL
b2 book Complexity Theory
3 NULL
ID_article root_type article_title article_author_authorid article_author_name_firstname
a1 article NP=P 2 Li
ID_citation
root_type parent_type parent_citation
c1 article book b2
ID_lastname
root_type parent_type parent_lastname
lastname
l1 book book b1 Karp
l2 article article a1 Yu
l3 article book b2 Shamir
XML Storage - 23
Processing of XML Queries• Requires 3 Steps
– Step 1: Conversion of XML Queries against the DTD to SQL queries against the relational storage schema
– Step 2: Evaluation of the relational query– Step 3: Conversion of the relational query result into an XML
document
• Difficulty– Step 1 is straightforward– Step 2 performed by RDBMS– Step 3 simple for XPath queries (only elements are returned),
but very difficult for structured query results
XML Storage - 24
Conversion of XPath Queries• Conversion of path expressions
– Start from the root of the query– Inlined elements can be directly accessed– Elements in other relations are accessed via joins
• Recursive path expressions (queries containing // )– Require transformation into an SQL fixpoint query
(nonstandard feature)
• Example: The XPath query
/book/author/name/lastname
is converted to the following SQL Query
SELECT L.lastname FROM book B, lastname LWHERE B.ID_book=L.parent_lastname
XML Storage - 25
2. XML Document Filtering(oct20)
• Architecture of an XML Based SDI System
XML Conversion
SDI FilterEnglish Filtered Data
Data source
XML DocumentsUsers
User Profiles
SDI-Selective Dissemination of Information
XML Storage - 26
Introduction(OCt23)• A large number of users wants to receive XML documents from a document
stream according to individual profiles
– News feeds, stock tickers, … (compare data broadcast)
– Documents are distributed individually to users (unicast)
– Individual profiles expressed as XPath queries
• Goal: efficiently match each single incoming document against a large number of user profiles
– Minimize processing time
– Scalability in the number of queries
FilterEngine
incomingXML documents
push to subscribed users
user profiles
XM
L
Pars
er
events
XPath Parser
Dis
sem
inatio
n
filtereddocs
Profiles
XML Storage - 27
What Do You Think ?
• Possible approaches to filter XML documents given a large number of XPath expressions as profiles ?Q1 = /A/B//CQ2 = //B/*/C/DQ3 = /*/A/C//DQ4 = //B/D/EQ5 = /A/*/*/C//E
XML Storage - 28
Event-based XML Parsing (SAX)
<A> start(A, 1)<B> start(B, 2)
text PCDATA(3)</B> end(B)<A> start(A, 2)
<C> start(C, 3)<D> start(D, 4)text PCDATA(5)</D> end(D)
</C> end(C)</A> end(A)
</A> end(A)
event stream
XML Storage - 29
Step 1: Prefiltering Event Stream
Example Query Set:Q1 = /A/B//CQ2 = //B/*/C/DQ3 = /*/A/C//DQ4 = //B/D/EQ5 = /A/*/*/C//E
<A> start(A, 1) Q1, Q3, Q5 ok<B> start(B, 2) Q2, Q4, oktext PCDATA(3)</B> end(B)<A> start(A, 2)
<C> start(C, 3)<D> start(D, 4)text PCDATA(5)</D> end(D)
</C> end(C)</A> end(A)
</A> end(A)
XML Storage - 30
Step 2: Building a Query TreeA B
B C C D
D EDC E
Q1 Q3 Q5 Q2 Q4
<A> start A<B> start Btext PCDATA</B> end B<A> start A
<C> start C Q1 ok<D> start D Q2, Q3 oktext PCDATA
</D> end D</C> end C
</A> end A</A> end A
Example Query Set:Q1 = /A/B//CQ2 = //B/*/C/DQ3 = /*/A/C//DQ4 = //B/D/EQ5 = /A/*/*/C//E
XML Storage - 31
Compacting the Trie
A B
B C C D
D EDC E
Q1(3) Q3(3) Q5(3) Q2(3) Q4(3)
Q1(2)
Q1(1),Q3(1),Q5(1) Q2(1),Q4(1)
Q1(0),Q2(0),Q3(0),Q4(0),Q5(0)
Q4(2)Q2(2)Q3(2),Q5(2)
A B C D E
Q1(1)Q3(1)Q5(1)
Q2(1)Q4(1)
Q1(2) Q1(3)Q3(2)Q2(2)Q5(2)
Q2(3)Q3(3)Q4(2)
Q4(3)Q5(3)
candidate lists
waiting lists
whenever a member of the candidate list is matched promote the next member of the waiting list for the respective query
hash table
Example Query Set:Q1 = /A/B//CQ2 = //B/*/C/DQ3 = /*/A/C//DQ4 = //B/D/EQ5 = /A/*/*/C//E
XML Storage - 32
Step 3: Adding Level Information
Q(p,l,r): p position of element in XPath expressionl required level of element in documentr relative level with respect to previous element(na = not applicable, undet = undetermined)
Q(p,l,r) is called a pathnode
Q1 = /A/B//C Q1(1,1,na), Q1(2,2,1), Q1(3,undet,undet)Q2 = //B/*/C/D Q2(1,undet,na), Q2(2,undet,2), Q2(3,undet,1)Q3 = /*/A/C//D Q3(1,2,na), Q3(2,3,1), Q3(3,undet,undet)Q4 = //B/D/E Q4(1,undet,undet), Q4(2,undet,1), Q4(3,undet,1)Q5 = /A/*/*/C//E Q5(1,1,na), Q5(2,4,3), Q5(3,undet,undet)
level l can be dynamically determined (using r) while document is parsed
XML Storage - 33
Processing of Start Element Events
If start(element, level) arrives:
Step 1: obtain candidate list for element
Step 2: for each candidate entry Q(p, l, r)if l > 0 then
if l = level then continue else no match else continue
Step 3: check other conditions if applicable (attributes, text content)
Step 4: if continue and conditions are satisfied if last node of query path: successful match of profile else promote the next node corresponding to Q from the
waiting list to the candidate list
XML Storage - 34
Promotion from Waiting List
Promoting the next entry Q( p+1, lw, rw)
Step 1: add a copy of Q( p+1, lw, rw) from the waiting listto the candidate list/* Copying is required as the node may appear again later (multiple occurrences of the same element) */
Step 2: update the level value
if rw not undet then lw = rw + level else lw = undet
/* The character elements are processed analogously */
XML Storage - 35
Processing of End Element Events
If end(element) arrives:
Remove the entry in the candidate list that has been entered when the corresponding start(element,level) was encountered
Example: Q1 = /A/B//C Q1(1,1,na), Q1(2,2,1), Q1(3,undet,undet)
At the beginning Q1(1,1,na) corresponding to A is in the candidate listWhen A is encountered Q1(2,2,1) is promotedWhen end A is encountered Q1(2,2,1) is removed (the pathnode corresponding to B which is following A in the query)
XML Storage - 36
ExampleDocument
<A><B>text</B><C></C><B><C><D></D></C></B>
</A>
Document events
start(A, 1), start(B, 2), PCDATA(3), end(B, 2), start(C, 2), end(C, 2), start(b, 2), start(C, 3), start(D, 4), end(D, 4), end(C, 3), end(B, 2), end(A, 1)
Query Q = /A/B//D
Pathnodes Q(1, 1, na), Q(2, 2, 1), Q(3, undet, undet)
XML Storage - 37
Event Processing - ExampleEvent Processing Candidate list
init A: Q(1,1,na), B:_, C:_, D:_, E:_
start(A,1) promote Q(2, 2, 1) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:_, E:_
start(B,2) promote Q(3, undet, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:Q(3, undet, undet), E:_
PCDATA(3)
end(B) remove Q(3, undet, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:_, E:_
start(C,2)
end(C,2)
start(B,2) promote Q(3, undet, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:Q(3, undet, undet), E:_
start(C,3)
start(D,4) match A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:Q(3, undet, undet), E:_
end(D)
end(C)
end(B) remove Q(3, 3, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:_, E:_
end(A) remove Q(2, 2, 1) A: Q(1,1,na), B:_, C:_, D:_, E:_
XML Storage - 38
Event Processing - Illustration
promoteQ(3, undet, undet)
removeQ(3, 3, undet)
removeQ(3, undet, undet)B
A
C B
pcdata C
D
promote Q(2, 2, 1)
promote Q(3, undet, undet)
Q=/A/B//D
match
removeQ(2, 2, 1)
XML Storage - 39
Summary
• How can database query processing and message filtering be compared ?
• Which approaches did we introduce to perform XML filtering ?
• Which properties are stored in path nodes ? Why is the relative position required in a path node ?
• Which properties of path nodes are modified during document event processing ? What needs to be done in document event processing when an end element tag arrives ?
• Why ist postprocesssing of filter conditions required ?
• Why is it not efficient to build candidate lists based on the initial elements of query paths ?
XML Storage - 40
References• Course material based on
– Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314
– Dongwon Lee, Wesley W. Chu: CPI: Constraints-Preserving Inlining algorithm for mapping XML DTD to relational schema. DKE 39(1): 3-25 (2001)
– Daniela Florescu, Donald Kossmann: Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin 22(3): 27-34 (1999)
– Mehmet Altinel, Michael J. Franklin: Efficient Filtering of XML Documents for Selective Dissemination of Information. VLDB 2000: 53-64
• More background material– http://www.acm.org/sigmod/record/index.html
SIGMOD RECORD, Volume 30, Number 3, September 2001Special Section on Advanced XML Data Processing (ed. K. Aberer)