xml storage - 1 xml storage and querying. xml storage - 2 today's question 1.relational xml...

XML Storage - 1

XML Storage and Querying

XML Storage - 2

Today's Question

1.Relational XML Data Storage

2.XML Document Querying

XML Storage - 3

1. XML Data Storage• Problem: persistent storage of (large sets of large) XML documents

• Goal: find a storage system that supports both efficient search and updates

• Alternatives for XML database systems– Reuse of existing technologies for persistent data management

• Relational databases• Object-oriented databases• Text retrieval systems

– Native XML data storage systems

XML Storage - 4

Relational XML Data Storage• Advantage: reuse of existing robust and scalable technology• Disadvantage: Data model mismatch

– Hierarchical vs. relational data– Optional elements in XML schemas

Relational Database System

XML Translation Layer

XMLSchema

Relational Schema

TranslationInformation

XML Documents

Tuples

XMLQuery

SQLQuery

RelationalResult

XMLResult

XML Storage - 5

What Do You Think ?

• How to store XML documents in a relational database ?

• What problems need to be addressed ?

XML Storage - 6

Edge Table Approach

source ordinal name flag target

1 1 booktitle string v1

1 2 author ref 2

2 1 name ref 3

2 2 address ref 4

3 1 fi rstnamestring v2

3 2 lastname string v3

4 1 city string v4

4 2 zip int v5

id value

v5 99999

id value

v1 The Self

v2 Richard

v3 Dawkins

v4 Timbuktu

v6 dawkins

source ordinal attrname flag target

1 2 id string v6

Edge table EDGE

Integer table INTString table STR

Attribute table ATT

Store all edges of the XML document graph in one table

1 <book> <booktitle> The Selfish Gene </booktitle>2 <author id = “dawkins”>3 <name> <firstname> Richard </firstname> <lastname> Dawkins </lastname> </name>4 <address> <city> Timbuktu </city> <zip> 99999 </zip> </address> </author> </book>

v 1

book

book title

a u tho r

name

v 3v 2

firs tnam e la s tname

add re s s

v 5v 4

c ity zip

1

1 2

3 4

1 21 2

1 2

XML Storage - 7

Example: Processing of Path Queries

• The XPath query

//author/name/firstname

• Requires evaluation of the following SQL Query

SELECT S.value

FROM EDGE E1, EDGE E2, EDGE E3, EDGE E4, STR S

WHERE E1.target=E2.source AND E2.name="author" AND E2.target=E3.source AND E3.name="name" AND E3.target=E4.source AND E4.name="firstname"

AND E4.target=S.id

XML Storage - 8

Step 1: Typed Element Tables• Given three element types A, B, C, no DTD

a1

b1 c1

b2

c2 c3

a2

b3

t1 t2 t3 t4 t5

etc.

Id_A Parent_A

Content_A

a1 NULL NULL

a2 NULL NULL

Id_B Parent_B

Content_B

b1 a1 t1

b2 NULL NULL

b3 a2 t5

Id_C Parent_C

Content_C

c1 a1 t2

c2 b2 t3

c3 b2 t4

XML Storage - 9

Step 2: Exploiting a DTD

• Given three element types A, B, C, with DTD

<!ELEMENT A (B, C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>

a1

b1 c1

a2

b2 c3

t1 t2 t3 t4

etc.

Id_A Content_B

Content_C

a1 t1 t2

a2 t3 t4

_1 t5 NULL

_2 NULL t6

CREATE TABLE A( Id_A INTEGER, Content_B CHAR, Content_C CHAR PRIMARY KEY Id_A)

A

B C

b3 c4

t5 t6

XML Storage - 10

Integrity Constraints with Inlining(oct12)• Given three element types A, B, C, with DTD

<!DOCTYPE A [ <!ELEMENT A (B, C)> <!ELEMENT B #PCDATA> <!ELEMENT C #PCDATA>]>

a1

b1 c1

a2

b2 c3

t1 t2 t3 t4

etc.

Id_A Content_B

Content_C

a1 t1 t2

a2 t3 t4

CREATE TABLE A( Id_A INTEGER, Content_B CHAR NOT NULL, Content_C CHAR NOT NULL PRIMARY KEY Id_A)

A

B C

b3 c4

t5 t6

Given a DTD graph, a node is inlinable if and only if it hasexactly one incoming edge and that edge is a ,-edge.

XML Storage - 11

Step 3: Optional Elements• Given three element types A, B, C, with DTD

<!ELEMENT A (B?, C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>

• <!ELEMENT A (B | C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>

CREATE TABLE A( Id_A INTEGER, Content_B CHAR, Content_C CHAR PRIMARY KEY Id_A )

Optional elements have no impact on structure (just on constraints) !

A

B C

A

B C

?

XML Storage - 12

Step 4: Repeated Elements• Given three element types A, B, C, with DTD

<!ELEMENT A (B*,C)><!ELEMENT B #PCDATA><!ELEMENT C #PCDATA>

a1

c1

a2

b3 c2

t3 t4 t5

etc.b2b1

t1 t2

Id_A Content_C

a1 t2

a2 t5

Id_B Parent_B Content_B

b1 a1 t1

b2 a1 t2

b3 a2 t4

foreign key

CREATE TABLE B( Id_B INTEGER, Parent_B INTEGER , Content_B CHAR PRIMARY KEY Id_B FOREIGN KEY Parent_B REFERENCES A(Id_A))

A

B C

*

XML Storage - 13

Step 5: Shared Elements• Given three element types A, B, C, with DTD

<!ELEMENT A (C)><!ELEMENT B (C)><!ELEMENT C #PCDATA>

a1

c1

t1

etc.A B

C

b1

c2

t2

a2

c3

t3

b2

c4

t4

Id_A Content_C

a1 t1

a2 t3

Id_B Content_C

b1 t2

b2 t4

XML Storage - 14

Nested Content Models• The following is required (Completeness)

– Any XML DTD X must be mappable to a relational schema R– Any XML document XD conforming to DTD X must be mappable

to relational tuples in RD conforming to schema R– Any XML query against X must be mappable to an SQL query

against R– The document XD must be recoverable from the database RD

• The following is not required !– the DTD X must be recoverable from the schema R !

• Therefore we may modify the DTD before mapping to a relational schema

XML Storage - 15

Normalization of Content Models• Normalization steps:

– eliminate choice: (A|B) -> (A?, B?)– distribute occurrence operators: (A, B)* -> (A*, B*), (A, B)? -> (A?, B?)– contract occurrence operators: A** -> A*, A?? -> A?, A*? -> A*, A?* -> A* – contract elements: (A, B, A) -> (A*,B) (analogously for A*, A? instead of A)

• Example:

(A?, (B*, A, (B|C)), B) -> (eliminate choice)

(A?, (B*, A, (B?, C?)), B) -> (eliminate paranthesis)

(A?, B*, A, B?, C?, B) -> (contract element A)

(A*, B*, B?, C?, B) -> (contract element B) (A*, B*, C?)

XML Storage - 16

DTD Graph• For a DTD with normalized content models we can generate a DTD

graph– each element type is represented by exactly one node– arrows connect elements with their subelements– arrows are marked with *, + or ? according to the content model– attributes are connected to their element

A

B C

A

B C

?A

B C

*A B

C

XML Storage - 17

Creation of Relations (Top Nodes)

1. not reachable from another node

2. direct child of * or + marked edge

3. recursive node with in-degree >1

4. one node among two (or more) mutually recursive nodes with in-degree = 1

5. nodes with in-degree >1 that are reachable from the same top node by different paths composed exclusively of non-top nodes

A

A

*

B A

A B

XML Storage - 18

The "Hybrid" Inlining Algorithm

1. Normalize all content models

2. Create the DTD graph

3. Mark top nodes

4. Create a table for top nodes with a (system-generated) identifier.

5. Create attributes for leaf nodes; compose names by concatenating the element/attribute names along path from top node to leaf node.

6. Create an attribute root_type to record the root of the document.

7. If a top node is direct descendant of several top nodes, add a parent_type attribute to denote the type of the parent top node

8. If a top node is direct descendant of a top node add a parent_element attribute for keeping the foreign key of the parent top node1

9. If a node is child of multiple nodes but direct descendant of a single top node add a childof_type attribute to denote the type of the parent node

1 here direct descendant means direct descendant in the graph of top nodes only

XML Storage - 19

Example DTD

<!ELEMENT book (booktitle, author, citation)><!ELEMENT article (title, author, citation*)><!ELEMENT author (name)><!ATTLIST author id ID #REQUIRED><!ELEMENT name ((firstname, lastname)| lastname)><!ELEMENT citation (article|book)>

Step 1: Normalizing the DTD

((firstname, lastname)| lastname) -> ((firstname,lastname)?,lastname?)-> (firstname?, lastname?, lastname?) -> (firstname?, lastname*)

XML Storage - 20

Step 2-8: DTD Graph/Top Nodes/Relations

book article

authorbooktitle

name

firstname lastname

citation

authorid

*

*?

title

Book(ID_book, root_type, book_booktitle, book_author_authorid, book_author_name_firstname)Article(ID_article, root_type, article_title, article_author_authorid, article_author_name_firstname, parent_article)citation(ID_citation, root_type, parent_type, parent_citation)lastname(ID_lastname, root_type, parent_type, parent_lastname, lastname)

XML Storage - 21

Example Documents<book> <booktitle>Complexity</booktitle> <author authorid=1> <name><lastname>Karp</lastname></name> </author></book>

<article> <title>NP=P</title> <author authorid=2> <name><firstname>Li</firstname><lastname>Yu</lastname></name> </author> <citation> <book> <booktitle>Complexity Theory</booktitle> <author authorid=3> <name><lastname>Shamir</lastname></name> </author> </book> </citation></article>

XML Storage - 22

Relational TablesID_book root_type book_booktitle book_author_authorid book_author_name_f

irstname

b1 book Complexity 1 NULL

b2 book Complexity Theory

3 NULL

ID_article root_type article_title article_author_authorid article_author_name_firstname

a1 article NP=P 2 Li

ID_citation

root_type parent_type parent_citation

c1 article book b2

ID_lastname

root_type parent_type parent_lastname

lastname

l1 book book b1 Karp

l2 article article a1 Yu

l3 article book b2 Shamir

XML Storage - 23

Processing of XML Queries• Requires 3 Steps

– Step 1: Conversion of XML Queries against the DTD to SQL queries against the relational storage schema

– Step 2: Evaluation of the relational query– Step 3: Conversion of the relational query result into an XML

document

• Difficulty– Step 1 is straightforward– Step 2 performed by RDBMS– Step 3 simple for XPath queries (only elements are returned),

but very difficult for structured query results

XML Storage - 24

Conversion of XPath Queries• Conversion of path expressions

– Start from the root of the query– Inlined elements can be directly accessed– Elements in other relations are accessed via joins

• Recursive path expressions (queries containing // )– Require transformation into an SQL fixpoint query

(nonstandard feature)

• Example: The XPath query

/book/author/name/lastname

is converted to the following SQL Query

SELECT L.lastname FROM book B, lastname LWHERE B.ID_book=L.parent_lastname

XML Storage - 25

2. XML Document Filtering(oct20)

• Architecture of an XML Based SDI System

XML Conversion

SDI FilterEnglish Filtered Data

Data source

XML DocumentsUsers

User Profiles

SDI-Selective Dissemination of Information

XML Storage - 26

Introduction(OCt23)• A large number of users wants to receive XML documents from a document

stream according to individual profiles

– News feeds, stock tickers, … (compare data broadcast)

– Documents are distributed individually to users (unicast)

– Individual profiles expressed as XPath queries

• Goal: efficiently match each single incoming document against a large number of user profiles

– Minimize processing time

– Scalability in the number of queries

FilterEngine

incomingXML documents

push to subscribed users

user profiles

XM

L

Pars

er

events

XPath Parser

Dis

sem

inatio

n

filtereddocs

Profiles

XML Storage - 27

What Do You Think ?

• Possible approaches to filter XML documents given a large number of XPath expressions as profiles ?Q1 = /A/B//CQ2 = //B/*/C/DQ3 = /*/A/C//DQ4 = //B/D/EQ5 = /A/*/*/C//E

XML Storage - 28

Event-based XML Parsing (SAX)

<A> start(A, 1) start(B, 2)

text PCDATA(3) end(B)<A> start(A, 2)

<C> start(C, 3)<D> start(D, 4)text PCDATA(5)</D> end(D)

</C> end(C)</A> end(A)

</A> end(A)

event stream

XML Storage - 29

Step 1: Prefiltering Event Stream

Example Query Set:Q1 = /A/B//CQ2 = //B/*/C/DQ3 = /*/A/C//DQ4 = //B/D/EQ5 = /A/*/*/C//E

<A> start(A, 1) Q1, Q3, Q5 ok start(B, 2) Q2, Q4, oktext PCDATA(3) end(B)<A> start(A, 2)

<C> start(C, 3)<D> start(D, 4)text PCDATA(5)</D> end(D)

</C> end(C)</A> end(A)

</A> end(A)

XML Storage - 30

Step 2: Building a Query TreeA B

B C C D

D EDC E

Q1 Q3 Q5 Q2 Q4

<A> start A start Btext PCDATA end B<A> start A

<C> start C Q1 ok<D> start D Q2, Q3 oktext PCDATA

</D> end D</C> end C

</A> end A</A> end A

XML Storage - 31

Compacting the Trie

A B

B C C D

D EDC E

Q1(3) Q3(3) Q5(3) Q2(3) Q4(3)

Q1(2)

Q1(1),Q3(1),Q5(1) Q2(1),Q4(1)

Q1(0),Q2(0),Q3(0),Q4(0),Q5(0)

Q4(2)Q2(2)Q3(2),Q5(2)

A B C D E

Q1(1)Q3(1)Q5(1)

Q2(1)Q4(1)

Q1(2) Q1(3)Q3(2)Q2(2)Q5(2)

Q2(3)Q3(3)Q4(2)

Q4(3)Q5(3)

candidate lists

waiting lists

whenever a member of the candidate list is matched promote the next member of the waiting list for the respective query

hash table


XML Storage - 32

Step 3: Adding Level Information

Q(p,l,r): p position of element in XPath expressionl required level of element in documentr relative level with respect to previous element(na = not applicable, undet = undetermined)

Q(p,l,r) is called a pathnode

Q1 = /A/B//C Q1(1,1,na), Q1(2,2,1), Q1(3,undet,undet)Q2 = //B/*/C/D Q2(1,undet,na), Q2(2,undet,2), Q2(3,undet,1)Q3 = /*/A/C//D Q3(1,2,na), Q3(2,3,1), Q3(3,undet,undet)Q4 = //B/D/E Q4(1,undet,undet), Q4(2,undet,1), Q4(3,undet,1)Q5 = /A/*/*/C//E Q5(1,1,na), Q5(2,4,3), Q5(3,undet,undet)

level l can be dynamically determined (using r) while document is parsed

XML Storage - 33

Processing of Start Element Events

If start(element, level) arrives:

Step 1: obtain candidate list for element

Step 2: for each candidate entry Q(p, l, r)if l > 0 then

if l = level then continue else no match else continue

Step 3: check other conditions if applicable (attributes, text content)

Step 4: if continue and conditions are satisfied if last node of query path: successful match of profile else promote the next node corresponding to Q from the

waiting list to the candidate list

XML Storage - 34

Promotion from Waiting List

Promoting the next entry Q( p+1, lw, rw)

Step 1: add a copy of Q( p+1, lw, rw) from the waiting listto the candidate list/* Copying is required as the node may appear again later (multiple occurrences of the same element) */

Step 2: update the level value

if rw not undet then lw = rw + level else lw = undet

/* The character elements are processed analogously */

XML Storage - 35

Processing of End Element Events

If end(element) arrives:

Remove the entry in the candidate list that has been entered when the corresponding start(element,level) was encountered

Example: Q1 = /A/B//C Q1(1,1,na), Q1(2,2,1), Q1(3,undet,undet)

At the beginning Q1(1,1,na) corresponding to A is in the candidate listWhen A is encountered Q1(2,2,1) is promotedWhen end A is encountered Q1(2,2,1) is removed (the pathnode corresponding to B which is following A in the query)

XML Storage - 36

ExampleDocument

<A>text<C></C><C><D></D></C>

</A>

Document events

start(A, 1), start(B, 2), PCDATA(3), end(B, 2), start(C, 2), end(C, 2), start(b, 2), start(C, 3), start(D, 4), end(D, 4), end(C, 3), end(B, 2), end(A, 1)

Query Q = /A/B//D

Pathnodes Q(1, 1, na), Q(2, 2, 1), Q(3, undet, undet)

XML Storage - 37

Event Processing - ExampleEvent Processing Candidate list

init A: Q(1,1,na), B:_, C:_, D:_, E:_

start(A,1) promote Q(2, 2, 1) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:_, E:_

start(B,2) promote Q(3, undet, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:Q(3, undet, undet), E:_

PCDATA(3)

end(B) remove Q(3, undet, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:_, E:_

start(C,2)

end(C,2)

start(B,2) promote Q(3, undet, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:Q(3, undet, undet), E:_

start(C,3)

start(D,4) match A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:Q(3, undet, undet), E:_

end(D)

end(C)

end(B) remove Q(3, 3, undet) A: Q(1,1,na), B: Q(2, 2, 1), C:_, D:_, E:_

end(A) remove Q(2, 2, 1) A: Q(1,1,na), B:_, C:_, D:_, E:_

XML Storage - 38

Event Processing - Illustration

promoteQ(3, undet, undet)

removeQ(3, 3, undet)

removeQ(3, undet, undet)B

A

C B

pcdata C

D

promote Q(2, 2, 1)

promote Q(3, undet, undet)

Q=/A/B//D

match

removeQ(2, 2, 1)

XML Storage - 39

Summary

• How can database query processing and message filtering be compared ?

• Which approaches did we introduce to perform XML filtering ?

• Which properties are stored in path nodes ? Why is the relative position required in a path node ?

• Which properties of path nodes are modified during document event processing ? What needs to be done in document event processing when an end element tag arrives ?

• Why ist postprocesssing of filter conditions required ?

• Why is it not efficient to build candidate lists based on the initial elements of query paths ?

XML Storage - 40

References• Course material based on

– Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314

– Dongwon Lee, Wesley W. Chu: CPI: Constraints-Preserving Inlining algorithm for mapping XML DTD to relational schema. DKE 39(1): 3-25 (2001)

– Daniela Florescu, Donald Kossmann: Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin 22(3): 27-34 (1999)

– Mehmet Altinel, Michael J. Franklin: Efficient Filtering of XML Documents for Selective Dissemination of Information. VLDB 2000: 53-64

• More background material– http://www.acm.org/sigmod/record/index.html

SIGMOD RECORD, Volume 30, Number 3, September 2001Special Section on Advanced XML Data Processing (ed. K. Aberer)