Download - Chapter 10: XML
![Page 1: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/1.jpg)
Chapter 10: XML
The world of XML
![Page 2: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/2.jpg)
The Data
Semistructured data instance = a large graph
![Page 3: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/3.jpg)
The indexing problem
• The storage problem– Store the graph in a relational DBMS– Develop a new database storage structure
• The indexing problem:– Input: large, irregular data graph– Output: index structure for evaluating (regular) path
expressions, e.g.bib.paper.author.firstname
![Page 4: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/4.jpg)
XSet: a simple index for XML
• Part of the Ninja project at Berkeley• Example XML data:
![Page 5: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/5.jpg)
XSet: a simple index for XML
Each node = a hashtableEach entry = list of pointers to data nodes (not shown)
• SELECT X FROM part.name X -yes• SELECT X FROM part.supplier.name X -yes• SELECT X FROM part.*.subpart.name X -maybe• SELECT X FROM *.supplier.name X -maybe
![Page 6: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/6.jpg)
Region Algebras
• structured text = text with tags (like XML)• data = sequence of characters [c1c2c3 …]• region = interval in the text
– representation (x,y) = [cx,cx+1, … cy]– example: <section> … </section>
• region set = a set of regions– example all <section> regions (may be nested)
• region algebra = operators on region set, s1 op s2s1 op s2
• s1 intersect s2 = {r | r s1, r s2}• s1 included s2 = {r | rs1, r’ s2, r r’}• s1 including s2 = {r | r s1, r’ s2, r r’}• s1 parent s2 = {r | r s1, r’ s2, r is a parent of r’}• s1 child s2 = {r | r s1, r’ s2, r is child of r’}
![Page 7: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/7.jpg)
Region Algebras
part.name name child (part child root)part.supplier.name name child (supplier child (part child root))*.supplier.name name child supplierpart.*.subpart.name name child (subpart included (part child root))
Region expressions correspond to simple XPath expressions
•s1 child s2 = {r | r s1, r’ s2, r is child of r’}
![Page 8: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/8.jpg)
Efficient computation of Region Algebra Operators
Example: s1 included s2s1 = {(x1,x1'), (x2,x2'), …}s2 = {(y1,y1'), (y2,y2'), …}(i.e. assume each consists of disjoint regions)
Algorithm:if xi < yj then i := i + 1if xi' > yj' then j := j + 1otherwise: print (xi,xi'), do i := i + 1
Can do in sub-linear time when one region is very small
![Page 9: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/9.jpg)
Storage structures for region algebras
• Every node is characterised by an integer pair (x,y)• This means we have a 2-d space• Any 2-d space data structure can be used
• If you use a (pre-order,post-order) numbering you get triangular filling of 2-d(to be discussed later)
![Page 10: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/10.jpg)
Alternative mappings
• Mapping the structure to the relational world– The Edge approach– The Attribute approach– The Universal Table approach– The Normalized Universal approach– The Monet/XML approach– The Dataguide approach
• Mapping values– Separate value tables– Inlining
• Shredding
![Page 11: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/11.jpg)
Dataguide approach
• Developed in the context of Lore, Lorel (Stanford Univ)• Predecessor of the Monet/XML model• Observation:
– queries in the graph-representation take a limited form– they are partial walks from the root to an object of
interest– this behaviour was stressed by the query language
Lorel, i.e. an SQL-based query language based on processing regular expressions
SELECT X
FROM (Bib.*.author).(lastname|firstname).Abiteboul X
![Page 12: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/12.jpg)
DataGuides
Definitiongiven a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:- every path in DB also occurs in G- every path in G occurs in DB- every path in G is unique
![Page 13: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/13.jpg)
Dataguides
Example:
![Page 14: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/14.jpg)
DataGuides
• Multiple DataGuides for the same data:
![Page 15: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/15.jpg)
DataGuides
DefinitionLet w, w’ be two words (I.e word queries) and G a graphw G w’ if w(G) = w’(G)
DefinitionG is a strong dataguide for a database DB if G is the same as DB
Example:- G1 is a strong dataguide- G2 is not strong
person.project !DB dept.projectperson.project !G2 dept.project
![Page 16: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/16.jpg)
DataGuides
• Constructing the strong DataGuide G:Nodes(G)={{root}}Edges(G)=while changes do
choose s in Nodes(G), a in Labelsadd s’={y|x in s, (x -a->y) in Edges(DB)} to Nodes(G)add (x -a->y) to Edges(G)
• Use hash table for Nodes(G)• This is precisely the powerset automaton construction.
![Page 17: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/17.jpg)
DataGuides
• How large are the dataguides ?– if DB is a tree, then size(G) <= size(DB)
• why? answer: every node is in exactly one extent of G• here: dataguide = XSet
– How many nodes does the strong dataguide have for this DB ? 20 nodes (least common
multiple of 4 and 5)
Dataguides usually fail on data with cyclic schemas, like:
![Page 18: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/18.jpg)
Monet XML approachMonet XML approach
![Page 19: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/19.jpg)
Monet XML approach
![Page 20: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/20.jpg)
Monet XML approach
![Page 21: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/21.jpg)
Monet XML approach
![Page 22: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/22.jpg)
Monet XML approach
![Page 23: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/23.jpg)
• Querying the XML world
![Page 24: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/24.jpg)
Querying and Transforming XML Data
• Standard XML querying/translation languages– XPath
• Simple language consisting of path expressions
– XSLT• Simple language designed for translation from XML to XML
and XML to HTML
– XQuery• An XML query language with a rich set of features
• Wide variety of other languages have been proposed, and some served as basis for the Xquery standard– XML-QL, Quilt, XQL, …
![Page 25: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/25.jpg)
XPath
• XPath is used to address (select) parts of documents using path expressions
• A path expression is a sequence of steps separated by “/”– Think of file names in a directory hierarchy
• Result of path expression: set of values that along with their containing elements/attributes match the specified path
• E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns <name>Joe</name><name>Mary</name>
• E.g. /bank-2/customer/name/text( ) returns the same names, but without the enclosing tags
![Page 26: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/26.jpg)
XPath (Cont.)
• The initial “/” denotes root of the document (above the top-level tag)• Path expressions are evaluated left to right
– Each step operates on the set of instances produced by the previous step• Selection predicates may follow any step in a path, in [ ]
– E.g. /bank-2/account[balance > 400] • returns account elements with a balance value greater than 400• /bank-2/account[balance] returns account elements containing a balance
subelement• Attributes are accessed using “@”
– E.g. /bank-2/account[balance > 400]/@account-number• returns the account numbers of those accounts with balance > 400
– IDREF attributes are not dereferenced automatically (more on this later)
![Page 27: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/27.jpg)
Functions in XPath
• XPath provides several functions– The function count() at the end of a path counts the number of
elements in the set generated by the path• E.g. /bank-2/account[customer/count() > 2]
– Returns accounts with > 2 customers– Also function for testing position (1, 2, ..) of node w.r.t. siblings
• Boolean connectives and and or and function not() can be used in predicates
• IDREFs can be referenced using function id()– id() can also be applied to sets of references such as IDREFS and
even to strings containing multiple references separated by blanks– E.g. /bank-2/account/id(@owner)
• returns all customers referred to from the owners attribute of account elements.
![Page 28: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/28.jpg)
More XPath Features• Operator “|” used to implement union
– E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower)• gives customers with either accounts or loans• However, “|” cannot be nested inside other operators.
• “//” can be used to skip multiple levels of nodes – E.g. /bank-2//name
• finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
• A step in the path can go to (13 variations in the standard):parents, siblings, ancestors and descendants
of the nodes generated by the previous step, not just to the children– “//”, described above, is a short from for specifying “all descendants”– “..” specifies the parent.
![Page 29: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/29.jpg)
Pathfinder
• Xpath is essential for the implementation of an Xquery processor. It is strongly related to the data structures and its primitives.
• A state-of-the-art implementation is MonetDB/Pathfinder developed by Uni. Konstantz, Twente University, CWI
![Page 30: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/30.jpg)
Pathfinder Uni Konstantz
![Page 31: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/31.jpg)
Pathfinder
![Page 32: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/32.jpg)
Pathfinder
![Page 33: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/33.jpg)
Pathfinder
![Page 34: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/34.jpg)
![Page 35: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/35.jpg)
Pathfinder
![Page 36: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/36.jpg)
Pathfinder
![Page 37: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/37.jpg)
Pathfinder
![Page 38: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/38.jpg)
pathfinder
![Page 39: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/39.jpg)
Pathfinder
![Page 40: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/40.jpg)
Staircase join
![Page 41: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/41.jpg)
Staircase join
![Page 42: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/42.jpg)
Pathfinder
![Page 43: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/43.jpg)
Pathfinder
![Page 44: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/44.jpg)
Pathfinder
![Page 45: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/45.jpg)
Pathfinder
![Page 46: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/46.jpg)
XQuery
![Page 47: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/47.jpg)
XQuery
• XQuery is a general purpose query language for XML data
• Currently being standardized by the World Wide Web Consortium (W3C)– The textbook description is based on a March 2001 draft of the standard.
The final version may differ, but major features likely to stay unchanged.
• Alpha version of XQuery engine– Galax http://db.bell-labs.com/galax/– IPSI-IQ– Xpath visualized http://www.vbxml.com/xpathvisualizer/ – MonetDB/Pathfinder– Xhive
• XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL
![Page 48: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/48.jpg)
XQuery
• XQuery uses a for … let … where .. return … syntax
• for SQL from where SQL where return SQL select let allows temporary variables, and has no equivalent in SQL
• Variables make it possible to keep the state of processing around and severely complicates optimization
![Page 49: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/49.jpg)
FLWR Syntax in XQuery
• For clause uses XPath expressions, and variables in the for- clause ranges over values in the set returned by Xpath
• XPath is used to address (select) parts of documents using path expressions• A path expression is a sequence of steps separated by “/”• Result of path expression: set of values that along with their containing
elements/attributes match the specified path
• E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns
<name>Joe</name><name>Mary</name>
• E.g. /bank-2/customer/name/text( ) returns the same names, but without the enclosing tags
![Page 50: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/50.jpg)
XPath
• XPath is used to address (select) parts of documents using path expressions• A path expression is a sequence of steps separated by “/”
– Think of file names in a directory hierarchy• Result of path expression: set of values that along with their containing
elements/attributes match the specified path
• E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns <name>Joe</name><name>Mary</name>
• E.g. /bank-2/customer/name/text( ) returns the same names, but without the enclosing tags
![Page 51: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/51.jpg)
XPath (Cont.)
• The initial “/” denotes root of the document (above the top-level tag)• Path expressions are evaluated left to right
– Each step operates on the set of instances produced by the previous step
• Selection predicates may follow any step in a path, in [ ]– E.g. /bank-2/account[balance > 400]
• returns account elements with a balance value greater than 400• /bank-2/account[balance] returns account elements containing a balance subelement
• Attributes are accessed using “@”– E.g. /bank-2/account[balance > 400]/@account-number
• returns the account numbers of those accounts with balance > 400– IDREF attributes are not dereferenced automatically (more on this later)
![Page 52: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/52.jpg)
Functions in XPath
• XPath provides several functions– The function count() at the end of a path counts the number of elements
in the set generated by the path• E.g. /bank-2/account[customer/count() > 2]
– Returns accounts with > 2 customers– Also function for testing position (1, 2, ..) of node w.r.t. siblings
• Boolean connectives and and or and function not() can be used in predicates• IDREFs can be referenced using function id()
– id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
– E.g. /bank-2/account/id(@owner) • returns all customers referred to from the owners attribute of account
elements.
![Page 53: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/53.jpg)
More XPath Features• Operator “|” used to implement union
– E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower)• gives customers with either accounts or loans• However, “|” cannot be nested inside other operators.
• “//” can be used to skip multiple levels of nodes – E.g. /bank-2//name
• finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
• A step in the path can go to (13 variations in the standard):parents, siblings, ancestors and descendants
of the nodes generated by the previous step, not just to the children– “//”, described above, is a short from for specifying “all descendants”– “..” specifies the parent.
![Page 54: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/54.jpg)
FLWR Syntax in XQuery
• Simple FLWR expression in XQuery – find all accounts with balance > 400, with each result enclosed in an
<account-number> .. </account-number> tag for $x in /bank-2/account let $acctno := $x/@account-number where $x/balance > 400 return <account-number> $acctno </account-number>
• Let clause not really needed in this query, and selection can be done In XPath. Query can be written as:
for $x in /bank-2/account[balance>400]return <account-number> $X/@account-number
</account-number>
![Page 55: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/55.jpg)
Path Expressions and Functions
• Path expressions are used to bind variables in the for clause, but can also be used in other places– E.g. path expressions can be used in let clause, to bind variables to
results of path expressions• The function distinct( ) can be used to removed duplicates in path
expression results• The function document(name) returns root of named document
– E.g. document(“bank-2.xml”)/bank-2/account• Aggregate functions such as sum( ) and count( ) can be applied to path
expression results• XQuery does not support groupby, but the same effect can be got by
nested queries, with nested FLWR expressions within a return clause – More on nested queries later
![Page 56: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/56.jpg)
Joins• Joins are specified in a manner very similar to SQL
for $b in /bank/account, $c in /bank/customer, $d in /bank/depositor
where $a/account-number = $d/account-number and $c/customer-name = $d/customer-name
return <cust-acct> $c $a </cust-acct>• The same query can be expressed with the selections specified as
XPath selections: for $a in /bank/account $c in /bank/customer
$d in /bank/depositor[ account-number =$a/account-number and customer-name = $c/customer-name] return <cust-acct> $c $a</cust-acct>
![Page 57: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/57.jpg)
Changing Nesting Structure• The following query converts data from the flat structure for bank information
into the nested structure used in bank-1 <bank-1>
for $c in /bank/customer return
<customer> $c/* for $d in /bank/depositor[customer-name = $c/customer-name], $a in /bank/account[account-number=$d/account-number] return $a
</customer> </bank-1>
• $c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag
• Exercise for reader: write a nested query to find sum of accountbalances, grouped by branch.
![Page 58: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/58.jpg)
XQuery Path Expressions
• $c/text() gives text content of an element without any subelements/tags
• XQuery path expressions support the “–>” operator for dereferencing IDREFs– Equivalent to the id( ) function of XPath, but simpler to
use– Can be applied to a set of IDREFs to get a set of results
![Page 59: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/59.jpg)
Sorting in XQuery • Sortby clause can be used at the end of any expression. E.g. to return
customers sorted by name for $c in /bank/customer return <customer> $c/* </customer> sortby(name)
• Can sort at multiple levels of nesting (sort by customer-name, and by account-number within each customer)
<bank-1> for $c in /bank/customer return
<customer> $c/* for $d in /bank/depositor[customer-name=$c/customer-name], $a in /bank/account[account-number=$d/account-number]
return <account> $a/* </account> sortby(account-number)</customer> sortby(customer-name)
</bank-1>
![Page 60: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/60.jpg)
Functions and Other XQuery Features
• User defined functions with the type system of XMLSchema function balances(xsd:string $c) returns list(xsd:numeric) { for $d in /bank/depositor[customer-name = $c], $a in /bank/account[account-number=$d/account-number] return $a/balance
}• Types are optional for function parameters and return values• Universal and existential quantification in where clause predicates
– some $e in path satisfies P – every $e in path satisfies P
• XQuery also supports If-then-else clauses
![Page 61: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/61.jpg)
• Xmark http://www.xml-benchmark.org
• Used in most experiments on Xpath and Xquery evaluation
• Old figures on hand-compiled queries for the dataguide approach can be found in
• http://www.cwi.nl/~mk/xmarkArchive/Reports/Monet_report/monet_report.html
![Page 62: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/62.jpg)
Xmark
![Page 63: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/63.jpg)
XMark
![Page 64: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/64.jpg)
Monet XML approachMonet XML approach
![Page 65: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/65.jpg)
XMark
• Q1 Return the name of the person with ID ‘personal’
FOR $b IN /site/people/person[@id=‘personal’]RETURN $b/name/text()
![Page 66: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/66.jpg)
Query optimizer challenges
• Mapping Xquery to a RBDMS should be able – to deal with ordered tables– to skip sub-documents– to perform dynamic type casting– to avoid unnecessary construction of string
intermediates– to recognize join-paths for fast access– to balance fragmentation and reconstruction cose
![Page 67: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/67.jpg)
Xmark answers
• Q2: Return the initial increases of all open auctions.– This query evaluates the cost of array look-ups. Note that this
query may actually be harder to evaluate than it looks; especially relational back-ends may have to struggle with rather complex aggregations to select the bidder element with index 1.
FOR $b IN document("auction.xml")/site/open_auctions/open_auctionRETURN <increase> $b/bidder[1]/increase/text() </increase>
![Page 68: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/68.jpg)
XMark
• Q3: Return the IDs of all open auctions whose current increase is at least twice as high as the initial increase.– This is a more complex application of index lookups. In the case
of a relational DBMS, the query can take advantage of set-valued aggregates on the index attribute to accelerate the execution.
FOR $b IN document("auction.xml")/site/open_auctions/open_auctionWHERE $b/bidder[0]/increase/text() *2 <= $b/bidder[last()]/increase/text()RETURN <increase first=$b/bidder[0]/increase/text() last=$b/bidder[last()]/increase/text()/>
![Page 69: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/69.jpg)
Xmark result
• Q4: List the reserves of those open auctions where a certain person issued a bid before another person– This time, we stress the textual nature of XML documents by
querying the tag order in the source document
FOR $b IN document("auction.xml")/site/open_auctions/open_auctionWHERE $b/bidder/personref[id="person18829"] BEFORE $b/bidder/personref[id="person10487"]RETURN <history> $b/initial/text() </history>
![Page 70: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/70.jpg)
Xmark answers
• Q5: How many sold items cost more than 40?– Strings are the generic data type in XML documents. Queries that
interpret strings will often need to cast strings to another data type that carries more semantics. This query challenges the DBMS in terms of the casting primitives it provides. Especially, if there is no additional schema information or just a DTD at hand, casts are likely to occur frequently.
COUNT (FOR $i document("auction.xml")/site/closed_auctions/closed_auction WHERE $i/price/text() >= 40 RETURN $i/price)
![Page 71: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/71.jpg)
Xmark results
Q6: How many items are listed on all continents? • Regular path expressions are a fundamental building block of virtually
every query language for XML or semi-structured data. These queries investigate how well the query processor can optimize path expressions and prune traversals of irrelevant parts of the tree.
FOR $b IN document("auction.xml")/site/regionsRETURN COUNT ($b//item)
![Page 72: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/72.jpg)
Xmark results
Q7: How many pieces of prose are in our database?• A good evaluation engine should realize that there is no need to
traverse the complete document tree to evaluate such expressions.Also note that the COUNT aggregation does not require a complete traversal of the tree. Just the cardinality of the respective relation is queried. Note that the tag <email> does not exist in the database document.
FOR $p IN document("auction.xml")/siteRETURN count($p//description) + count($p//annotation) +
count($p//email);
![Page 73: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/73.jpg)
Xmark results
Q8: List the names of persons and the number of items they bought. (joins person, closed\_auction)
• References are an integral part of XML as they allow richer relationships than just hierarchical element structures. This query defines horizontal traversals with increasing complexity. A good query optimizer should take advantage of the cardinalities of the sets to be joined.
FOR $p IN document("auction.xml")/site/people/personLET $a := FOR $t IN
document("auction.xml")/site/closed_auctions/closed_auction WHERE $t/buyer/@person = $p/@id RETURN $tRETURN <item person=$p/name/text()> COUNT ($a) </item>
![Page 74: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/74.jpg)
Xmark results
Q9: List the names of persons and the names of the items they bought in Europe. (joins person, closed_auction, item)
• References are an integral part of XML as they allow richer relationships than just hierarchical element structures. These queries define horizontal traversals with increasing complexity. A good query optimizer should take advantage of the cardinalities of the sets to be joined.
FOR $p IN document("auction.xml")/site/people/personLET $a := FOR $t IN
document("auction.xml")/site/closed_auctions/closed_auction LET $n := FOR $t2 IN document("auction.xml")/site/regions/europe/item WHERE $t/itemref/@item = $t2/@id RETURN $t2 WHERE $p/@id = $t/buyer/@person RETURN <item> $n/name/text() </item>RETURN <person name=$p/name/text()> $a </person>
![Page 75: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/75.jpg)
Xmark results
Q10: List all persons according to their interest; use french markup in the result.
• Constructing new elements may put the storage engine under stress especially in the context of creating materialized document views. The following query reverses the structure of person records by grouping them according to the interest profile of a person. Large parts of the person records are repeatedly reconstructed. To avoid simple copying of the original database we translate the mark-up into french.
![Page 76: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/76.jpg)
FOR $i IN DISTINCT document("auction.xml")/site/people/person/profile/interest/@category
LET $p := FOR $t IN document("auction.xml")/site/people/person WHERE $t/profile/interest/@category = $i RETURN <personne> <statistiques> <sexe> $t/gender/text() </sexe>, <age> $t/age/text() </age>, <education> $t/education/text()</education>, <revenu> $t/income/text() </revenu> </statistiques>,
![Page 77: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/77.jpg)
<coordonnees> <nom> $t/name/text() </nom>, <rue> $t/street/text() </rue>, <ville> $t/city/text() </ville>, <pays> $t/country/text() </pays>, <reseau> <courrier> $t/email/text() </courrier>, <pagePerso> $t/homepage/text()</pagePerso> </reseau>, </coordonnees> <cartePaiement> $t/creditcard/text()</cartePaiement> </personne>RETURN <categorie> <id> $i </id>, $p </categorie>
![Page 78: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/78.jpg)
Xmark results
Q11: For each person, list the number of items currently on sale whose price does not exceed 0.02\% of the person's income
• This query tests the database's ability to handle large (intermediate) results. This time, joins are on the basis of values. The difference between these queries and the reference chasing queries Q8 and Q9 is that references are specified in the DTD and may be optimized with logical OIDs for example. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities.
FOR $p IN document("auction.xml")/site/people/personLET $c := FOR $i IN
document("auction.xml")/site/open_auctions/open_auction/initial WHERE $p/profile/@income > (5000 * $i/text()) RETURN $iRETURN <items name=$p/profile/@income> COUNT ($c) </items>
![Page 79: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/79.jpg)
Xmark results
12: For each richer-than-average person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income
• This query tests the database's ability to handle large (intermediate) results. This time, joins are on the basis of values. The difference between these queries and the reference chasing queries Q8 and Q9 is that references are specified in the DTD and may be optimized with logical OIDs for example. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities. FOR $p IN document("auction.xml")/site/people/person
FOR $p IN document("auction.xml")/site/people/person LET $l := FOR $i IN
document("auction.xml")/site/open_auctions/open_auction/initial WHERE $p/profile/@income > (5000 * $i/text()) RETURN $iWHERE $p/profile/@income > 50000RETURN <items income=$p/profile/@income> COUNT ($l) </items>
![Page 80: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/80.jpg)
Xmark results
Q13: List the names of items registered in Australia along with their descriptions.
• A key design for XML->DBMS mappings is to determine the fragmentation criteria. The complementary action is to reconstruct the original document from its broken-down representation. Query 13 tests for the ability of the database to reconstruct portions of theoriginal XML document.
• FOR $i IN document("auction.xml")/site/regions/australia/item RETURN <item name=$i/name/text()> $i/description </item>
![Page 81: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/81.jpg)
Xmark results
Q14:Return the names of all items whose description contains the word `gold'.
• We continue to challenge the textual nature of XML documents; this time, we conduct a full-text search in the form of keyword search. Although full-text scanning could be studied in isolation we think that the interaction with structural mark-up is essential as the concepts are considered orthogonal; so query Q14 is restricted to a subset of the document by combining content and structure.
FOR $i IN document("auction.xml")/site//itemWHERE CONTAINS ($i/description,"gold")RETURN $i/name/text()
![Page 82: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/82.jpg)
Xmark results
Q15: Print the keywords in emphasis in annotations of closed auctions.• We now try to quantify the costs of long path traversals that don't
include wildcards. We first descend deep into the tree (Query 15) and then return again (Query 16). Both queries only check for the existence of paths rather than selecting paths with predicates.
FOR $a IN document("auction.xml")/site/closed_auctions/closed_auction/annotation/description/parlist/listitem/parlist/listitem/text/emph/keyword/text()
RETURN <text> $a </text>
![Page 83: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/83.jpg)
Xmark results
Q16: Return the IDs of those auctions that have one or more kweywords in emphasis.
FOR $a IN document("auction.xml")/site/closed_auctions/closed_auctionWHERE NOT EMPTY ($a/annotation/description/parlist/listitem/parlist/\ listitem/text/emph/keyword/text())RETURN <person id=$a/seller/@person />
![Page 84: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/84.jpg)
Xmark results
Q17:Which persons don't have a homepage?• This is to test how well the query processors knows to deal with the
semi-structured aspect of XML data, especially elements that are declared optional in the DTD.
FOR $p IN document("auction.xml")/site/people/personWHERE EMPTY($p/homepage/text())RETURN <person name=$p/name/text()/>
![Page 85: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/85.jpg)
Xmark results
Q18:Convert the currency of the reserve of all open auctions to another currency.
• This query puts the application of user defined functions (UDF) to the proof. In the XML world, UDFs are of particular importance because they allow the user to assign semantics to generic strings that go beyond type coercion.
FUNCTION CONVERT ($v){ RETURN 2.20371 * $v -- convert Dfl to Euros}
FOR $i IN document("auction.xml")/site/open_auctions/open_auction/RETURN CONVERT($i/reserve/text())
![Page 86: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/86.jpg)
Query optimizer challenges
• Mapping Xquery to a RBDMS should be able – to deal with ordered tables– to skip sub-documents– to perform dynamic type casting– to avoid unnecessary construction of string
intermediates– to recognize join-paths for fast access– to balance fragmentation and reconstruction cose
![Page 87: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/87.jpg)
Xmark results
Effect of loading 100Mb document into DBMS
![Page 88: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/88.jpg)
Xmark results
![Page 89: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/89.jpg)
Xmark results
![Page 90: Chapter 10: XML](https://reader033.vdocuments.us/reader033/viewer/2022052312/5681389a550346895da04916/html5/thumbnails/90.jpg)
Pathfinder/MonetDB 2004 implementation in seconds