managing xml and semistructured data lecture : indexes
Post on 20-Dec-2015
Managing XML and Semistructured Data
Lecture : Indexes
OEM vs. XML
• OEM's objects correspond to elements in XML.
• Sub-elements in XML are inherently ordered.
• XML elements may optionally include a list of attribute-value pairs.
• Graph structure (multiple incoming edges) is specified in XML with references (ID, IDREF attributes), e.g. the project attribute.
OEM to XML
• Example:
– <member project="&5 &6"><name>Jones</name><age>46</age><office>
<building>gates</building><room>252</room>
</office></member>
• This corresponds to rightmost member in the example OEM, where project is an attribute.
Select x
From A.B x
Where exists y in x.C: y = 5
In this lecture
• Indexes
– XSet
– Region algebras
– Indexes for Arbitrary Semistructured Data
– Dataguides
– 1-2 indexes
Resources
• Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99
• XSet description: http://www.openhealth.org/XSet/
• Data on the Web, by Abiteboul, Buneman, Suciu: section 8.2
The problem
• Input: large, irregular data graph
• Output: index structure for evaluating regular path expressions
The Data
Semistructured data instance = a large graph
The queries
Regular expressions (using Lorel-like syntax)
SELECT X
FROM (Bib.*.author).(lastname|firstname).Abiteboul X
Select x
from part._*.supplier.name x
Requires: traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.
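The traversal described above can be sketched as a breadth-first search over (data node, position in the path expression) pairs. This is only an illustrative sketch: the function name eval_path, the edge representation, and the sample graph are assumptions, and only label steps plus the `_*` wildcard from the query above are supported.

```python
from collections import deque

def eval_path(edges, root, steps):
    """Return all nodes reachable from root by a path matching steps.

    edges: dict mapping node -> list of (label, target) pairs.
    steps: list of labels; the special step "_*" matches any label sequence.
    """
    start = (root, 0)
    seen = {start}              # visited (node, step index) pairs, so cycles terminate
    queue = deque([start])
    result = set()
    while queue:
        node, i = queue.popleft()
        if i == len(steps):
            result.add(node)    # whole expression matched
            continue
        successors = []
        if steps[i] == "_*":
            successors.append((node, i + 1))        # match the empty sequence
            for _label, target in edges.get(node, []):
                successors.append((target, i))      # consume one arbitrary label
        else:
            for label, target in edges.get(node, []):
                if label == steps[i]:
                    successors.append((target, i + 1))
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return result

# Hypothetical sample data graph (not taken from the lecture).
edges = {
    "root": [("part", "p1"), ("part", "p2")],
    "p1": [("name", "n1"), ("supplier", "s1")],
    "p2": [("subpart", "p3")],
    "p3": [("supplier", "s2")],
    "s1": [("name", "n2")],
    "s2": [("name", "n3")],
}
answer = eval_path(edges, "root", ["part", "_*", "supplier", "name"])
```

Without an index, every such query touches a large part of the graph, which is the cost the index structures in this lecture try to avoid.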
Select X
From part._*.supplier: {name: X, address: "Philadelphia"}
Need index on values to narrow search to parts of the database that contain the string “Philadelphia”.
Analyzing the problem
• what kind of data
– tree data (XML): easier to index
– graph data: used in more complex applications
• what kind of queries
– restricted regular expressions (e.g. XPath): may be more efficient
XSet: a simple index for XML
• Part of the Ninja project at Berkeley
• Example XML data:
XSet: a simple index for XML
Each node = a hashtable
Each entry = list of pointers to data nodes (not shown)
XSet: Efficient query evaluation
• To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.
• R4 – following part leads to h2; traverse all nodes in the index (corresponding to *), then continue with the path subpart.name.
• Thus, explore the entire subtree dominated by h2.
• Will be efficient if index is small and fits in memory
• R3 – the leading wildcard forces us to consider all nodes in the index tree, which is less efficient than R4.
• Can index the index itself.
– Retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
(R1) SELECT X FROM part.name X -yes
(R2) SELECT X FROM part.supplier.name X -yes
(R3) SELECT X FROM *.supplier.name X -maybe
(R4) SELECT X FROM part.*.subpart.name X -maybe
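The evaluation of R1 to R4 can be sketched with nested hash tables. This is a minimal sketch of the idea on these slides, not the actual XSet implementation: the class IndexNode, the functions build_index and lookup, and the sample tree are all assumptions made for illustration.

```python
class IndexNode:
    """One index node: a hash table from labels to child index nodes."""
    def __init__(self):
        self.children = {}   # label -> IndexNode
        self.extent = []     # ids standing in for pointers to data nodes

def build_index(tree, index=None):
    """tree is (node_id, [(label, subtree), ...]); returns the root index node."""
    if index is None:
        index = IndexNode()
    node_id, kids = tree
    index.extent.append(node_id)
    for label, sub in kids:
        child = index.children.setdefault(label, IndexNode())
        build_index(sub, child)
    return index

def descendants(node):
    """Yield node and every index node below it (used for the * wildcard)."""
    yield node
    for child in node.children.values():
        yield from descendants(child)

def lookup(index, steps):
    """Evaluate a path of labels; "*" matches any sequence of labels."""
    current = [index]
    for step in steps:
        if step == "*":
            current = [d for n in current for d in descendants(n)]
        else:
            current = [n.children[step] for n in current if step in n.children]
    found = set()
    for n in current:
        found.update(n.extent)
    return sorted(found)

# Hypothetical sample XML tree, written as nested tuples.
tree = ("root", [
    ("part", ("p1", [("name", ("n1", [])),
                     ("supplier", ("s1", [("name", ("n2", []))]))])),
    ("part", ("p2", [("name", ("n3", [])),
                     ("subpart", ("p3", [("name", ("n4", []))]))])),
])
index = build_index(tree)
r1 = lookup(index, ["part", "name"])
r4 = lookup(index, ["part", "*", "subpart", "name"])
```

R1 and R2 follow one hash-table link per label, while R3 and R4 fall back to scanning index nodes at the wildcard, which matches the yes/maybe annotations above.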
Region Algebras
• Structured text = text with tags (like XML)
• New Oxford English Dictionary
• critical limitation: ordered data only (like text)
• Assume: data given as an XML text file, and implicit ordering in the file.
• less critical limitation: restricted regular expressions
Region Algebras: Definitions
• data = sequence of characters [c1 c2 c3 …]
• region = segment of the text in a file
– representation (x,y) = [cx, cx+1, …, cy], x – start position, y – end position of the region
– example: <section> … </section>
• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other
– example: all <section> regions (may be nested)
– Tree data – each node defines a region and each set of nodes defines a region set.
– example: region p2 consists of the text under p2; the set {p2,s2,s1} is a region set with three regions
Representation of a region set
• Example: the <subpart> region set:
• region algebra = operators on region sets: s1 op s2 defines a new region set
Region algebra: some operators
• s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
• s1 included s2 = {r | r ∈ s1, ∃r´ ∈ s2, r ⊆ r´}
• s1 including s2 = {r | r ∈ s1, ∃r´ ∈ s2, r ⊇ r´}
• s1 parent s2 = {r | r ∈ s1, ∃r´ ∈ s2, r is a parent of r´}
• s1 child s2 = {r | r ∈ s1, ∃r´ ∈ s2, r is a child of r´}
Examples:
<subpart> included <part> = { s1, s2, s3, s5}
<part> including <subpart> = {p2, p3}
<name> child <part> = {n1, n3, n12}
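The operators can be sketched over regions stored as (start, end) pairs. This is an illustrative sketch only: the function names, the extra universe argument (the set of all regions, needed to decide direct parenthood), and the sample positions are assumptions and do not reproduce the region sets of the lecture's figure.

```python
def covers(a, b):
    """True if region a includes region b (a ⊇ b)."""
    return a[0] <= b[0] and b[1] <= a[1]

def intersect(s1, s2):
    return {r for r in s1 if r in s2}

def included(s1, s2):
    return {r for r in s1 if any(covers(q, r) for q in s2)}

def including(s1, s2):
    return {r for r in s1 if any(covers(r, q) for q in s2)}

def is_parent(p, c, universe):
    """p strictly includes c and no other region of universe lies in between."""
    if p == c or not covers(p, c):
        return False
    return not any(m != p and m != c and covers(p, m) and covers(m, c)
                   for m in universe)

def parent(s1, s2, universe):
    return {r for r in s1 if any(is_parent(r, q, universe) for q in s2)}

def child(s1, s2, universe):
    return {r for r in s1 if any(is_parent(q, r, universe) for q in s2)}

# Hypothetical character positions for two <part> elements.
parts = {(0, 100), (110, 150)}
subparts = {(25, 90)}
names = {(5, 20), (30, 50), (115, 130)}
universe = parts | subparts | names

subparts_in_parts = included(subparts, parts)
parts_with_subparts = including(parts, subparts)
part_names = child(names, parts, universe)
```

In this sample, the name at (30, 50) belongs to the subpart rather than directly to a part, so child excludes it while included would not.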
From path expressions to region expressions
• Use region algebra operators to answer regular path expressions.
• Only restricted forms of regular path expressions can be translated into region algebra operators
– expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.
Region expressions correspond to simple XPath expressions
part.name → name child (part child root)
part.supplier.name → name child (supplier child (part child root))
*.supplier.name → name child supplier
part.*.subpart.name → name child (subpart included (part child root))
From path expressions to region expressions
• Answering more complex queries:
• Translates into the following region algebra expression:
• “Philadelphia” denotes a region set consisting of all regions corresponding to the word “Philadelphia” in the text.
• Such a region can be computed dynamically using a full text index.
• Region expressions correspond to simple XPath expressions
Select X
From *.subpart: {name: X, *.supplier.address: "Philadelphia"}
name child (subpart including (supplier parent (address intersect "Philadelphia")))
Indexes for Arbitrary Semistructured Data
• A semistructured data instance that is a DAG
Indexes for Arbitrary Semistructured Data
• The data represents employees and projects in a company.
• Two kinds of employees – programmers and statisticians
• Three kinds of links to projects – leads, workson, consults
• Index graph – reduced graph that summarizes all paths from the root in the data graph
• Example: node p1 – paths from root to p1 are labeled with the following five sequences:
Project
Employee.leads
Employee.workson
Programmer.employee.leads
Programmer.employee.workson
• Node p2 – paths from root to p2 are labeled by the same five sequences
• p1 and p2 are language-equivalent
Indexes for Arbitrary Semistructured Data
• For each node x in the data graph,
Lx = {w | ∃ a path from the root to x labeled w}
Note that Lx may be infinite if the graph has a cycle!
Two nodes x and y are language equivalent if they have the same language:
∀x,y: x ≡ y ⇔ Lx = Ly
Equivalence class of x: [x] = {y | x ≡ y}
The index graph I is defined by:
Nodes(I) = {[x] | x ∈ nodes(G)}
Edges(I) = {[x] -a-> [y] | x´ ∈ [x], y´ ∈ [y], x´ -a-> y´}
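For an acyclic data graph, the definitions above can be applied literally. The following is a minimal sketch under that assumption (a DAG with a single root); the function name build_language_index and the sample edges are hypothetical, and the approach can take exponential space because it materializes every Lx.

```python
def build_language_index(edges, root):
    """edges: list of (source, label, target) triples of a DAG.

    Returns (index_nodes, index_edges), where each index node is the
    frozenset of data nodes in one language-equivalence class.
    """
    incoming = {}
    nodes = {root}
    for s, a, t in edges:
        incoming.setdefault(t, []).append((s, a))
        nodes.update((s, t))

    memo = {}
    def lang(x):
        # L_x: the set of label sequences on paths from the root to x.
        if x not in memo:
            words = {()} if x == root else set()
            for s, a in incoming.get(x, []):
                words |= {w + (a,) for w in lang(s)}
            memo[x] = frozenset(words)
        return memo[x]

    classes = {}
    for x in nodes:
        classes.setdefault(lang(x), set()).add(x)
    cls = {x: frozenset(classes[lang(x)]) for x in nodes}
    index_edges = {(cls[s], a, cls[t]) for s, a, t in edges}
    return set(cls.values()), index_edges

# Hypothetical fragment in the spirit of the employee/project example.
edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("root", "project", "p1"), ("root", "project", "p2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p3"), ("e2", "workson", "p3"),
]
index_nodes, index_edges = build_language_index(edges, "root")
```

Here e1 and e2 collapse into one index node, as do p1 and p2, so the eight data edges shrink to four index edges.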
Indexes for Arbitrary Semistructured Data
• We have the following equivalences:
e1 ≡ e2
e3 ≡ e4 ≡ e5
p1 ≡ p2
p3 ≡ p4
p5 ≡ p6 ≡ p7
Indexes for Arbitrary Semistructured Data
• Computing path expression queries
– Compute the query on I and obtain a set of index nodes
– Compute the union of all their extents; an extent is a list of pointers to all data nodes in the equivalence class
• Returns nodes h8, h9.
• Their extents are [p5, p6, p7] and [p8], respectively;
• result set = [p5, p6, p7, p8]
• Always: size(I) ≤ size(G)
• Efficient when I can be stored in main memory
• Checking x ≡ y is expensive.
Select X
From statistician.employee.(leads|consults): X
DataGuides
• Goldman & Widom [VLDB 97]
– graph data
– arbitrary regular expressions
DataGuides
Definition
given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique
Dataguides
Example:
DataGuides
• Multiple DataGuides for the same data:
DataGuides
Definition
Let w, w' be two words (i.e. word queries) and G a graph
w ≡G w' if w(G) = w'(G)
Definition
G is a strong dataguide for a database DB if ≡G is the same as ≡DB
DataGuides
Example:
• G1 is a strong dataguide
• G2 is not strong
person.project ≢DB dept.project
person.project ≡G2 dept.project
DataGuides
• Constructing the strong DataGuide G:
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
    choose s in Nodes(G), a in Labels
    add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
    add (s -a-> s') to Edges(G)
• Use hash table for Nodes(G)
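The construction above is essentially the subset construction used to determinize an automaton. A minimal Python sketch follows; the function name strong_dataguide, the worklist in place of "while changes", and the sample database are assumptions, and Python's set of frozensets plays the role of the hash table for Nodes(G).

```python
def strong_dataguide(edges, root):
    """edges: dict mapping node -> list of (label, target) pairs.

    Returns (nodes, guide_edges): nodes is a set of frozensets (the extents),
    guide_edges maps (source extent, label) -> target extent.
    """
    start = frozenset([root])
    nodes = {start}                 # hash table for Nodes(G)
    guide_edges = {}
    worklist = [start]
    while worklist:
        s = worklist.pop()
        by_label = {}
        for x in s:
            for a, y in edges.get(x, []):
                by_label.setdefault(a, set()).add(y)
        for a, targets in by_label.items():
            t = frozenset(targets)  # s' = {y | x in s, x -a-> y}
            guide_edges[(s, a)] = t
            if t not in nodes:      # only new extents need further expansion
                nodes.add(t)
                worklist.append(t)
    return nodes, guide_edges

# Hypothetical database where person.project and dept.project differ.
db = {
    "root": [("person", "e1"), ("person", "e2"), ("dept", "d1")],
    "e1": [("project", "p1")],
    "e2": [("project", "p2")],
    "d1": [("project", "p1")],
}
guide_nodes, guide_edges = strong_dataguide(db, "root")
```

In this sample, p1 appears in two extents, {p1, p2} and {p1}, which illustrates why the extents of a strong DataGuide can overlap on graph data.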
DataGuides
• How large are the dataguides?
– if DB is a tree, then size(G) <= size(DB)
• why? answer: every node is in exactly one extent of G
• here: dataguide = XSet
Dataguides usually fail on data with cyclic schemas, like:
T-Indexes
• Milo & Suciu [ICDT 99]
• 1-index:
– data graph
– arbitrary regular expressions
• 2-index, T-index: for more complex queries, consisting of several regular expressions.
T-Indexes
• T-index: template index
– Trades space for generality
– The class of paths associated with a given T-index is specified by a path template
– Example 1: P x P y. Here P can be replaced by any regular expression.
– Example 2: (*.Restaurant) x P y. The first regular expression is fixed; this T-index takes less space but is less general.
– T-indexes can be generated efficiently.
– The size of a T-index associated to a single regular expression is at most linear in that of the database
1-Indexes
• Database: DB = (V, E, Roots), where V is a finite set of nodes, E is a set of labeled edges, and Roots is a set of root nodes.
• Regular path expressions P ::= | | ƒ | (P|P) | (P.P) | P.* where ƒ are formulas defined over predicates p1, p2,…on the set of data values.
• A path p = v0 -a1-> v1 -a2-> v2 … vn-1 -an-> vn
• Queries: regular path expressions q(DB)
• A query path is an expression of the form P1 x1 P2 x2 … Pn xn, where the xi are variable names and the Pi are path expressions
• A query has the form Select x1, x2, …, xn from P1 x1 P2 x2 … Pn xn
1-Indexes
• Path template t = T1 x1 T2 x2 … Tn xn, where each Ti is a regular expression or one of the placeholders P or F
• Instantiating query paths
– Query path q = the result of instantiating P and F by a regular path expression and some formula, respectively, in template t
– Example: path template t = (*.Restaurant) x1 P x2 Name x3 F x4
• Query path instantiations:
– q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4
– q2 = (*.Restaurant) x1 * x2 Name x3 _ x4 ( _ is a predicate with True)
– q3 = (*.Restaurant) x1 ( | _ ) x2 Name x3 Fridays x4
1-Indexes
• Goal: efficiently compute queries q ∈ inst(P x)
• A first attempt:
• Lu is the set of words on paths from a root to u.
• That is, all the path queries that lead to u.
∀u ∈ V. Lu = {a1…an | ∃ v0 -a1-> … -an-> vn in DB, v0 ∈ Roots, vn = u}
∀u,v ∈ V. u ≡ v ⇔ Lu = Lv
That is, u and v are indistinguishable by path queries from the root.
∀u ∈ V. [u] = {v | u ≡ v} is the equivalence class containing u
1-Indexes
I =
Nodes(I) = { [u] | u ∈ Nodes(DB) }
Edges(I) = { [u] -a-> [u'] | v ∈ [u], v' ∈ [u'], (v -a-> v') ∈ Edges(DB) }
Roots(I) = { [r] | r ∈ Roots(DB) }
q(DB) = { u | [u'] ∈ q(I), u ∈ [u'] }
Example:
That is, there is an edge labeled a in the index between s and s' if there is an edge labeled a between a node in s and a node in s'. Inefficient: construction cost
Analyzing 1-Indexes
• Storing the 1-index
– Associate an oid s to each node in I
– Store graph I in standard form
– Store for each node s, extent(s)
• extent(s) = [v], where s is the oid for [v]
• Always: size(I) <= size(DB) (unlike Dataguide)
• Always: can compute in O(n log n) time, n = size(DB)
• When DB is a tree
– 1-index = Dataguide = XSet
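The slides do not show how the equivalence classes are computed in practice. The Milo and Suciu paper listed under Resources uses bisimulation, a refinement of language equivalence that is cheaper to compute. The sketch below is a naive partition refinement over incoming edges, written for illustration only: it is not the O(n log n) algorithm, and the function name one_index_blocks and the sample data are assumptions.

```python
def one_index_blocks(edges, roots):
    """edges: list of (source, label, target); roots: set of root nodes.

    Returns the set of blocks (frozensets of data nodes) that are stable
    under backward bisimulation; each block is the extent of one index node.
    """
    nodes = set(roots)
    incoming = {}
    for s, a, t in edges:
        nodes.update((s, t))
        incoming.setdefault(t, []).append((a, s))

    # Initial partition: roots versus all other nodes.
    block = {v: int(v in roots) for v in nodes}
    while True:
        ids = {}
        refined = {}
        for v in nodes:
            # Signature: own block plus (label, block) of every predecessor.
            sig = (block[v],
                   frozenset((a, block[s]) for a, s in incoming.get(v, [])))
            refined[v] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break                   # no block was split: partition is stable
        block = refined

    extents = {}
    for v, b in block.items():
        extents.setdefault(b, set()).add(v)
    return {frozenset(e) for e in extents.values()}

# Hypothetical data graph (same shape as an employee/project fragment).
edges = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("root", "project", "p1"), ("root", "project", "p2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p3"), ("e2", "workson", "p3"),
]
blocks = one_index_blocks(edges, {"root"})
```

Because every data node lands in exactly one block, the extents are disjoint, which is the reason size(I) never exceeds size(DB).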
Analyzing 1-Indexes
• Do we have size(I) << size(DB) ? No. Two worst cases:
• Facts:
– in theory: except for these two DB's, size(I) << size(DB)
– in practice: it’s a different story. Experiments: size(I) 1/3 size(DB)
Evaluating Query Paths with 1-indexes
• Example: evaluate query path P x
– q(DB) is computed from q(I)
– Let q(I) = {s1, s2, …, sk}, where each si, 1 ≤ i ≤ k, is a node of I that satisfies query path P x
– q(DB) = extent(s1) ∪ extent(s2) ∪ … ∪ extent(sk)
Evaluating Query Paths with 1-indexes
• Example: query q = t.a x
• The evaluation of q follows two paths t.a in I rather than five in DB and unions their extents: {7,13} ∪ {8,10,12}
• The extents in strong data guide overlap, hence storage may be larger
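Evaluation on the index followed by a union of extents can be sketched as below. The index shape is hypothetical: the figure is not available here, so the node ids s0 to s4 and the extents of s0, s1, and s2 are invented, and only the two final extents {7,13} and {8,10,12} come from the slide. The function name eval_on_index is also an assumption.

```python
def eval_on_index(index_edges, index_root, extent, labels):
    """Follow a sequence of labels in the index, then union the extents.

    index_edges: dict mapping index node -> list of (label, index node).
    extent: dict mapping index node -> set of data node ids.
    """
    current = {index_root}
    for label in labels:
        current = {t for s in current
                   for a, t in index_edges.get(s, []) if a == label}
    result = set()
    for s in current:
        result |= extent[s]     # extents of a 1-index are disjoint
    return result

# Hypothetical index with two t.a paths.
index_edges = {
    "s0": [("t", "s1"), ("t", "s2")],
    "s1": [("a", "s3")],
    "s2": [("a", "s4")],
}
extent = {
    "s0": {1},
    "s1": {2, 4},
    "s2": {3, 5, 6},
    "s3": {7, 13},
    "s4": {8, 10, 12},
}
answer = eval_on_index(index_edges, "s0", extent, ["t", "a"])
```

Only the index is traversed; the data graph is touched just once, when the extents are read.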
2-Indexes
• Database: DB = (V, E, Roots)
• Queries: select x1, x2 from * x1 P x2, with P a regular path expression
• Template: * x1 P x2.
• Find: pairs of nodes (x1, x2)
• L(u,v): set of words on the paths between u and v
L(u,v) = {a1 … an | ∃ u -a1-> … -an-> v in DB}
(u,v) ≡ (u',v') ⇔ L(u,v) = L(u',v'), that is, they are indistinguishable by path queries of the form * x1 P x2 from the root.
2-Indexes
I2 =
Nodes(I) = { [(u,v)] | u,v ∈ Nodes(DB) }
Roots(I) = { [(u,u)] | u ∈ Nodes(DB) }
Edges(I) = { [(u,v)] -a-> [(u,v')] | (v -a-> v') ∈ Edges(DB) }
• Storing I2
– The graph
– Extent(s) = [(v,u)], for each node s representing the equivalence class [(v,u)]
• L(v,u)(DB) = L[(v,u)](I2)
– L(v,u)(DB) represents paths between v and u
– L[(v,u)](I2) represents the paths in the 2-index I2 between some root of the index and [(v,u)]
• Query evaluation
– To compute select x, y from * x P y, we compute the query path P y on I2 and take the union of the extents.
– This saves the * search, but may have to start at several roots in I2; there is only one root in the case of acyclic databases
2-Index: Example
• Cost: size(I) is O(n²)
• May be less in practice, similar to PAT trees (Patricia tree) for text databases
Conclusions
• work on structured text: relevant but restrictive
• trees are simple: XSet = Dataguides = 1-index (conceptually)
• 1-index: scales to cyclic data too
• more complex queries: 2-index, T-index