managing xml and semistructured data lecture : indexes

Managing XML and Semistructured Data

Lecture : Indexes

OEM vs. XML

• OEM's objects correspond to elements in XML.

• Sub-elements in XML are inherently ordered.

• XML elements may optionally include a list of attribute-value pairs.

• Graph structure (multiple incoming edges) is specified in XML with references (ID, IDREF attributes), e.g. the project attribute.

OEM to XML

• Example:

<member project="&5 &6">
  <name>Jones</name>
  <age>46</age>
  <office>
    <building>gates</building>
    <room>252</room>
  </office>
</member>

• This corresponds to the rightmost member in the example OEM, where project is an attribute.

Select x
From A.B x
Where exists y in x.C: y = 5

In this lecture

• Indexes

– XSet

– Region algebras

– Indexes for Arbitrary Semistructured Data

– Dataguides

– 1-2 indexes

Resources

• Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99

• XSet description: http://www.openhealth.org/XSet/

• Data on the Web, by Abiteboul, Buneman, Suciu: section 8.2

The problem

• Input: large, irregular data graph

• Output: index structure for evaluating regular path expressions

The Data

Semistructured data instance = a large graph

The queries

Regular expressions (using Lorel-like syntax):

SELECT X
FROM (Bib.*.author).(lastname|firstname).Abiteboul X

Select x
from part._*.supplier.name x

Requires traversing the data from the root and returning all nodes x reachable by a path matching the given path expression.

Select X
From part._*.supplier: {name: X, address: “Philadelphia”}

Needs an index on values to narrow the search to the parts of the database that contain the string “Philadelphia”.

Analyzing the problem

• what kind of data
– tree data (XML): easier to index
– graph data: used in more complex applications

• what kind of queries
– restricted regular expressions (e.g. XPath): may be more efficient

XSet: a simple index for XML

• Part of the Ninja project at Berkeley

• Example XML data:

XSet: a simple index for XML

Each node = a hashtable

Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation

(R1) SELECT X FROM part.name X -yes

(R2) SELECT X FROM part.supplier.name X -yes

(R3) SELECT X FROM *.supplier.name X -maybe

(R4) SELECT X FROM part.*.subpart.name X -maybe

• To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.

• R4: following part leads to h2; traverse all nodes in the index below h2 (corresponding to *), then continue with the path subpart.name. Thus, we explore the entire subtree dominated by h2.

• This is efficient if the index is small and fits in memory.

• R3: the leading wildcard forces us to consider all nodes in the index tree, resulting in less efficient computation than for R4.

• Can index the index itself: retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
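The evaluation strategy for R1 to R4 can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `IndexNode`, the functions `build` and `evaluate`, and the sample label paths are hypothetical and are not taken from the XSet implementation. Each index node is a hash table from labels to child index nodes, and `extent` stands in for the list of pointers to data nodes.

```python
class IndexNode:
    """One hash table of the index tree (hypothetical layout)."""
    def __init__(self):
        self.children = {}   # label -> IndexNode
        self.extent = []     # ids of data nodes reached by this label path


def build(paths):
    """paths: iterable of (tuple of labels, data node id)."""
    root = IndexNode()
    for labels, node_id in paths:
        node = root
        for label in labels:
            node = node.children.setdefault(label, IndexNode())
        node.extent.append(node_id)
    return root


def descendants_or_self(node):
    yield node
    for child in node.children.values():
        yield from descendants_or_self(child)


def evaluate(root, steps):
    """steps: list of labels; '*' matches any (possibly empty) label sequence."""
    frontier = [root]
    for step in steps:
        if step == "*":
            # wildcard: explore the whole subtree below every frontier node
            nxt = [d for n in frontier for d in descendants_or_self(n)]
        else:
            # label: a single hash-table lookup per frontier node
            nxt = [n.children[step] for n in frontier if step in n.children]
        seen, frontier = set(), []
        for n in nxt:
            if id(n) not in seen:
                seen.add(id(n))
                frontier.append(n)
    return sorted(x for n in frontier for x in n.extent)


# Made-up sample data: label path from the root -> data node id
index = build([
    (("part", "name"), 1),
    (("part", "supplier", "name"), 2),
    (("part", "subpart", "name"), 3),
    (("part", "subpart", "subpart", "name"), 4),
])
r1 = evaluate(index, ["part", "name"])
r3 = evaluate(index, ["*", "supplier", "name"])
r4 = evaluate(index, ["part", "*", "subpart", "name"])
```

R1 costs one lookup per step, while R3 and R4 visit every index node under the point where the wildcard occurs, which is why they are marked "maybe".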

Region Algebras

• Structured text = text with tags (like XML)

• New Oxford English Dictionary

• critical limitation: ordered data only (like text)

• Assume: data given as an XML text file, and implicit ordering in the file.

• less critical limitation: restricted regular expressions

Region Algebras: Definitions

• data = sequence of characters [c1 c2 c3 …]

• region = segment of the text in a file
– representation (x,y) = [cx, cx+1, …, cy], where x is the start position and y is the end position of the region
– example: <section> … </section>

• region set = a set of regions s.t. any two regions are either disjoint or one is included in the other
– example: all <section> regions (may be nested)
– Tree data: each node defines a region and each set of nodes defines a region set
– example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions

Representation of a region set

• Example: the <subpart> region set:

• region algebra = operators on region sets; s1 op s2 defines a new region set

Region algebra: some operators

• s1 intersect s2 = {r | r ∈ s1, r ∈ s2}

• s1 included s2 = {r | r ∈ s1, ∃r′ ∈ s2, r ⊆ r′}

• s1 including s2 = {r | r ∈ s1, ∃r′ ∈ s2, r ⊇ r′}

• s1 parent s2 = {r | r ∈ s1, ∃r′ ∈ s2, r is a parent of r′}

• s1 child s2 = {r | r ∈ s1, ∃r′ ∈ s2, r is a child of r′}

Examples:

<subpart> included <part> = { s1, s2, s3, s5}

<part> including <subpart> = {p2, p3}

<name> child <part> = {n1, n3, n12}
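The operators can be tried out on a small region set. The sketch below is an assumption-laden illustration, not the implementation behind the slides: regions are (start, end) tuples, all positions are made up, and the extra `universe` argument (the set of all regions in the document) is something introduced here so that "parent" can be decided as "closest enclosing region".

```python
def contains(outer, inner):
    """True if region inner lies within region outer (inner ⊆ outer)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersect(s1, s2):
    return sorted(set(s1) & set(s2))


def included(s1, s2):
    return sorted(r for r in set(s1) if any(contains(r2, r) for r2 in s2))


def including(s1, s2):
    return sorted(r for r in set(s1) if any(contains(r, r2) for r2 in s2))


def is_parent(r, r2, universe):
    """r is the parent of r2 if r encloses r2 with no region strictly between."""
    if r == r2 or not contains(r, r2):
        return False
    return not any(
        m not in (r, r2) and contains(r, m) and contains(m, r2)
        for m in universe
    )


def parent(s1, s2, universe):
    return sorted(r for r in set(s1)
                  if any(is_parent(r, r2, universe) for r2 in s2))


def child(s1, s2, universe):
    return sorted(r for r in set(s1)
                  if any(is_parent(r2, r, universe) for r2 in s2))


# Made-up document: two parts, the first with a nested chain of subparts
root = [(0, 100)]
part = [(1, 40), (41, 60)]
subpart = [(6, 30), (11, 25)]
name = [(2, 5), (7, 10), (12, 15), (42, 45)]
universe = root + part + subpart + name

names_of_parts = child(name, part, universe)
parts_with_subparts = including(part, subpart)
# part.*.subpart.name as a region expression
deep_names = child(name, included(subpart, child(part, root, universe)), universe)
```

Because every operator takes region sets and returns a region set, expressions nest freely, which is what the translation from path expressions relies on.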

From path expressions to region expressions

• Use region algebra operators to answer regular path expressions:

part.name → name child (part child root)
part.supplier.name → name child (supplier child (part child root))
*.supplier.name → name child supplier
part.*.subpart.name → name child (subpart included (part child root))

• Only restricted forms of regular path expressions can be translated into region algebra operators: expressions of the form R1.R2…Rn, where each Ri is either a label constant or the Kleene closure *.

Region expressions correspond to simple XPath expressions
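The four translations follow a simple left-to-right pattern, which can be written down as a small function. This is a sketch inferred from the four examples only; the function name `to_region_expr` is hypothetical, and cases the examples do not cover (such as a trailing `*`) are rejected rather than guessed.

```python
def to_region_expr(path):
    """Translate R1.R2...Rn (labels or '*') into a region algebra expression."""
    expr = "root"
    after_star = False
    for step in path.split("."):
        if step == "*":
            after_star = True
            continue
        # parenthesize the context when it is itself a compound expression
        operand = f"({expr})" if " " in expr else expr
        if after_star and expr == "root":
            expr = step                               # anywhere in the document
        elif after_star:
            expr = f"{step} included {operand}"       # any depth below the context
        else:
            expr = f"{step} child {operand}"          # directly below the context
        after_star = False
    if after_star:
        raise ValueError("a trailing * is not covered by this sketch")
    return expr


translated = to_region_expr("part.*.subpart.name")
```

A label directly after another label becomes `child`, and a label after `*` becomes `included`, since `*` allows any number of intermediate regions.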

From path expressions to region expressions

• Answering more complex queries:

Select X
From *.subpart: {name: X, *.supplier.address: “Philadelphia”}

• Translates into the following region algebra expression:

name child (subpart including (supplier parent (address intersect “Philadelphia”)))

• “Philadelphia” denotes a region set consisting of all regions corresponding to the word “Philadelphia” in the text.

• Such a region set can be computed dynamically using a full text index.

• Region expressions correspond to simple XPath expressions

Indexes for Arbitrary Semistructured Data

• A semistructured data instance that is a DAG

• The data represents employees and projects in a company.
• Two kinds of employees: programmers and statisticians
• Three kinds of links to projects: leads, workson, consults
• Index graph: a reduced graph that summarizes all paths from the root in the data graph
• Example: node p1. The paths from the root to p1 are labeled with the following five sequences:

Project
Employee.leads
Employee.workson
Programmer.employee.leads
Programmer.employee.workson

• Node p2: the paths from the root to p2 are labeled by the same five sequences
• p1 and p2 are language-equivalent


• For each node x in the data graph,

Lx = {w | ∃ a path from the root to x labeled w}

Note that Lx will be infinite if the graph has a cycle!

Two nodes x and y are language-equivalent if they have the same path language:

∀x,y: x ≡ y ⇔ Lx = Ly

Equivalence class of x: [x] = {y | x ≡ y}

The index graph I = (Nodes(I), Edges(I)):

Nodes(I) = {[x] | x ∈ nodes(G)}

Edges(I) = {[x] -a-> [y] | x′ ∈ [x], y′ ∈ [y], x′ -a-> y′ ∈ edges(G)}


• We have the following equivalences:

e1 ≡ e2
e3 ≡ e4 ≡ e5
p1 ≡ p2
p3 ≡ p4
p5 ≡ p6 ≡ p7


• Computing path expression queries
– Compute the query on I and obtain a set of index nodes
– Compute the union of their extents (an extent is a list of pointers to all data nodes in the equivalence class)

• Example:

Select X
From statistician.employee.(leads|consults): X

• Returns index nodes h8, h9.
• Their extents are [p5, p6, p7] and [p8], respectively
• result set = [p5, p6, p7, p8]
• Always: size(I) ≤ size(G)
• Efficient when I can be stored in main memory
• Checking x ≡ y is expensive.
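For acyclic data the whole construction fits in a short script. The sketch below is illustrative only: it enumerates Lx explicitly (feasible only for small DAGs, since Lx is infinite on cyclic data), groups nodes with equal Lx, and answers a plain label path by walking the index and taking the union of the extents. The function names and the tiny graph are made up and do not reproduce the figure from the lecture.

```python
from collections import defaultdict


def path_languages(edges, root):
    """Lx for every node x of a DAG, as a frozenset of label tuples."""
    incoming = defaultdict(list)
    nodes = {root}
    for src, label, dst in edges:
        incoming[dst].append((src, label))
        nodes.update((src, dst))
    memo = {}

    def lang(x):
        if x not in memo:
            words = {()} if x == root else set()
            for src, label in incoming[x]:
                words |= {w + (label,) for w in lang(src)}
            memo[x] = frozenset(words)
        return memo[x]

    return {x: lang(x) for x in nodes}


def build_index(edges, root):
    """Index nodes are equivalence classes, identified here by their language."""
    langs = path_languages(edges, root)
    extent = defaultdict(set)
    for x, lx in langs.items():
        extent[lx].add(x)
    index_edges = {(langs[s], a, langs[d]) for s, a, d in edges}
    return langs[root], index_edges, extent


def evaluate(index_root, index_edges, extent, labels):
    """Follow a label path on the index, then take the union of the extents."""
    frontier = {index_root}
    for a in labels:
        frontier = {d for s, b, d in index_edges if s in frontier and b == a}
    return sorted(x for cls in frontier for x in extent[cls])


# Made-up DAG: 6 data nodes collapse into 4 index nodes
edges = [
    ("r", "employee", "e1"), ("r", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p3"), ("e2", "workson", "p3"),
]
index_root, index_edges, extent = build_index(edges, "r")
leaders = evaluate(index_root, index_edges, extent, ["employee", "leads"])
```

Comparing whole languages is what makes the equivalence test expensive; the sketch hides that cost only because the example is tiny.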

DataGuides

• Goldman & Widom [VLDB 97]
– graph data
– arbitrary regular expressions

DataGuides

Definition

Given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique

Dataguides

Example:

DataGuides

• Multiple DataGuides for the same data:

DataGuides

Definition

Let w, w′ be two words (i.e. word queries) and G a graph

w ≡G w′ if w(G) = w′(G)

Definition

G is a strong dataguide for a database DB if ≡G is the same as ≡DB

DataGuides

Example:

• G1 is a strong dataguide

• G2 is not strong

person.project ≢DB dept.project

person.project ≡G2 dept.project

DataGuides

• Constructing the strong DataGuide G:

Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
    choose s in Nodes(G), a in Labels
    let s′ = {y | x in s, (x -a-> y) in Edges(DB)}
    if s′ is not empty: add s′ to Nodes(G), add (s -a-> s′) to Edges(G)

• Use hash table for Nodes(G)
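The loop above is a subset construction, and a worklist version is easy to write. The following is a minimal sketch under assumptions: the function name `strong_dataguide` and the sample edges are hypothetical, each DataGuide node is represented directly by its extent (a frozenset of data nodes), and a Python set plays the role of the hash table for Nodes(G).

```python
from collections import defaultdict


def strong_dataguide(edges, root):
    """Return (nodes, guide_edges); each DataGuide node is a frozenset extent."""
    succ = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        succ[src][label].add(dst)

    start = frozenset([root])
    nodes = {start}              # hash table for Nodes(G)
    guide_edges = set()
    worklist = [start]
    while worklist:
        s = worklist.pop()
        # collect, per label a, the set s' of all a-successors of nodes in s
        by_label = defaultdict(set)
        for x in s:
            for label, targets in succ.get(x, {}).items():
                by_label[label] |= targets
        for label, targets in by_label.items():
            t = frozenset(targets)
            guide_edges.add((s, label, t))
            if t not in nodes:
                nodes.add(t)
                worklist.append(t)
    return nodes, guide_edges


# Made-up data: p1 is reachable both as project and as employee.leads
edges = [
    ("r", "employee", "e1"), ("r", "employee", "e2"),
    ("r", "project", "p1"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p3"), ("e2", "workson", "p3"),
]
nodes, guide_edges = strong_dataguide(edges, "r")
```

In this example p1 appears in two extents, {p1} and {p1, p2}, so the extents of a strong DataGuide may overlap on graph data.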

DataGuides

• How large are the dataguides?
– if DB is a tree, then size(G) <= size(DB)
• why? answer: every node is in exactly one extent of G
• here: dataguide = XSet

Dataguides usually fail on data with cyclic schemas, like:

T-Indexes

• Milo & Suciu [ICDT 99]

• 1-index:
– data graph
– arbitrary regular expressions

• 2-index, T-index: for more complex queries, consisting of more regular expressions.

T-Indexes

• T-index: template index
– Trades space for generality
– The class of paths associated with a given T-index is specified by a path template
– Example 1: P x P y. Here P can be replaced by any regular expression.
– Example 2: (*.Restaurant) x P y. The first regular expression is fixed; this T-index takes less space but is less general.
– T-indexes can be generated efficiently.
– The size of a T-index associated with a single regular expression is at most linear in that of the database

1-Indexes

• Database: DB = (V, E, Roots), where V is a finite set of nodes, E is a set of labeled edges, and Roots is a set of root nodes.

• Regular path expressions P ::= | | ƒ | (P|P) | (P.P) | P.* where ƒ are formulas defined over predicates p1, p2,…on the set of data values.

• A path p = v0 -a1-> v1 -a2-> v2 … vn-1 -an-> vn

• Queries: regular path expressions q(DB)
• A query path is an expression of the form

P1 x1 P2 x2 … Pn xn, where the xi are variable names and the Pi are path expressions

• A query has the form Select x1, x2, …, xn from P1 x1 P2 x2 … Pn xn

1-Indexes

• Path template t = T1 x1 T2 x2 … Tn xn, where each Ti is a regular expression or P or F

• Instantiating query paths
– A query path q is obtained by instantiating P and F in template t by a regular path expression and some formula, respectively

– Example: path template t = (*.Restaurant) x1 P x2 Name x3 F x4

• Query path instantiations:

– q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4

– q2 = (*.Restaurant) x1 * x2 Name x3 _ x4 ( _ is a predicate with True)

– q3 = (*.Restaurant) x1 ( | _ ) x2 Name x3 Fridays x4

1-Indexes

• Goal: compute efficiently queries q ∈ inst(P x)

• A first attempt:

• Lu is the set of words labeling the paths from a root to u.

• That is, all the path queries that lead to u.

∀u ∈ V. Lu = {a1…an | v0 -a1-> … -an-> vn in DB, v0 ∈ Roots, vn = u}

∀u,v ∈ V. u ≡ v ⇔ Lu = Lv

That is, u and v are indistinguishable by path queries from the root.

∀u ∈ V. [u] = {v | u ≡ v} is the equivalence class containing u

1-Indexes

I = (Nodes(I), Edges(I), Roots(I)), where

Nodes(I) = { [u] | u ∈ nodes(DB) }
Edges(I) = { [u] -a-> [u′] | u ∈ [u], u′ ∈ [u′], (u -a-> u′) ∈ Edges(DB) }
Roots(I) = { [r] | r ∈ roots(DB) }

q(DB) = { u | [u] ∈ q(I), u ∈ [u] }

Example:

That is, there is an edge labeled a in the index between s and s′ if there is an edge labeled a between a node in s and a node in s′.

Inefficient: construction cost

Analyzing 1-Indexes

• Storing a 1-index
– Associate an oid s to each node in I
– Store graph I in standard form
– Store for each node s, extent(s)
• Extent(s) = [v], where s is the oid for [v]

• Always: size(I) <= size(DB) (unlike Dataguide)
• Always: can compute in O(n log n) time, n = size(DB)
• When DB is a tree
– 1-index = Dataguide = XSet

Analyzing 1-Indexes

• Do we have size(I) << size(DB)? No. Two worst cases:

• Facts:
– in theory: except for these two DBs, size(I) << size(DB)

– in practice: it’s a different story. Experiments: size(I) 1/3 size(DB)

Evaluating Query Paths with 1-indexes

• Example: evaluate query path P x
– Evaluate q on the index I rather than on DB

– Let s1, s2, …, sk be the nodes of I that satisfy the query path P x

– q(DB) = extent(s1) ∪ extent(s2) ∪ … ∪ extent(sk)

Evaluating Query Paths with 1-indexes

• Example: query q = t.a x

• The evaluation of q follows two paths t.a in I rather than five in DB and unions their extents: {7,13} ∪ {8,10,12}

• The extents in strong data guide overlap, hence storage may be larger

2-Indexes

• Database: DB = (V, E, Roots)
• Queries: select x1, x2 from * x1 P x2, with P a regular path expression
• Template: * x1 P x2
• Find: pairs of nodes (x1, x2)
• L(u,v) is the set of words on the paths from u to v:

L(u,v) = {a1 … an | u -a1-> … -an-> v in DB}

(u,v) ≡ (u′,v′) ⇔ L(u,v) = L(u′,v′), that is, they are indistinguishable by path queries of the form * x1 P x2.

2-Indexes

I2 = (Nodes(I), Edges(I), Roots(I)), where

Nodes(I) = { [(u,v)] | u,v ∈ Nodes(DB) }
Roots(I) = { [(u,u)] | u ∈ Nodes(DB) }
Edges(I) = { [(u,v)] -a-> [(u,v′)] | v -a-> v′ ∈ Edges(DB) }

• Storing I2
– The graph
– Extent(s) = [(v,u)], for each node s representing the equivalence class [(v,u)]

• L(v,u)(DB) = L[(v,u)](I2)
– L(v,u)(DB) represents the paths between v and u
– L[(v,u)](I2) represents the paths in the 2-index I2 between some root of the index and [(v,u)]

• Query evaluation
– To compute select x, y from * x P y, we compute the query path P y on I2 and take the union of the extents.
– This saves the * search, but evaluation may have to start at several roots in I2 (there is only one root in the case of acyclic databases).
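The same grouping idea, applied to pairs of nodes, gives a toy 2-index. This sketch is an illustration under strong assumptions: it works only on small acyclic data because it enumerates L(u,v) explicitly, it keeps only pairs (u,v) where v is reachable from u, and the names `pair_languages`, `build_2index` and `evaluate` are hypothetical. Queries are restricted to plain label sequences for P.

```python
from collections import defaultdict


def pair_languages(edges):
    """L(u,v) for every pair where v is reachable from u (DAG only)."""
    succ = defaultdict(list)
    nodes = set()
    for s, a, d in edges:
        succ[s].append((a, d))
        nodes.update((s, d))
    langs = defaultdict(set)

    def walk(u, v, word):
        langs[(u, v)].add(word)
        for a, d in succ.get(v, []):
            walk(u, d, word + (a,))

    for u in nodes:
        walk(u, u, ())
    return langs


def build_2index(edges):
    """Index nodes are classes of pairs, identified here by their language."""
    cls = {pair: frozenset(words) for pair, words in pair_languages(edges).items()}
    extent = defaultdict(set)
    for pair, c in cls.items():
        extent[c].add(pair)
    roots = {c for (u, v), c in cls.items() if u == v}
    index_edges = {
        (cls[(u, v)], a, cls[(u, d)])
        for (u, v) in cls for s, a, d in edges if s == v
    }
    return roots, index_edges, extent


def evaluate(roots, index_edges, extent, labels):
    """select x1, x2 from * x1 P x2, for P given as a list of labels."""
    frontier = set(roots)
    for a in labels:
        frontier = {d for s, b, d in index_edges if s in frontier and b == a}
    return sorted(pair for c in frontier for pair in extent[c])


# Made-up acyclic data
edges = [
    ("r", "employee", "e1"), ("r", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p3"), ("e2", "workson", "p3"),
]
roots, index_edges, extent = build_2index(edges)
lead_pairs = evaluate(roots, index_edges, extent, ["leads"])
```

On this acyclic example all pairs (u,u) share the language containing only the empty word, so the index has a single root and the leading * never has to be searched.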

2-Index: Example

• Cost: size(I) is O(n²)

• May be less in practice, similar to PAT trees (Patricia tree) for text databases

Conclusions

• work on structured text: relevant but restrictive

• trees are simple: XSet = Dataguides = 1-index (conceptually)

• 1-index: scales to cyclic data too

• more complex queries: 2-index, T-index
