master informatique 1 qiuyue wangxml data management structure indexes for xml

Master Informatique 1Qiuyue Wang XML Data Management

Structure Indexes for XML




Kinds of Indexes• Value Indexes

– index atomic values; e.g., data(//emp/salary)– use B+ trees (like in relational world)– (integration into query optimizer more tricky)

• Structure Indexes– materialize results of path expressions– (pendant to join indexes, path indices)

• Full Text indexes– Keyword search, inverted files– (IR world, text extenders)



Value Indexes: Open Questions

• What is the key of the index? (Physical Design)– singletons vs. sequences– string vs. typed-value

• which type? (even for homogeneous domains)• heterogeneous domain

– composite indexes

• Index for what comparison? (Physical Design)– =: virually impossible due to implicit cast + exists– eq, leq, …: problems with implicit casts

• When is a value index applicable? (Compiler)



Structure Index: Examples

• DataGuides• 1-Index• APEX• Index Fabric• ……



Semi-Structured Data Model

Object Exchange Model (OEM)

&r1

&c1 &c2

&s2 &s3 &s6 &s7

&s10

companycompany

nameaddress name

url

address

“Widget” “Trenton” “Gadget”

“www.gp.fr”

“Paris”

&p2&p1 &p3

&s0 &s1 &s4 &s5 &s8 &s9

personperson

person

“Smith”

nameposition name phonename

position

“Manager” “Jones” “5552121” “Dupont” “Sales”

&a1

&a2 &a3

&a4&a5

&a6&a7

description

description

procurement salesrep

contact

task

eval1997

1998

“on target”

“below target”



SS vs. XML Data Models

• Semi-Structured Data– Edge-labeled graph

• XML data– Node-labeled ordered tree



DataGuides

• Given a semistructured/XML database instance DB, a DataGuide for DB is a graph G such that:– Every label path in DB also occurs in G

• Complete coverage

– Every label path in G also occurs in DB• Accurate coverage (no bogus path)

– Every label path in G (starting from a particular object) is unique (i.e., G is a DFA)

• Efficient search: to process a label path of length n, just examine n nodes in G



DataGuide Example

12

13 14

15 16 17 18 19

NameEntree

Phone

OwnerManager

Restaurant Bar

13 ={2,3}

18=19={8}

14={4}

17={7}15={5,9} 16={6,10,11}

12={1}

Each node in the DataGuide canpoint to a set of database nodes



Multiple DataGuides for Same Data



Strong DataGuides

• Let p, p’ be two label path expressions and G a graph; define p ≡G p’. if p(G) = p’(G)

– That is, p and p’ are indistinguishable on G

• DG is a strong DataGuide for a database DB if the equivalence relations ≡DG and ≡DB are the same

• Example:G1 is strong; G2 is not

A.C(DB) = { 5 }, B.C(DB) = { 6, 7 }

A.C(G2) = { 20 }, B.C(G2) = { 20 }



Size of DataGuides

• If DB is a tree, then | G | ≤ | DB |– Linear construction time

• In the worst case, however, the size of a strong DataGuide may be exponential in | DB |



A First Attempt at 1-Index

• Equivalence relation ≡ on the nodes in DB: – u≡v if u and v are reachable by the exactly same set of paths starting

from the root.

• Index is also a graph (no bigger than DB)– Each index node corresponds to an equivalent class; it points to the

set of DB nodes in that equivalent class.– There is an index edge labeled e from s to s’. if there is a DB edge

labeled e from a node in s to a node in s’.

• Any accurate index should have at least this many nodes• Expensive to construct (PSPACE-complete)



1-Index

Idea: use simulation/bi-simulation instead of ≡• Stronger conditions finer equivalence

classes more index nodes• Simulation and bi-simulation are much easier

to compute (PTIME)– To be practical, still need

• External-memory construction algorithm• Incremental index update algorithm



Simulation

• Given two edge-labeled graphs G1, G2, a simulation is a binary relation on their nodes, denoted as ≤, s.t.,

– if x1 ≤ x2 and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label) such that y1 ≤ y2.

x1 x2

a

≤

G1 G2

y1

a≤

y2



Bisimulation• Given two edge-labeled graphs G1, G2, a

bisimulation is a relation between their nodes, denoted as , s.t. – if x1 x2 and (x1, a, y1) is an edge in G1, then

there exists an edge (x2, a, y2) in G2 (same label) such that y1 y2; and vice versa

• equivalence relation



Simulation/Bisimulation

• Two nodes u and v are bisimilar (u ≈b v) if they are related in some bisimulation

• Two nodes u and v are similar (u ≈s v) if there are two simulations ~ and ~’ s.t. u ~ v and v ~’ u

• Fact: u ≈b v ⇒ u ≈s v ⇒ u ≡ v– Why?



Computing a (Bi)Simulation

• The empty set is always a (bi)simulation• If R, R’ are (bi)simulations, so is R U R’. Hence, there

always exists a maximal (bi)simulation.• Computing the maximal (bi)simulation:

– start with R = nodes(G1) x nodes(G2)– while there exists (x1, x2) ∈ R that violates the definition,

remove (x1, x2) from R

• This runs in polynomial time O(mn)!– Better:

• O((m+n)log(m+n)) for bisimulation [Paige and Tarjan 87]• O(mn) for simulation [Henzinger, et a. 1995]



1-Index Example

(a) A data graph, (b) its 1-index, (c) its strong DataGuide



Analyzing 1-Index

• For a tree-structured DB, 1-indexes using ≈b, ≈s, ≡ are all identical to DataGuide

• Always: size(1-index) ≤ size(DB)– Unlike DataGuide– But we are back to NFA; is lookup time bounded?

• Always: can construct index in O(|DB| log|DB|)• Still need: external-memory construction algorithm and

incremental update algorithm• Designed to answer arbitrarily complex path expressions, but

such expressions may not show up often in queries



More on Graph Indexing

Graph indexing:1. Partition nodes into equivalence classes2. Store the extent of each equivalence class, use it as "pre-cooked"

answer to some queries

Equivalence notions:1. Reachable by some common paths: DataGuide [MW97]2. Reachable by exactly the same paths, or equivalently,

indistinguishable by any forward path expression: 1-index [MS99] 3. Indistinguishable by any (forward and backward) path expression: F&B

Index [ABS99,KBN+02]4. Indistinguishable by the (forward and backward) path expressions in

the set Q: covering index [KBN+02] 5. Indistinguishable by any path expression of length < k: A(k) index

[KSB+02]



Adaptive Path Indexing: APEX

• Most indexing work indexes all possible paths in the data, but few paths actually come up in queries.

• Index only the frequently used paths (mined from a query workload).

• Chung et al., “APEX: An Adaptive Path Index for XML Data”, SIGMOD 2002– Efficient processing of partial matching queries starting

with self-or-descendant axis. – Workload-aware path indexes.– Incremental update



Components of APEX• Graph structure: GAPEX

– Represents the structural summary of XML data with extents (a set of edges whose ending nodes are the result of a label path expression).

• Hash tree: HAPEX

– Keeps the information for frequently used paths and their corresponding nodes in GAPEX.



Example XML Data



Example APEXHAPEX GAPEX

Query = //actor/name

search the query in hash tree must in reverse order extent:

&1 = {<0,1>}&2 = {<1,2>, <1,4>, <16,2>}&3 = {<1,7>, <1,12>, <9,7>, <15,12>} …



Construction & Management of APEX



APEX0: Initial Index StructureHAPEX

GAPEX

&0

&1

&2

movieDB

actor

&3

name

&4@movie &5

movie

movie

actor

&9

@actor

&8

title

&6

@director

name

&7

director

movie

director

label xnode

xroot &0

movieDB &1

actor &2

name &3

@movie &4

movie &5

@director &6

dicrector &7

title &8

&actor &9



Frequently Used Path Extraction

Label count xnode next

xroot &0

A &1

B &2

C &3

D


xroot 0 &0

A 2 &1

B 0 &2

C 1 &3

D 2


B &4

Remainder &5


A 2 NULL

B 0 &4

Remainder &5

(a) Current state

(b) After frequency count

Qworkload={ A.D, C, A.D }

• Extend HAPEX with a count field.

• Simply counts all subsequences that appear in query workload.



Frequently Used Path Extraction


xroot 0 &0

A 2 &1

B 0 &2

C 1 &3

D 2


A 2 NULL

B 0 &4

Remainder NULL

minsup = 2

(c) After pruning

• if count < minsup then delete.• but if node is head-node then can't delete.• if has nodes deleted or created then

“Remainder” = NULL



Update APEX

• Basic idea– traverse the nodes in GAPEX.– update not only the structure of GAPEX with frequently

used paths but also the xnode field of entries in HAPEX.• Recursively calling updateAPEX(xnode, ΔEset, path);

– xnode: a node in GAPEX

– ΔEset: set of new edges added to the extent of the xnode– path: the incoming label path from the root to the xnode



Before Update&0

&1

&2

&4

&3

&5

A

B C

D

D

D


xroot &0

A &1

B &2

C &3

D


A NULL

Remainder NULLextent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&4: {<2,3>}&5: {<1,4>,<5,6>}


xroot &0

A &1

B &2

C &3

D

After identifying frequently used paths and pruning



Update APEX&0

&1

&2

&6

&3

&5

A

B C

D

D

D


xroot &0

A &1

B &2

C &3

D


A NULL

Remainder &6

extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&4: {<2,3>}&5: {<1,4>,<5,6>}&6: {<2,3>}



Update APEX&0

&1

&2

&6

&3

&7

A

B C

D

D

D


xroot &0

A &1

B &2

C &3

D


A &7

Remainder &6

extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&5: {<1,4>,<5,6>}&6: {<2,3>}&7: {<1,4>}

&5



After Update&0

&1

&2

&6

&3&7

A

B C

D

D

D


xroot &0

A &1

B &2

C &3

D


A &7

Remainder &6

extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&6: {<2,3>,<5,6>}&7: {<1,4>}



Index Fabric

• B.Cooper, N.Sample, M.Franklin, et al., “A Fast Index for Semistructured Data”, VLDB 2001

• Tree Structured Data• Conceptual similar to strong DataGuide• Layered structure• Use Patricia trie to index a large number of search

keys– The simple path of an element which has a data value is

encoded as a special character sequence– Keeps the key which is the combination of encoded

sequence and data value.



Encoding Paths w/Designators

ABC Corp.

123 ABC Way

17 Main St.

Goods Inc.

widget

thingy

jobber

Invoice as a tree

Invoice

Buyer Seller Itemlist

Name Address Item

ABC Corp. 123 ABC Way Goods Inc. 17 Main St.

widget thingy jobber

Name Address Item Item



Index Fabric• An index structure for long strings.

– Provides fast lookups– Handles long strings– Ideal substrate for designated keys

• Based on Patricia tries– Highly compressed string representation– Cost in index independent of string length– But, need to balance.



Patricia Tries

Indexes first point of difference between keys

greenbeans

greentea

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea

D. R. Morrison. “PATRICIA – Practical algorithm to retrieve information coded in alphanumeric.” J. ACM, 15 (1968) pp. 514-534



Layered Approach

• Index Fabric improves Patricia tries and make it balanced and optimized for disk-based access like B-tree

• Each query accesses the same number of layers• The index can have as many layers as necessary, the

highest layer always contains one block• The keys are stored very compactly, and blocks have

a very high out-degree. In practice, three layers is enough to store billions of keys



Balancing Patricia tries

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea




Step 1: divide trie into blocks

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea




Step 2: build another layer

g0

2

Layer 1 Layer 0

e

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea

“”

“gr”

“green”

“”



Two Kinds of Links

• Labeled far links– Like normal edges in a trie, but connects a node in

one layer to a subtrie in the lower layer

• Unlabeled direct links– Connects a node in one layer to a block with a

node representing the same prefix in the lower layer




Layer

0

Data

Search

Layer

0

Layer

1La

yer

1

Layer

2La

yer

2

Layer

3



Searching

Search for “greenbeans”

g0

2

Layer 1 Layer 0

e

greenbeans

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea



Searching

g0

2

Layer 1 Layer 0

e

greenbeans

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea




Searching

g0

2

Layer 1 Layer 0

e greenbeans

g c

r w

0

22

corn cow

a

2grass

5

e

b t

greenbeans greentea




Designators

• Designator– A unique special character or characters assign to

each tag in XML• Designator dictionary

– Maintain the mapping between designators and XML tags

• Insert the designator-encoded XML string into Index Fabric

• XML Tags in queries are translated into designators, and to form a search key over the Index Fabric



Raw Paths

• Index the hierarchical structure of the XML by encoding a root-to-leaf path as a string

• Treat attribute like tagged children, but use different designators to distinguish the same name tag and attribute

• Can use alternate designators to encode the ordering of tags in the XML documents



Example XML DataDoc 1:<invoice> <buyer> <name>ABC Corp</name> <address>1 Industrial Way</address> </buyer> <seller> <name>Acme Inc</name> <address>2 Acme Rd.</address> </seller> <item count=3>saw</item> <item count=2>drill</item></invoice>

Doc 2: <invoice> <buyer> <name>Oracle Inc</name> <phone>555-1212</phone> </buyer> <seller> <name>IBM Corp</name> </seller> <item> <count>4</count> <name>nail</name> </item></invoice>



Designators and Raw Paths

<invoce>=I<buyer>=B<name>=N<address>=A<seller>=S<item>=T<phone>=P<count>=CCount attribute = C´



Index Fabric



Refined Paths

• Refined paths are specialized paths through XML that optimize frequent access patterns

• Support queries that have wildcards, alternates, and different constants

• DBA decides which refined paths are appropriate• Both raw paths and refined paths are stored in the

same index and accessed using string lookup



Refined Paths• Optimize specific access paths

“ Find invoices where X sold to Y ”

“ Find invoices where X bought Y and Z”

“ Find invoices where a buyer bought X, Y and Z ”

X Y ABC Corp. Goods Inc.

XYZ Corp. Acme Inc.

ABC Corp. jobber widget

XYZ Corp. drill hammer X Y Z

X Y Z jobber thingy widget drill hammer nail



Refined Paths

• Two steps to create a refined path index:– The XML documents are parsed to extract

information matching the access pattern of the refined path;

– The information is encoded as an Index Fabric key and inserted into the index.



Summary of Index Fabric

• Index Fabric combines the advantages of both Patricia tries and B-trees– Scaling property of Patricia tries– Balanced and optimized for disk-based access like B-trees

• Index Fabric can be built over relational database– Leverage the maturity of relational database

• Not need a priori knowledge of the schema of data– Can handle XML data without DTD



Lineage of XML Indexing Schemes

master informatique 1 qiuyue wangxml data management structure indexes for xml

Documents