master informatique 1 qiuyue wangxml data management structure indexes for xml
TRANSCRIPT
Master Informatique 1Qiuyue Wang XML Data Management
Structure Indexes for XML
Structure Indexes for XML
Master Informatique 2Qiuyue Wang XML Data Management
Structure Indexes for XML
Kinds of Indexes• Value Indexes
– index atomic values; e.g., data(//emp/salary)– use B+ trees (like in relational world)– (integration into query optimizer more tricky)
• Structure Indexes– materialize results of path expressions– (pendant to join indexes, path indices)
• Full Text indexes– Keyword search, inverted files– (IR world, text extenders)
Master Informatique 3Qiuyue Wang XML Data Management
Structure Indexes for XML
Value Indexes: Open Questions
• What is the key of the index? (Physical Design)– singletons vs. sequences– string vs. typed-value
• which type? (even for homogeneous domains)• heterogeneous domain
– composite indexes
• Index for what comparison? (Physical Design)– =: virually impossible due to implicit cast + exists– eq, leq, …: problems with implicit casts
• When is a value index applicable? (Compiler)
Master Informatique 4Qiuyue Wang XML Data Management
Structure Indexes for XML
Structure Index: Examples
• DataGuides• 1-Index• APEX• Index Fabric• ……
Master Informatique 5Qiuyue Wang XML Data Management
Structure Indexes for XML
Semi-Structured Data Model
Object Exchange Model (OEM)
&r1
&c1 &c2
&s2 &s3 &s6 &s7
&s10
companycompany
nameaddress name
url
address
“Widget” “Trenton” “Gadget”
“www.gp.fr”
“Paris”
&p2&p1 &p3
&s0 &s1 &s4 &s5 &s8 &s9
personperson
person
“Smith”
nameposition name phonename
position
“Manager” “Jones” “5552121” “Dupont” “Sales”
&a1
&a2 &a3
&a4&a5
&a6&a7
description
description
procurement salesrep
contact
task
eval1997
1998
“on target”
“below target”
Master Informatique 6Qiuyue Wang XML Data Management
Structure Indexes for XML
SS vs. XML Data Models
• Semi-Structured Data– Edge-labeled graph
• XML data– Node-labeled ordered tree
Master Informatique 7Qiuyue Wang XML Data Management
Structure Indexes for XML
DataGuides
• Given a semistructured/XML database instance DB, a DataGuide for DB is a graph G such that:– Every label path in DB also occurs in G
• Complete coverage
– Every label path in G also occurs in DB• Accurate coverage (no bogus path)
– Every label path in G (starting from a particular object) is unique (i.e., G is a DFA)
• Efficient search: to process a label path of length n, just examine n nodes in G
Master Informatique 8Qiuyue Wang XML Data Management
Structure Indexes for XML
DataGuide Example
12
13 14
15 16 17 18 19
NameEntree
Phone
OwnerManager
Restaurant Bar
13 ={2,3}
18=19={8}
14={4}
17={7}15={5,9} 16={6,10,11}
12={1}
Each node in the DataGuide canpoint to a set of database nodes
Master Informatique 9Qiuyue Wang XML Data Management
Structure Indexes for XML
Multiple DataGuides for Same Data
Master Informatique 10Qiuyue Wang XML Data Management
Structure Indexes for XML
Strong DataGuides
• Let p, p’ be two label path expressions and G a graph; define p ≡G p’. if p(G) = p’(G)
– That is, p and p’ are indistinguishable on G
• DG is a strong DataGuide for a database DB if the equivalence relations ≡DG and ≡DB are the same
• Example:G1 is strong; G2 is not
A.C(DB) = { 5 }, B.C(DB) = { 6, 7 }
A.C(G2) = { 20 }, B.C(G2) = { 20 }
Master Informatique 11Qiuyue Wang XML Data Management
Structure Indexes for XML
Size of DataGuides
• If DB is a tree, then | G | ≤ | DB |– Linear construction time
• In the worst case, however, the size of a strong DataGuide may be exponential in | DB |
Master Informatique 12Qiuyue Wang XML Data Management
Structure Indexes for XML
A First Attempt at 1-Index
• Equivalence relation ≡ on the nodes in DB: – u≡v if u and v are reachable by the exactly same set of paths starting
from the root.
• Index is also a graph (no bigger than DB)– Each index node corresponds to an equivalent class; it points to the
set of DB nodes in that equivalent class.– There is an index edge labeled e from s to s’. if there is a DB edge
labeled e from a node in s to a node in s’.
• Any accurate index should have at least this many nodes• Expensive to construct (PSPACE-complete)
Master Informatique 13Qiuyue Wang XML Data Management
Structure Indexes for XML
1-Index
Idea: use simulation/bi-simulation instead of ≡• Stronger conditions finer equivalence
classes more index nodes• Simulation and bi-simulation are much easier
to compute (PTIME)– To be practical, still need
• External-memory construction algorithm• Incremental index update algorithm
Master Informatique 14Qiuyue Wang XML Data Management
Structure Indexes for XML
Simulation
• Given two edge-labeled graphs G1, G2, a simulation is a binary relation on their nodes, denoted as ≤, s.t.,
– if x1 ≤ x2 and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label) such that y1 ≤ y2.
x1 x2
a
≤
G1 G2
y1
a≤
y2
Master Informatique 15Qiuyue Wang XML Data Management
Structure Indexes for XML
Bisimulation• Given two edge-labeled graphs G1, G2, a
bisimulation is a relation between their nodes, denoted as , s.t. – if x1 x2 and (x1, a, y1) is an edge in G1, then
there exists an edge (x2, a, y2) in G2 (same label) such that y1 y2; and vice versa
• equivalence relation
Master Informatique 16Qiuyue Wang XML Data Management
Structure Indexes for XML
Simulation/Bisimulation
• Two nodes u and v are bisimilar (u ≈b v) if they are related in some bisimulation
• Two nodes u and v are similar (u ≈s v) if there are two simulations ~ and ~’ s.t. u ~ v and v ~’ u
• Fact: u ≈b v ⇒ u ≈s v ⇒ u ≡ v– Why?
Master Informatique 17Qiuyue Wang XML Data Management
Structure Indexes for XML
Computing a (Bi)Simulation
• The empty set is always a (bi)simulation• If R, R’ are (bi)simulations, so is R U R’. Hence, there
always exists a maximal (bi)simulation.• Computing the maximal (bi)simulation:
– start with R = nodes(G1) x nodes(G2)– while there exists (x1, x2) ∈ R that violates the definition,
remove (x1, x2) from R
• This runs in polynomial time O(mn)!– Better:
• O((m+n)log(m+n)) for bisimulation [Paige and Tarjan 87]• O(mn) for simulation [Henzinger, et a. 1995]
Master Informatique 18Qiuyue Wang XML Data Management
Structure Indexes for XML
1-Index Example
(a) A data graph, (b) its 1-index, (c) its strong DataGuide
Master Informatique 19Qiuyue Wang XML Data Management
Structure Indexes for XML
Analyzing 1-Index
• For a tree-structured DB, 1-indexes using ≈b, ≈s, ≡ are all identical to DataGuide
• Always: size(1-index) ≤ size(DB)– Unlike DataGuide– But we are back to NFA; is lookup time bounded?
• Always: can construct index in O(|DB| log|DB|)• Still need: external-memory construction algorithm and
incremental update algorithm• Designed to answer arbitrarily complex path expressions, but
such expressions may not show up often in queries
Master Informatique 20Qiuyue Wang XML Data Management
Structure Indexes for XML
More on Graph Indexing
Graph indexing:1. Partition nodes into equivalence classes2. Store the extent of each equivalence class, use it as "pre-cooked"
answer to some queries
Equivalence notions:1. Reachable by some common paths: DataGuide [MW97]2. Reachable by exactly the same paths, or equivalently,
indistinguishable by any forward path expression: 1-index [MS99] 3. Indistinguishable by any (forward and backward) path expression: F&B
Index [ABS99,KBN+02]4. Indistinguishable by the (forward and backward) path expressions in
the set Q: covering index [KBN+02] 5. Indistinguishable by any path expression of length < k: A(k) index
[KSB+02]
Master Informatique 21Qiuyue Wang XML Data Management
Structure Indexes for XML
Adaptive Path Indexing: APEX
• Most indexing work indexes all possible paths in the data, but few paths actually come up in queries.
• Index only the frequently used paths (mined from a query workload).
• Chung et al., “APEX: An Adaptive Path Index for XML Data”, SIGMOD 2002– Efficient processing of partial matching queries starting
with self-or-descendant axis. – Workload-aware path indexes.– Incremental update
Master Informatique 22Qiuyue Wang XML Data Management
Structure Indexes for XML
Components of APEX• Graph structure: GAPEX
– Represents the structural summary of XML data with extents (a set of edges whose ending nodes are the result of a label path expression).
• Hash tree: HAPEX
– Keeps the information for frequently used paths and their corresponding nodes in GAPEX.
Master Informatique 24Qiuyue Wang XML Data Management
Structure Indexes for XML
Example APEXHAPEX GAPEX
Query = //actor/name
search the query in hash tree must in reverse order extent:
&1 = {<0,1>}&2 = {<1,2>, <1,4>, <16,2>}&3 = {<1,7>, <1,12>, <9,7>, <15,12>} …
Master Informatique 25Qiuyue Wang XML Data Management
Structure Indexes for XML
Construction & Management of APEX
Master Informatique 26Qiuyue Wang XML Data Management
Structure Indexes for XML
APEX0: Initial Index StructureHAPEX
GAPEX
&0
&1
&2
movieDB
actor
&3
name
&4@movie &5
movie
movie
actor
&9
@actor
&8
title
&6
@director
name
&7
director
movie
director
label xnode
xroot &0
movieDB &1
actor &2
name &3
@movie &4
movie &5
@director &6
dicrector &7
title &8
&actor &9
Master Informatique 27Qiuyue Wang XML Data Management
Structure Indexes for XML
Frequently Used Path Extraction
Label count xnode next
xroot &0
A &1
B &2
C &3
D
Label count xnode next
xroot 0 &0
A 2 &1
B 0 &2
C 1 &3
D 2
Label count xnode next
B &4
Remainder &5
Label count xnode next
A 2 NULL
B 0 &4
Remainder &5
(a) Current state
(b) After frequency count
Qworkload={ A.D, C, A.D }
• Extend HAPEX with a count field.
• Simply counts all subsequences that appear in query workload.
Master Informatique 28Qiuyue Wang XML Data Management
Structure Indexes for XML
Frequently Used Path Extraction
Label count xnode next
xroot 0 &0
A 2 &1
B 0 &2
C 1 &3
D 2
Label count xnode next
A 2 NULL
B 0 &4
Remainder NULL
minsup = 2
(c) After pruning
• if count < minsup then delete.• but if node is head-node then can't delete.• if has nodes deleted or created then
“Remainder” = NULL
Master Informatique 29Qiuyue Wang XML Data Management
Structure Indexes for XML
Update APEX
• Basic idea– traverse the nodes in GAPEX.– update not only the structure of GAPEX with frequently
used paths but also the xnode field of entries in HAPEX.• Recursively calling updateAPEX(xnode, ΔEset, path);
– xnode: a node in GAPEX
– ΔEset: set of new edges added to the extent of the xnode– path: the incoming label path from the root to the xnode
Master Informatique 30Qiuyue Wang XML Data Management
Structure Indexes for XML
Before Update&0
&1
&2
&4
&3
&5
A
B C
D
D
D
Label count xnode next
xroot &0
A &1
B &2
C &3
D
Label count xnode next
A NULL
Remainder NULLextent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&4: {<2,3>}&5: {<1,4>,<5,6>}
Label count xnode next
xroot &0
A &1
B &2
C &3
D
After identifying frequently used paths and pruning
Master Informatique 31Qiuyue Wang XML Data Management
Structure Indexes for XML
Update APEX&0
&1
&2
&6
&3
&5
A
B C
D
D
D
Label count xnode next
xroot &0
A &1
B &2
C &3
D
Label count xnode next
A NULL
Remainder &6
extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&4: {<2,3>}&5: {<1,4>,<5,6>}&6: {<2,3>}
Master Informatique 32Qiuyue Wang XML Data Management
Structure Indexes for XML
Update APEX&0
&1
&2
&6
&3
&7
A
B C
D
D
D
Label count xnode next
xroot &0
A &1
B &2
C &3
D
Label count xnode next
A &7
Remainder &6
extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&5: {<1,4>,<5,6>}&6: {<2,3>}&7: {<1,4>}
&5
Master Informatique 33Qiuyue Wang XML Data Management
Structure Indexes for XML
After Update&0
&1
&2
&6
&3&7
A
B C
D
D
D
Label count xnode next
xroot &0
A &1
B &2
C &3
D
Label count xnode next
A &7
Remainder &6
extent&0: {<null,0>}&1: {<0,1>}&2: {<1,2>}&3: {<1,5>}&6: {<2,3>,<5,6>}&7: {<1,4>}
Master Informatique 34Qiuyue Wang XML Data Management
Structure Indexes for XML
Index Fabric
• B.Cooper, N.Sample, M.Franklin, et al., “A Fast Index for Semistructured Data”, VLDB 2001
• Tree Structured Data• Conceptual similar to strong DataGuide• Layered structure• Use Patricia trie to index a large number of search
keys– The simple path of an element which has a data value is
encoded as a special character sequence– Keeps the key which is the combination of encoded
sequence and data value.
Master Informatique 35Qiuyue Wang XML Data Management
Structure Indexes for XML
Encoding Paths w/Designators
ABC Corp.
123 ABC Way
17 Main St.
Goods Inc.
widget
thingy
jobber
Invoice as a tree
Invoice
Buyer Seller Itemlist
Name Address Item
ABC Corp. 123 ABC Way Goods Inc. 17 Main St.
widget thingy jobber
Name Address Item Item
Master Informatique 36Qiuyue Wang XML Data Management
Structure Indexes for XML
Index Fabric• An index structure for long strings.
– Provides fast lookups– Handles long strings– Ideal substrate for designated keys
• Based on Patricia tries– Highly compressed string representation– Cost in index independent of string length– But, need to balance.
Master Informatique 37Qiuyue Wang XML Data Management
Structure Indexes for XML
Patricia Tries
Indexes first point of difference between keys
greenbeans
greentea
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
D. R. Morrison. “PATRICIA – Practical algorithm to retrieve information coded in alphanumeric.” J. ACM, 15 (1968) pp. 514-534
Master Informatique 38Qiuyue Wang XML Data Management
Structure Indexes for XML
Layered Approach
• Index Fabric improves Patricia tries and make it balanced and optimized for disk-based access like B-tree
• Each query accesses the same number of layers• The index can have as many layers as necessary, the
highest layer always contains one block• The keys are stored very compactly, and blocks have
a very high out-degree. In practice, three layers is enough to store billions of keys
Master Informatique 39Qiuyue Wang XML Data Management
Structure Indexes for XML
Balancing Patricia tries
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
Master Informatique 40Qiuyue Wang XML Data Management
Structure Indexes for XML
Balancing Patricia tries
Step 1: divide trie into blocks
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
Master Informatique 41Qiuyue Wang XML Data Management
Structure Indexes for XML
Balancing Patricia tries
Step 2: build another layer
g0
2
Layer 1 Layer 0
e
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
“”
“gr”
“green”
“”
Master Informatique 42Qiuyue Wang XML Data Management
Structure Indexes for XML
Two Kinds of Links
• Labeled far links– Like normal edges in a trie, but connects a node in
one layer to a subtrie in the lower layer
• Unlabeled direct links– Connects a node in one layer to a block with a
node representing the same prefix in the lower layer
Master Informatique 43Qiuyue Wang XML Data Management
Structure Indexes for XML
Balancing Patricia tries
Layer
0
Data
Search
Layer
0
Layer
1La
yer
1
Layer
2La
yer
2
Layer
3
Master Informatique 44Qiuyue Wang XML Data Management
Structure Indexes for XML
Searching
Search for “greenbeans”
g0
2
Layer 1 Layer 0
e
greenbeans
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
Master Informatique 45Qiuyue Wang XML Data Management
Structure Indexes for XML
Searching
g0
2
Layer 1 Layer 0
e
greenbeans
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
Search for “greenbeans”
Master Informatique 46Qiuyue Wang XML Data Management
Structure Indexes for XML
Searching
g0
2
Layer 1 Layer 0
e greenbeans
g c
r w
0
22
corn cow
a
2grass
5
e
b t
greenbeans greentea
Search for “greenbeans”
Master Informatique 47Qiuyue Wang XML Data Management
Structure Indexes for XML
Designators
• Designator– A unique special character or characters assign to
each tag in XML• Designator dictionary
– Maintain the mapping between designators and XML tags
• Insert the designator-encoded XML string into Index Fabric
• XML Tags in queries are translated into designators, and to form a search key over the Index Fabric
Master Informatique 48Qiuyue Wang XML Data Management
Structure Indexes for XML
Raw Paths
• Index the hierarchical structure of the XML by encoding a root-to-leaf path as a string
• Treat attribute like tagged children, but use different designators to distinguish the same name tag and attribute
• Can use alternate designators to encode the ordering of tags in the XML documents
Master Informatique 49Qiuyue Wang XML Data Management
Structure Indexes for XML
Example XML DataDoc 1:<invoice> <buyer> <name>ABC Corp</name> <address>1 Industrial Way</address> </buyer> <seller> <name>Acme Inc</name> <address>2 Acme Rd.</address> </seller> <item count=3>saw</item> <item count=2>drill</item></invoice>
Doc 2: <invoice> <buyer> <name>Oracle Inc</name> <phone>555-1212</phone> </buyer> <seller> <name>IBM Corp</name> </seller> <item> <count>4</count> <name>nail</name> </item></invoice>
Master Informatique 50Qiuyue Wang XML Data Management
Structure Indexes for XML
Designators and Raw Paths
<invoce>=I<buyer>=B<name>=N<address>=A<seller>=S<item>=T<phone>=P<count>=CCount attribute = C´
Master Informatique 52Qiuyue Wang XML Data Management
Structure Indexes for XML
Refined Paths
• Refined paths are specialized paths through XML that optimize frequent access patterns
• Support queries that have wildcards, alternates, and different constants
• DBA decides which refined paths are appropriate• Both raw paths and refined paths are stored in the
same index and accessed using string lookup
Master Informatique 53Qiuyue Wang XML Data Management
Structure Indexes for XML
Refined Paths• Optimize specific access paths
“ Find invoices where X sold to Y ”
“ Find invoices where X bought Y and Z”
“ Find invoices where a buyer bought X, Y and Z ”
X Y ABC Corp. Goods Inc.
XYZ Corp. Acme Inc.
ABC Corp. jobber widget
XYZ Corp. drill hammer X Y Z
X Y Z jobber thingy widget drill hammer nail
Master Informatique 54Qiuyue Wang XML Data Management
Structure Indexes for XML
Refined Paths
• Two steps to create a refined path index:– The XML documents are parsed to extract
information matching the access pattern of the refined path;
– The information is encoded as an Index Fabric key and inserted into the index.
Master Informatique 55Qiuyue Wang XML Data Management
Structure Indexes for XML
Summary of Index Fabric
• Index Fabric combines the advantages of both Patricia tries and B-trees– Scaling property of Patricia tries– Balanced and optimized for disk-based access like B-trees
• Index Fabric can be built over relational database– Leverage the maturity of relational database
• Not need a priori knowledge of the schema of data– Can handle XML data without DTD