Master Informatique, Qiuyue Wang, XML Data Management

Structure Indexes for XML

Kinds of Indexes

• Value indexes
  – index atomic values; e.g., data(//emp/salary)
  – use B+ trees (as in the relational world)
  – (integration into the query optimizer is trickier)
• Structure indexes
  – materialize the results of path expressions
  – (counterpart to join indexes, path indices)
• Full-text indexes
  – keyword search, inverted files
  – (IR world, text extenders)

Value Indexes: Open Questions

• What is the key of the index? (Physical design)
  – singletons vs. sequences
  – string vs. typed value
    • which type? (even for homogeneous domains)
    • heterogeneous domains
  – composite indexes
• Index for which comparison? (Physical design)
  – =: virtually impossible due to implicit cast + exists
  – eq, leq, …: problems with implicit casts
• When is a value index applicable? (Compiler)

Structure Index: Examples

• DataGuides
• 1-Index
• APEX
• Index Fabric
• …

Semi-Structured Data Model

Object Exchange Model (OEM)

[Figure: an example OEM database drawn as an edge-labeled graph. Object identifiers include &r1, &c1, &c2, &p1 to &p3, &s0 to &s10 and &a1 to &a7; edge labels include company, person, name, address, url, position, phone, description, procurement, salesrep, contact, task, eval, 1997 and 1998; atomic values include “Widget”, “Trenton”, “Gadget”, “www.gp.fr”, “Paris”, “Smith”, “Manager”, “Jones”, “5552121”, “Dupont”, “Sales”, “on target” and “below target”.]

SS vs. XML Data Models

• Semi-structured data
  – Edge-labeled graph
• XML data
  – Node-labeled ordered tree

DataGuides

• Given a semistructured/XML database instance DB, a DataGuide for DB is a graph G such that:
  – Every label path in DB also occurs in G
    • Complete coverage
  – Every label path in G also occurs in DB
    • Accurate coverage (no bogus path)
  – Every label path in G (starting from a particular object) is unique (i.e., G is a DFA)
    • Efficient search: to process a label path of length n, just examine n nodes in G
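The three conditions above can be met by treating the data graph as an NFA and determinizing it. The following is a minimal sketch under that assumption, not the construction used by any particular system: each DataGuide node is the set of database nodes reached by one label path. The function names and the small restaurant graph are hypothetical.

```python
from collections import deque

def build_dataguide(edges, root):
    """edges: iterable of (source, label, target) triples of the data graph."""
    out = {}
    for src, label, dst in edges:
        out.setdefault(src, {}).setdefault(label, set()).add(dst)

    start = frozenset([root])
    guide = {start: {}}  # DataGuide node (a target set) -> {label: DataGuide node}
    queue = deque([start])
    while queue:
        targets = queue.popleft()
        # Group the outgoing edges of all objects in this target set by label.
        by_label = {}
        for obj in targets:
            for label, dsts in out.get(obj, {}).items():
                by_label.setdefault(label, set()).update(dsts)
        for label, dsts in by_label.items():
            nxt = frozenset(dsts)
            guide[targets][label] = nxt
            if nxt not in guide:
                guide[nxt] = {}
                queue.append(nxt)
    return start, guide

def lookup(start, guide, path):
    """Follow one DataGuide edge per label; returns the matching database nodes."""
    node = start
    for label in path:
        node = guide[node].get(label)
        if node is None:
            return frozenset()
    return node

# Hypothetical data graph (node ids and labels are made up for illustration).
edges = [
    (1, "Restaurant", 2), (1, "Restaurant", 3), (1, "Bar", 4),
    (2, "Name", 5), (2, "Entree", 6),
    (3, "Name", 9), (3, "Entree", 10), (3, "Entree", 11),
    (4, "Name", 7),
]
start, guide = build_dataguide(edges, root=1)
restaurant_names = lookup(start, guide, ["Restaurant", "Name"])
```

A lookup touches one DataGuide node per label, which is the "examine n nodes" property. On graph-shaped data the number of distinct target sets can grow much faster than the data itself.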

DataGuide Example

[Figure: a database graph and its DataGuide, with DataGuide nodes numbered 12 to 19; edge labels are Restaurant, Bar, Name, Entree, Phone, Owner and Manager.]

Each node in the DataGuide can point to a set of database nodes:

12 = {1}
13 = {2, 3}
14 = {4}
15 = {5, 9}
16 = {6, 10, 11}
17 = {7}
18 = 19 = {8}


Multiple DataGuides for Same Data

Strong DataGuides

• Let p, p’ be two label path expressions and G a graph; define p ≡_G p’ if p(G) = p’(G)
  – That is, p and p’ are indistinguishable on G
• DG is a strong DataGuide for a database DB if the equivalence relations ≡_DG and ≡_DB are the same
• Example: G1 is strong; G2 is not
  – A.C(DB) = { 5 }, B.C(DB) = { 6, 7 }
  – A.C(G2) = { 20 }, B.C(G2) = { 20 }

Size of DataGuides

• If DB is a tree, then | G | ≤ | DB |
  – Linear construction time
• In the worst case, however, the size of a strong DataGuide may be exponential in | DB |

A First Attempt at 1-Index

• Equivalence relation ≡ on the nodes in DB:
  – u ≡ v if u and v are reachable by exactly the same set of paths starting from the root.
• Index is also a graph (no bigger than DB)
  – Each index node corresponds to an equivalence class; it points to the set of DB nodes in that equivalence class.
  – There is an index edge labeled e from s to s’ if there is a DB edge labeled e from a node in s to a node in s’.
• Any accurate index should have at least this many nodes
• Expensive to construct (PSPACE-complete)

1-Index

Idea: use simulation/bisimulation instead of ≡
• Stronger conditions ⇒ finer equivalence classes ⇒ more index nodes
• Simulation and bisimulation are much easier to compute (PTIME)
  – To be practical, still need
    • an external-memory construction algorithm
    • an incremental index update algorithm

Simulation

• Given two edge-labeled graphs G1, G2, a simulation is a binary relation on their nodes, denoted ≤, such that:
  – if x1 ≤ x2 and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label) such that y1 ≤ y2.

[Figure: x1 ≤ x2, with an a-labeled edge from x1 to y1 in G1 matched by an a-labeled edge from x2 to y2 in G2, and y1 ≤ y2.]

Bisimulation

• Given two edge-labeled graphs G1, G2, a bisimulation is a relation between their nodes, denoted ≈, such that:
  – if x1 ≈ x2 and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label) such that y1 ≈ y2; and vice versa
• equivalence relation

Simulation/Bisimulation

• Two nodes u and v are bisimilar (u ≈b v) if they are related in some bisimulation
• Two nodes u and v are similar (u ≈s v) if there are two simulations ~ and ~’ s.t. u ~ v and v ~’ u
• Fact: u ≈b v ⇒ u ≈s v ⇒ u ≡ v
  – Why?

Computing a (Bi)Simulation

• The empty set is always a (bi)simulation
• If R, R’ are (bi)simulations, so is R ∪ R’. Hence, there always exists a maximal (bi)simulation.
• Computing the maximal (bi)simulation:
  – start with R = nodes(G1) × nodes(G2)
  – while there exists (x1, x2) ∈ R that violates the definition, remove (x1, x2) from R
• This runs in polynomial time O(mn)!
  – Better:
    • O((m+n) log(m+n)) for bisimulation [Paige and Tarjan 87]
    • O(mn) for simulation [Henzinger et al. 1995]
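The "start with everything, remove violators" procedure for the maximal simulation can be written down almost literally. This is a naive sketch that makes no attempt to reach the better bounds cited above; the function name and the two tiny graphs are made up for illustration.

```python
def maximal_simulation(nodes1, edges1, nodes2, edges2):
    """Largest relation R such that (x1, x2) in R and an edge x1 -a-> y1 in G1
    imply an edge x2 -a-> y2 in G2 with (y1, y2) in R."""
    out1, out2 = {}, {}
    for x, a, y in edges1:
        out1.setdefault(x, []).append((a, y))
    for x, a, y in edges2:
        out2.setdefault(x, []).append((a, y))

    # Start with R = nodes(G1) x nodes(G2).
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}
    changed = True
    while changed:
        changed = False
        for x1, x2 in list(rel):
            # Every edge of x1 must be matched by a same-label edge of x2.
            ok = all(
                any(b == a and (y1, y2) in rel for b, y2 in out2.get(x2, []))
                for a, y1 in out1.get(x1, [])
            )
            if not ok:
                rel.discard((x1, x2))
                changed = True
    return rel

# G1: u -a-> v        G2: p -a-> q, p -b-> r
rel = maximal_simulation(
    ["u", "v"], [("u", "a", "v")],
    ["p", "q", "r"], [("p", "a", "q"), ("p", "b", "r")],
)
```

Here v has no outgoing edges, so it is simulated by every node of G2, while u is simulated only by p.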


1-Index Example

(a) A data graph, (b) its 1-index, (c) its strong DataGuide

Analyzing 1-Index

• For a tree-structured DB, 1-indexes using ≈b, ≈s, ≡ are all identical to the DataGuide
• Always: size(1-index) ≤ size(DB)
  – Unlike DataGuide
  – But we are back to an NFA; is lookup time bounded?
• Always: can construct the index in O(|DB| log|DB|)
• Still need: external-memory construction algorithm and incremental update algorithm
• Designed to answer arbitrarily complex path expressions, but such expressions may not show up often in queries
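A 1-index built from bisimilarity can be sketched with naive partition refinement. The sketch assumes that bisimilarity is taken over incoming edges (because ≡ is defined by the paths from the root that reach a node) and that the root forms its own class. It is a simple polynomial-time loop rather than the O(|DB| log|DB|) algorithm; the names and the sample graph are hypothetical.

```python
def one_index(nodes, edges, root):
    """Group nodes by bisimilarity over incoming edges (naive refinement)."""
    incoming = {n: [] for n in nodes}
    for src, label, dst in edges:
        incoming[dst].append((label, src))

    # Initial partition: the root alone, everything else together.
    block = {n: (0 if n == root else 1) for n in nodes}
    while True:
        # Signature: current block plus the (label, parent block) pairs.
        sig = {
            n: (block[n], frozenset((a, block[p]) for a, p in incoming[n]))
            for n in nodes
        }
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in nodes}
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block

    extents = {}
    for n in nodes:
        extents.setdefault(block[n], set()).add(n)
    # An index edge s -e-> s' exists if some DB edge labeled e goes from s to s'.
    index_edges = {(block[s], a, block[d]) for s, a, d in edges}
    return extents, index_edges

nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "b", 4), (2, "c", 5)]
extents, index_edges = one_index(nodes, edges, root=0)
classes = sorted(sorted(e) for e in extents.values())
```

Every database node lands in exactly one extent, so the index never has more nodes than the data.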

More on Graph Indexing

Graph indexing:
1. Partition nodes into equivalence classes
2. Store the extent of each equivalence class, and use it as a "pre-cooked" answer to some queries

Equivalence notions:
1. Reachable by some common paths: DataGuide [MW97]
2. Reachable by exactly the same paths, or equivalently, indistinguishable by any forward path expression: 1-index [MS99]
3. Indistinguishable by any (forward and backward) path expression: F&B Index [ABS99, KBN+02]
4. Indistinguishable by the (forward and backward) path expressions in the set Q: covering index [KBN+02]
5. Indistinguishable by any path expression of length < k: A(k) index [KSB+02]

Adaptive Path Indexing: APEX

• Most indexing work indexes all possible paths in the data, but few paths actually come up in queries.
• Index only the frequently used paths (mined from a query workload).
• Chung et al., “APEX: An Adaptive Path Index for XML Data”, SIGMOD 2002
  – Efficient processing of partial matching queries starting with the self-or-descendant axis.
  – Workload-aware path indexes.
  – Incremental update.

Components of APEX

• Graph structure: GAPEX
  – Represents the structural summary of XML data with extents (an extent is a set of edges whose ending nodes are the result of a label path expression).
• Hash tree: HAPEX
  – Keeps the information for frequently used paths and their corresponding nodes in GAPEX.


Example XML Data

Example APEX

[Figure: HAPEX and GAPEX for the example XML data.]

Query = //actor/name
• The hash tree must be searched with the query's labels in reverse order.

Extents:
&1 = {<0,1>}
&2 = {<1,2>, <1,4>, <16,2>}
&3 = {<1,7>, <1,12>, <9,7>, <15,12>}
…


Construction & Management of APEX

APEX0: Initial Index Structure

[Figure: GAPEX with nodes &0 to &9 and edges labeled movieDB, actor, name, @movie, movie, @actor, title, @director and director.]

HAPEX:

label       xnode
xroot       &0
movieDB     &1
actor       &2
name        &3
@movie      &4
movie       &5
@director   &6
director    &7
title       &8
@actor      &9

Frequently Used Path Extraction

Qworkload = { A.D, C, A.D }

• Extend HAPEX with a count field.
• Simply count all subsequences that appear in the query workload.

(a) Current state

Label       count   xnode   next
xroot       -       &0      -
A           -       &1      -
B           -       &2      -
C           -       &3      -
D           -       -       (table below)

Label       count   xnode   next
B           -       &4      -
Remainder   -       &5      -

(b) After frequency count

Label       count   xnode   next
xroot       0       &0      -
A           2       &1      -
B           0       &2      -
C           1       &3      -
D           2       -       (table below)

Label       count   xnode   next
A           2       NULL    -
B           0       &4      -
Remainder   -       &5      -

Frequently Used Path Extraction

(c) After pruning (minsup = 2)

Label       count   xnode   next
xroot       0       &0      -
A           2       &1      -
B           0       &2      -
C           1       &3      -
D           2       -       (table below)

Label       count   xnode   next
A           2       NULL    -
B           0       &4      -
Remainder   -       NULL    -

• If count < minsup, delete the entry.
• An entry in the head node cannot be deleted.
• If any entry has been deleted or created, set the xnode of “Remainder” to NULL.
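The counting and pruning steps can be sketched as follows. The sketch assumes that "subsequences" means contiguous sub-paths of each query and that every occurrence is counted; it keeps all single labels regardless of their count, mirroring the rule that head-node entries are not deleted. The function names are hypothetical and the hash tree itself is not modeled.

```python
from collections import Counter

def count_subpaths(workload):
    """Count every contiguous sub-path of every query in the workload."""
    counts = Counter()
    for query in workload:
        labels = query.split(".")
        for i in range(len(labels)):
            for j in range(i + 1, len(labels) + 1):
                counts[".".join(labels[i:j])] += 1
    return counts

def frequent_paths(workload, all_labels, minsup):
    """Paths kept after pruning: all single labels plus paths with count >= minsup."""
    counts = count_subpaths(workload)
    keep = set(all_labels)  # single labels (head entries) are never pruned
    keep.update(path for path, c in counts.items() if c >= minsup)
    return keep

workload = ["A.D", "C", "A.D"]
counts = count_subpaths(workload)
kept = frequent_paths(workload, ["A", "B", "C", "D"], minsup=2)
```

With the workload from the slide, the counts are A = 2, B = 0, C = 1, D = 2 and A.D = 2, so A.D is the only multi-label path that survives minsup = 2.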

Update APEX

• Basic idea
  – Traverse the nodes in GAPEX.
  – Update not only the structure of GAPEX with frequently used paths but also the xnode field of entries in HAPEX.
• Recursively call updateAPEX(xnode, ΔEset, path):
  – xnode: a node in GAPEX
  – ΔEset: set of new edges added to the extent of the xnode
  – path: the incoming label path from the root to the xnode

Before Update

[Figure: GAPEX with nodes &0 to &5 and edges labeled A, B, C, D, D, D.]

HAPEX after identifying frequently used paths and pruning:

Label       count   xnode   next
xroot       -       &0      -
A           -       &1      -
B           -       &2      -
C           -       &3      -
D           -       -       (table below)

Label       count   xnode   next
A           -       NULL    -
Remainder   -       NULL    -

Extents:
&0: {<null,0>}
&1: {<0,1>}
&2: {<1,2>}
&3: {<1,5>}
&4: {<2,3>}
&5: {<1,4>, <5,6>}

Update APEX

[Figure: GAPEX with nodes &0, &1, &2, &3, &5 and &6, and edges labeled A, B, C, D, D, D.]

Label       count   xnode   next
xroot       -       &0      -
A           -       &1      -
B           -       &2      -
C           -       &3      -
D           -       -       (table below)

Label       count   xnode   next
A           -       NULL    -
Remainder   -       &6      -

Extents:
&0: {<null,0>}
&1: {<0,1>}
&2: {<1,2>}
&3: {<1,5>}
&4: {<2,3>}
&5: {<1,4>, <5,6>}
&6: {<2,3>}

Update APEX

[Figure: GAPEX with nodes &0, &1, &2, &3, &5, &6 and &7, and edges labeled A, B, C, D, D, D.]

Label       count   xnode   next
xroot       -       &0      -
A           -       &1      -
B           -       &2      -
C           -       &3      -
D           -       -       (table below)

Label       count   xnode   next
A           -       &7      -
Remainder   -       &6      -

Extents:
&0: {<null,0>}
&1: {<0,1>}
&2: {<1,2>}
&3: {<1,5>}
&5: {<1,4>, <5,6>}
&6: {<2,3>}
&7: {<1,4>}

After Update

[Figure: GAPEX with nodes &0, &1, &2, &3, &6 and &7, and edges labeled A, B, C, D, D, D.]

Label       count   xnode   next
xroot       -       &0      -
A           -       &1      -
B           -       &2      -
C           -       &3      -
D           -       -       (table below)

Label       count   xnode   next
A           -       &7      -
Remainder   -       &6      -

Extents:
&0: {<null,0>}
&1: {<0,1>}
&2: {<1,2>}
&3: {<1,5>}
&6: {<2,3>, <5,6>}
&7: {<1,4>}
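The reverse-order lookup in the hash tree can be illustrated on the final state above. This is a sketch under assumptions, not the published algorithm: the nested dictionary stands in for HAPEX, the `"remainder"` key and the `exact` flag are hypothetical, and a result taken from a Remainder entry (or from a shorter indexed suffix) is treated as a candidate set that still has to be checked against the full path.

```python
REMAINDER = "remainder"  # hypothetical key for the "Remainder" entry

# HAPEX after the update: label -> xnode, or a nested table reached via "next".
hapex = {
    "xroot": "&0", "A": "&1", "B": "&2", "C": "&3",
    "D": {"A": "&7", REMAINDER: "&6"},
}
extent = {
    "&0": {(None, 0)}, "&1": {(0, 1)}, "&2": {(1, 2)}, "&3": {(1, 5)},
    "&6": {(2, 3), (5, 6)}, "&7": {(1, 4)},
}

def xnodes_below(entry):
    """All xnodes stored in an entry or in the tables nested under it."""
    if isinstance(entry, str):
        return {entry}
    result = set()
    for child in entry.values():
        result |= xnodes_below(child)
    return result

def lookup(path):
    """Resolve //l1/.../ln by reading the labels from last to first.
    Returns (xnodes, exact); exact is False when the xnodes only give candidates."""
    labels = path.split(".")
    table = hapex
    for pos, label in enumerate(reversed(labels)):
        entry = table.get(label)
        if entry is None:
            if table is not hapex and REMAINDER in table:
                return {table[REMAINDER]}, False
            return set(), True  # unknown label: no match
        if isinstance(entry, str):
            # Exact only if every label of the query has been consumed.
            return {entry}, pos == len(labels) - 1
        table = entry
    return xnodes_below(table), True

ad_nodes, ad_exact = lookup("A.D")
d_nodes, d_exact = lookup("D")
bd_nodes, bd_exact = lookup("B.D")
```

For //A/D the lookup lands on &7, whose extent {<1,4>} is the answer. For //B/D it falls through to the Remainder entry &6, whose extent also contains the C.D edge <5,6>, so a further check is needed.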

Index Fabric

• B. Cooper, N. Sample, M. Franklin, et al., “A Fast Index for Semistructured Data”, VLDB 2001
• Tree-structured data
• Conceptually similar to the strong DataGuide
• Layered structure
• Uses a Patricia trie to index a large number of search keys
  – The simple path of an element that has a data value is encoded as a special character sequence
  – Each key is the combination of the encoded sequence and the data value

Encoding Paths w/Designators

[Figure: an invoice as a tree. Invoice has children Buyer, Seller and Itemlist. Buyer has Name “ABC Corp.” and Address “123 ABC Way”; Seller has Name “Goods Inc.” and Address “17 Main St.”; Itemlist has three Item children: “widget”, “thingy” and “jobber”.]

Index Fabric

• An index structure for long strings
  – Provides fast lookups
  – Handles long strings
  – Ideal substrate for designated keys
• Based on Patricia tries
  – Highly compressed string representation
  – Cost in the index is independent of string length
  – But needs balancing

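Patricia Tries

Indexes the first point of difference between keys (e.g., greenbeans and greentea)

[Figure: Patricia trie over the keys corn, cow, grass, greenbeans and greentea. The root tests character position 0, with branches “g” and “c”. Under “c”, a node tests position 2: “r” leads to corn, “w” to cow. Under “g”, a node tests position 2: “a” leads to grass, “e” to a node that tests position 5, where “b” leads to greenbeans and “t” to greentea.]

D. R. Morrison. “PATRICIA – Practical algorithm to retrieve information coded in alphanumeric.” J. ACM, 15 (1968) pp. 514-534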
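A small in-memory sketch shows how such a trie stores only the positions at which keys first differ. It assumes that no key is a prefix of another key (a terminator character would lift that restriction) and represents inner nodes as `(position, children)` tuples; this representation is chosen for illustration, not taken from the paper.

```python
def build(keys):
    """Build a Patricia-style trie; assumes no key is a prefix of another."""
    keys = sorted(set(keys))
    if len(keys) == 1:
        return keys[0]  # leaf: the full key
    # Find the first character position at which the keys differ.
    pos = 0
    while len({k[pos] for k in keys}) == 1:
        pos += 1
    groups = {}
    for k in keys:
        groups.setdefault(k[pos], []).append(k)
    return (pos, {ch: build(group) for ch, group in groups.items()})

def search(node, key):
    """Test only the indexed positions, then verify against the stored key."""
    while not isinstance(node, str):
        pos, children = node
        if pos >= len(key) or key[pos] not in children:
            return False
        node = children[key[pos]]
    return node == key  # skipped characters are checked here

trie = build(["greenbeans", "greentea", "grass", "corn", "cow"])
found = search(trie, "greentea")
```

The number of nodes visited depends on the number of branching positions rather than on the key length; the final comparison against the stored key is what catches mismatches in the skipped characters.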

Layered Approach

• Index Fabric improves on Patricia tries, making them balanced and optimized for disk-based access like B-trees
• Each query accesses the same number of layers
• The index can have as many layers as necessary; the highest layer always contains one block
• The keys are stored very compactly, and blocks have a very high out-degree. In practice, three layers are enough to store billions of keys

Balancing Patricia tries

Step 1: divide the trie into blocks

Step 2: build another layer

[Figure: the Patricia trie over corn, cow, grass, greenbeans and greentea forms layer 0 and is divided into blocks; a smaller trie in layer 1 indexes those blocks. Prefix labels shown in the figure: “”, “gr”, “green”.]

Two Kinds of Links

• Labeled far links
  – Like normal edges in a trie, but connect a node in one layer to a subtrie in the lower layer
• Unlabeled direct links
  – Connect a node in one layer to a block with a node representing the same prefix in the lower layer

Balancing Patricia tries

[Figure: layers 3, 2, 1 and 0 stacked above the data, with a search passing through the layers.]

Searching

Search for “greenbeans”

[Figure, shown in three animation steps: the search for “greenbeans” proceeds from the layer 1 trie into the layer 0 trie over corn, cow, grass, greenbeans and greentea, ending at the key greenbeans.]

Designators

• Designator
  – A unique special character (or characters) assigned to each tag in the XML
• Designator dictionary
  – Maintains the mapping between designators and XML tags
• Insert the designator-encoded XML string into the Index Fabric
• XML tags in queries are translated into designators to form a search key over the Index Fabric

Raw Paths

• Index the hierarchical structure of the XML by encoding a root-to-leaf path as a string
• Treat attributes like tagged children, but use different designators to distinguish a tag and an attribute with the same name
• Can use alternate designators to encode the ordering of tags in the XML documents

Example XML Data

Doc 1:
<invoice>
  <buyer>
    <name>ABC Corp</name>
    <address>1 Industrial Way</address>
  </buyer>
  <seller>
    <name>Acme Inc</name>
    <address>2 Acme Rd.</address>
  </seller>
  <item count="3">saw</item>
  <item count="2">drill</item>
</invoice>

Doc 2:
<invoice>
  <buyer>
    <name>Oracle Inc</name>
    <phone>555-1212</phone>
  </buyer>
  <seller>
    <name>IBM Corp</name>
  </seller>
  <item>
    <count>4</count>
    <name>nail</name>
  </item>
</invoice>

Designators and Raw Paths

<invoice> = I
<buyer> = B
<name> = N
<address> = A
<seller> = S
<item> = T
<phone> = P
<count> = C
count attribute = C´
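The mapping above is enough to sketch how raw-path keys might be produced from a document. The example below uses Python's standard `xml.etree.ElementTree`; the key layout (designators, one space, then the value), the ASCII `C'` standing in for C´, and the function name are assumptions made for illustration, not the encoding used by the Index Fabric itself.

```python
import xml.etree.ElementTree as ET

TAG_DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N", "address": "A",
                   "seller": "S", "item": "T", "phone": "P", "count": "C"}
ATTR_DESIGNATORS = {"count": "C'"}  # attribute gets its own designator

def raw_path_keys(xml_text):
    """Return one key per data value: designator-encoded path plus the value."""
    keys = []

    def walk(elem, prefix):
        path = prefix + TAG_DESIGNATORS[elem.tag]
        # Attributes are treated like tagged children with distinct designators.
        for name, value in elem.attrib.items():
            keys.append(path + ATTR_DESIGNATORS[name] + " " + value)
        text = (elem.text or "").strip()
        if text:
            keys.append(path + " " + text)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return keys

doc2 = """<invoice>
  <buyer><name>Oracle Inc</name><phone>555-1212</phone></buyer>
  <seller><name>IBM Corp</name></seller>
  <item><count>4</count><name>nail</name></item>
</invoice>"""
doc2_keys = raw_path_keys(doc2)
attr_keys = raw_path_keys('<invoice><item count="3">saw</item></invoice>')
```

Because the count element and the count attribute use different designators, `ITC 4` and `ITC' 3` remain distinguishable in the index.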


Index Fabric

Refined Paths

• Refined paths are specialized paths through XML that optimize frequent access patterns
• Support queries that have wildcards, alternates, and different constants
• The DBA decides which refined paths are appropriate
• Both raw paths and refined paths are stored in the same index and accessed using string lookup

Refined Paths

• Optimize specific access paths

“Find invoices where X sold to Y”

X           Y
ABC Corp.   Goods Inc.
XYZ Corp.   Acme Inc.

“Find invoices where X bought Y and Z”

X           Y        Z
ABC Corp.   jobber   widget
XYZ Corp.   drill    hammer

“Find invoices where a buyer bought X, Y and Z”

X        Y        Z
jobber   thingy   widget
drill    hammer   nail

Refined Paths

• Two steps to create a refined path index:
  – The XML documents are parsed to extract information matching the access pattern of the refined path;
  – The information is encoded as an Index Fabric key and inserted into the index.
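The two steps can be sketched for the pattern “X sold to Y”. Everything specific here is an assumption: the designator `Z`, the NUL separator, the function names, and the sorted Python list that stands in for the Index Fabric so that a prefix lookup can be shown with `bisect`.

```python
import bisect
import xml.etree.ElementTree as ET

REFINED = "Z"   # hypothetical designator for the refined path "X sold to Y"
SEP = "\x00"    # hypothetical separator between the two values

def refined_key(invoice_xml):
    """Step 1: extract seller and buyer; step 2: encode them as one key."""
    invoice = ET.fromstring(invoice_xml)
    seller = invoice.findtext("seller/name")
    buyer = invoice.findtext("buyer/name")
    return REFINED + seller + SEP + buyer

def sold_by(sorted_keys, seller):
    """Prefix lookup: all buyers Y such that the given seller sold to Y."""
    prefix = REFINED + seller + SEP
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\U0010ffff")
    return [key[len(prefix):] for key in sorted_keys[lo:hi]]

docs = [
    "<invoice><buyer><name>ABC Corp</name></buyer>"
    "<seller><name>Acme Inc</name></seller></invoice>",
    "<invoice><buyer><name>Oracle Inc</name></buyer>"
    "<seller><name>IBM Corp</name></seller></invoice>",
]
keys = sorted(refined_key(d) for d in docs)
buyers = sold_by(keys, "Acme Inc")
```

The query becomes a single string lookup, which is the same access method used for raw paths.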

Summary of Index Fabric

• Index Fabric combines the advantages of both Patricia tries and B-trees
  – Scaling property of Patricia tries
  – Balanced and optimized for disk-based access like B-trees
• Index Fabric can be built over a relational database
  – Leverages the maturity of relational databases
  – No a priori knowledge of the data's schema is needed
  – Can handle XML data without a DTD


Lineage of XML Indexing Schemes