Indexing and Querying XML Data for Regular Path Expressions

Quanzhong Li and Bongki Moon
Dept. of Computer Science, University of Arizona
VLDB 2001.

Querying XML

XML has a tree-structured data model.
Queries involve navigating the data using regular path expressions (e.g., XPath).
e.g. /chapter/_*/figure[@caption=“Tree Frogs”]
Such queries require:
Accessing all elements with the same name string.
Determining the ancestor-descendant relationship between elements.

Contribution

New system for indexing XML data.
Querying XML data based on a numbering scheme for elements.
Join algorithms for processing complex regular path expressions.

Outline

Numbering scheme
Index structure
Join algorithms
Experimental results

Path expression evaluation

Previous approaches
    Conventional tree traversals.
    Disadvantage: overhead of traversing for long or unknown path lengths.
New approach
    Indexing for efficient element access.
    Numbering scheme for the ancestor-descendant relationship.

Dietz’s Numbering Scheme

For two given nodes x and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the postorder traversal.

Example tree, each node labeled (preorder, postorder):

(1,7)
    (2,4)
        (3,1)
        (4,2)
        (5,3)
    (6,6)
        (7,5)
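The labels in the example tree can be reproduced with a single depth-first walk. The following is a minimal sketch, not code from the paper: the `Node` class, the node names, and the helper names are assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list[Node] = field(default_factory=list)
    pre: int = 0   # preorder rank, filled in by number_tree
    post: int = 0  # postorder rank, filled in by number_tree


def number_tree(root: Node) -> None:
    """Assign 1-based preorder and postorder ranks in one depth-first walk."""
    pre_counter = 0
    post_counter = 0

    def visit(node: Node) -> None:
        nonlocal pre_counter, post_counter
        pre_counter += 1
        node.pre = pre_counter        # numbered on the way down
        for child in node.children:
            visit(child)
        post_counter += 1
        node.post = post_counter      # numbered on the way back up

    visit(root)


def is_ancestor(x: Node, y: Node) -> bool:
    """Dietz's test: x precedes y in preorder and follows y in postorder."""
    return x.pre < y.pre and x.post > y.post


# Same shape as the example tree (node names are hypothetical).
leaves = [Node("c"), Node("d"), Node("e")]
a = Node("a", leaves)
f = Node("f")
b = Node("b", [f])
root = Node("root", [a, b])
number_tree(root)
labels = [(n.pre, n.post) for n in (root, a, *leaves, b, f)]
```

Inserting a node shifts the ranks of every node after it in either traversal, which is the update cost that motivates a numbering scheme with spare room.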

Proposed numbering scheme

This associates with each node a pair of numbers <order, size> as follows:

For a tree node y and its parent x,
    order(x) < order(y) and
    order(y) + size(y) ≤ order(x) + size(x)

For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then
    order(x) + size(x) < order(y)

Example tree, each node labeled (order, size):

(1,100)
    (10,30)
        (11,5)
        (17,5)
        (25,5)
    (41,10)
        (45,5)

Advantages

Efficient updates
• Extra space can be reserved to accommodate future insertions.
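One way to reserve space is to pad every node's size with unused numbers while labeling. The sketch below is an assumption about how such an assignment could be done, not the loader described in the talk: the `slack` parameter, the `Node` class, and the `satisfies_scheme` checker are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list[Node] = field(default_factory=list)
    order: int = 0
    size: int = 0


def assign(node: Node, order: int, slack: int) -> None:
    """Label the subtree rooted at node, leaving `slack` unused numbers per node."""
    node.order = order
    next_free = order + 1
    for child in node.children:
        assign(child, next_free, slack)
        # The next sibling starts strictly after the previous sibling's range.
        next_free = child.order + child.size + 1
    # Cover every number used below this node, plus the reserved slack.
    node.size = (next_free - 1 - order) + slack


def satisfies_scheme(node: Node) -> bool:
    """Check the parent-child and sibling conditions of the numbering scheme."""
    previous = None
    for child in node.children:
        if not node.order < child.order:
            return False
        if not child.order + child.size <= node.order + node.size:
            return False
        if previous is not None and not previous.order + previous.size < child.order:
            return False
        if not satisfies_scheme(child):
            return False
        previous = child
    return True


# Hypothetical seven-node tree; slack of 2 unused numbers per node.
a = Node("a", [Node("c"), Node("d"), Node("e")])
b = Node("b", [Node("f")])
root = Node("root", [a, b])
assign(root, 1, slack=2)
```

With `slack=2`, node `a` is labeled <2,11> while its last child ends at 11, so a new child can later take order 12 without relabeling any existing node.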

Ancestor–descendant relationship

For two given nodes x and y of a tree T, x is an ancestor of y if and only if
    order(x) < order(y) ≤ order(x) + size(x).
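The test is a constant-time comparison of two labels. A minimal sketch using the (order, size) labels from the example tree of the proposed numbering scheme; the `Label` type and the variable names are assumptions for illustration.

```python
from typing import NamedTuple


class Label(NamedTuple):
    order: int
    size: int


def is_ancestor(x: Label, y: Label) -> bool:
    """x is an ancestor of y iff order(x) < order(y) <= order(x) + size(x)."""
    return x.order < y.order <= x.order + x.size


# Labels taken from the example tree.
root = Label(1, 100)
left = Label(10, 30)
right = Label(41, 10)
left_leaf = Label(25, 5)
right_leaf = Label(45, 5)

checks = [
    is_ancestor(root, left_leaf),   # 1 < 25 <= 101
    is_ancestor(left, left_leaf),   # 10 < 25 <= 40
    is_ancestor(right, left_leaf),  # 41 < 25 fails
    is_ancestor(left, right_leaf),  # 45 <= 40 fails
    is_ancestor(right, right_leaf), # 41 < 45 <= 51
]
```

No tree traversal is needed: two elements fetched from an index can be compared directly.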

Index and Data Organization

[Figure: XISS architecture. Components shown: XML Raw Data, Document Loader, Element Index, Attribute Index, Structure Index, Name Index, Value Table, Paged File, and Query Processor (Query in, Result out).]

Element Index

[Figure: Element Index. A B+-tree keyed on element nid leads to a document ID list. Each document entry leads to a B+-tree over the list of elements with the same name in the same document. Each element record holds <order, size>, depth, and parent ID.]
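A rough in-memory analogue of this index can show why keeping the element records sorted by order helps. This sketch swaps in a dictionary and sorted lists for the B+-trees and keys on the element name string rather than an nid; the class and method names are hypothetical, not part of XISS.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class ElementRecord(NamedTuple):
    order: int
    size: int
    depth: int = 0      # illustrative default
    parent_id: int = 0  # illustrative default


class ElementIndex:
    """name -> document id -> element records sorted by order."""

    def __init__(self) -> None:
        self._lists = defaultdict(lambda: defaultdict(list))

    def add(self, name: str, did: int, record: ElementRecord) -> None:
        records = self._lists[name][did]
        records.append(record)
        records.sort()  # tuples compare by order first

    def descendants(self, name: str, did: int, ancestor: ElementRecord) -> list:
        """All elements called `name` in document `did` below `ancestor`."""
        records = self._lists[name][did]
        orders = [r.order for r in records]
        low = bisect_right(orders, ancestor.order)
        high = bisect_right(orders, ancestor.order + ancestor.size)
        return records[low:high]


index = ElementIndex()
for order in (10, 11, 19, 95):
    index.add("figure", 1, ElementRecord(order, 0))

# Figures below an element labeled <9,10> in document 1.
hits = index.descendants("figure", 1, ElementRecord(9, 10))
```

Because descendants occupy a contiguous range of order values, the lookup is a range scan over the sorted list, which is the access pattern a B+-tree supports well.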

Structure Index

[Figure: Structure Index. A B+-tree keyed on document ID (did) leads to an array of all elements and attributes in the same document. Each array entry holds nid, <order, size>, parent order, child order, sibling order, and attribute order.]

Regular Path expression

Complex regular path expressions, e.g.,
/chapter/_*/figure[@caption=“Tree Frogs”]

Symbol    Function of symbol
_         Any single node
|         Union of nodes
*         Zero or more occurrences of a node
@         Denotes attributes

Regular expression Decomposition

A regular path expression can be decomposed into a combination of the following basic subexpressions:

1. A subexpression with a single element or a single attribute,

2. A subexpression with an element and an attribute (e.g., figure[@caption = “Tree Frogs”]),

3. A subexpression with two elements (e.g., chapter/figure or chapter/_*/figure),

4. A subexpression with a Kleene closure (+,*) of another subexpression, and

5. A subexpression that is a union of two other subexpressions.

Example

(E1/E2)*/E3/((E4[@A=v]) | (E5/_*/E6))

[Figure: expression tree with one join operator per node]
E1/E2                          EE-Join
(E1/E2)*                       KC-Join
(E1/E2)*/E3                    EE-Join
E4[@A=v]                       EA-Join
E5/_*/E6                       EE-Join
(E4[@A=v]) | (E5/_*/E6)        Union
(E1/E2)*/E3/(...)              EE-Join
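The mapping from subexpressions to join operators can be sketched as a bottom-up walk over an expression tree. The node classes and the `plan` function below are hypothetical names chosen for illustration; the talk does not specify how the query processor represents expressions.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Elem:          # a single element name
    name: str


@dataclass
class AttrPred:      # E[@A=v]
    elem: object
    attr: str
    value: str


@dataclass
class Path:          # left/right or left/_*/right
    left: object
    right: object


@dataclass
class Star:          # (body)*
    body: object


@dataclass
class UnionOf:       # left | right
    left: object
    right: object


def plan(expr: object) -> list[str]:
    """List the join operators needed for expr, innermost first."""
    if isinstance(expr, Elem):
        return []
    if isinstance(expr, AttrPred):
        return plan(expr.elem) + ["EA-Join"]
    if isinstance(expr, Path):
        return plan(expr.left) + plan(expr.right) + ["EE-Join"]
    if isinstance(expr, Star):
        return plan(expr.body) + ["KC-Join"]
    if isinstance(expr, UnionOf):
        return plan(expr.left) + plan(expr.right) + ["Union"]
    raise TypeError(f"unknown expression node: {expr!r}")


# (E1/E2)*/E3/((E4[@A=v]) | (E5/_*/E6))
example = Path(
    Path(Star(Path(Elem("E1"), Elem("E2"))), Elem("E3")),
    UnionOf(AttrPred(Elem("E4"), "A", "v"), Path(Elem("E5"), Elem("E6"))),
)
operators = plan(example)
```

For the example expression this yields four EE-Joins, one KC-Join, one EA-Join, and one Union, the same operator counts as in the figure.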

Join algorithms

Element-Attribute join (EA-Join)
Element-Element join (EE-Join)
Kleene-Closure join (KC-Join)

EA-Join Algorithm

Input:
    {E1..Em}: each Ei is a set of elements having a common document identifier;
    {A1..An}: each Aj is a set of attributes having a common document identifier.
Output:
    A set of (e,a) pairs such that the element e is the parent of the attribute a.

// Sort-merge {Ei} and {Aj} by document identifier.
For each Ei and Aj with the same did do
    // Sort-merge Ei and Aj by PARENT-CHILD relationship.
    For each e in Ei and a in Aj do
        If (e is a parent of a) then output (e,a);
    End
End.
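The two nested sort-merge steps can be sketched as a single merge on a composite key. This is an illustration under stated assumptions, not the XISS implementation: it assumes each attribute record stores the order of its parent element (`parent_order`), and the record types and sample values are hypothetical.

```python
from typing import NamedTuple


class Element(NamedTuple):
    did: int
    order: int
    size: int


class Attribute(NamedTuple):
    did: int
    order: int
    parent_order: int  # assumed: order of the owning element
    value: str


def ea_join(elements, attributes):
    """Return (element, attribute) pairs where the element is the attribute's parent."""
    es = sorted(elements, key=lambda e: (e.did, e.order))
    ats = sorted(attributes, key=lambda a: (a.did, a.parent_order))
    pairs = []
    i = j = 0
    while i < len(es) and j < len(ats):
        e_key = (es[i].did, es[i].order)
        a_key = (ats[j].did, ats[j].parent_order)
        if e_key < a_key:
            i += 1          # element has no remaining attributes
        elif e_key > a_key:
            j += 1          # attribute's parent is not in the element list
        else:
            pairs.append((es[i], ats[j]))
            j += 1          # the same element may own further attributes
    return pairs


# Two nested chapter elements, each with a name attribute (values made up).
chapters = [Element(1, 1, 3), Element(1, 3, 1), Element(2, 1, 5)]
names = [Attribute(1, 2, 1, "Intro"), Attribute(1, 4, 3, "Frogs")]
pairs = ea_join(chapters, names)
```

Both inputs are scanned once after sorting, so the merge itself is linear in the combined input size.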

Example

[Figure: example document tree. Node labels: book; chapter, chapter, chapter, appendix; Figure, Figure, Figure.]

Attribute-element position

Layout 1: chapter <1,3>, chapter <2,1>, name <3,0>, name <4,0>
Layout 2: chapter <1,3>, name <2,0>, chapter <3,1>, name <4,0>

EE-Join Algorithm

Input:
    {E1..Em} and {F1..Fn}: each Ei and each Fj is a set of elements having a common document identifier.
Output:
    A set of (e,f) pairs such that the element e is an ancestor of the element f.

// Sort-merge {Ei} and {Fj} by document identifier.
For each Ei and Fj with the same did do
    // Sort-merge Ei and Fj by ANCESTOR-DESCENDANT relationship.
    For each e in Ei and f in Fj do
        If (e is an ancestor of f) then output (e,f);
    End
End.
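A sketch of the merge in Python, assuming elements carry `<order, size>` labels and a document id. The record type and function names are hypothetical. The inner scan restarts from a saved position for each ancestor candidate, which is one simple way to handle nested ancestors and may differ from how the paper's algorithm does it.

```python
from collections import defaultdict
from typing import NamedTuple


class Element(NamedTuple):
    did: int
    order: int
    size: int


def ee_join(ancestors, descendants):
    """Return (e, f) pairs where e is an ancestor of f in the same document."""
    e_by_doc = defaultdict(list)
    f_by_doc = defaultdict(list)
    for e in ancestors:
        e_by_doc[e.did].append(e)
    for f in descendants:
        f_by_doc[f.did].append(f)

    pairs = []
    # Merge on document id: only documents present on both sides can match.
    for did in sorted(e_by_doc.keys() & f_by_doc.keys()):
        es = sorted(e_by_doc[did], key=lambda e: e.order)
        fs = sorted(f_by_doc[did], key=lambda f: f.order)
        start = 0
        for e in es:
            # Skip candidates that start at or before e; since es is sorted,
            # they cannot be below any later e either.
            while start < len(fs) and fs[start].order <= e.order:
                start += 1
            k = start
            # Everything up to order(e) + size(e) lies inside e's subtree.
            while k < len(fs) and fs[k].order <= e.order + e.size:
                pairs.append((e, fs[k]))
                k += 1
    return pairs


# Nested chapters that all contain the same figures.
chapters = [Element(1, 1, 90), Element(1, 2, 80), Element(1, 8, 20), Element(1, 9, 10)]
figures = [Element(1, 10, 0), Element(1, 11, 0), Element(1, 19, 0), Element(2, 5, 0)]
pairs = ee_join(chapters, figures)
```

When the ancestor candidates are nested, each one rescans the same descendants, so the output (and the work) grows with the product of the two lists rather than their sum.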

Extreme case of EE-Join

Nested chapter elements: chapter <1,90>, chapter <2,80>, chapter <8,20>, chapter <9,10>
figure elements: figure <10,0>, figure <11,0>, figure <19,0>

KC-Join Algorithm

Input:
    {E1..Em}: each Ei is a group of elements from an XML document.
Output:
    A Kleene closure of {E1..Em}.

// Apply the EE-Join algorithm repeatedly.
Set i = 1;
Set K1 = {E1..Em};
Repeat
    Set i = i + 1;
    Set Ki = EE-Join(Ki-1, K1);
Until (Ki is empty);
Output union of K1, K2, .., Ki-1.
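The loop can be sketched in Python under one interpretation: each K_i is taken to be the set of chains of i elements in which every element is an ancestor of the next. That reading, the single-document input, and the nested-loop join used in place of a sort-merge are all assumptions made for brevity.

```python
from typing import NamedTuple


class Element(NamedTuple):
    order: int
    size: int


def is_ancestor(x: Element, y: Element) -> bool:
    return x.order < y.order <= x.order + x.size


def extend_chains(chains, elements):
    """EE-Join stand-in: extend each chain with every descendant of its last element."""
    return {
        chain + (f,)
        for chain in chains
        for f in elements
        if is_ancestor(chain[-1], f)
    }


def kc_join(elements):
    """Union of K1, K2, ... until extending the chains produces nothing new."""
    k1 = {(e,) for e in elements}
    levels = [k1]
    while True:
        next_level = extend_chains(levels[-1], elements)
        if not next_level:
            break
        levels.append(next_level)
    return set().union(*levels)


# Four nested chapter elements.
chapters = [Element(1, 90), Element(2, 80), Element(8, 20), Element(9, 10)]
closure = kc_join(chapters)
```

The loop terminates because a chain can never be longer than the nesting depth of the document; for four fully nested chapters it stops after producing chains of length four.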

Experiment Results

Comparison with top-down and bottom-up evaluation methods.
Comparison for EE-Join (E1/_*/E2) and EA-Join (E[@A]).

Scalability test

EE-Join performance

EA-Join performance

Results

The EE-Join algorithm outperformed bottom-up.
The EA-Join algorithm is comparable with top-down but outperformed bottom-up.
Both are linearly scalable.
