XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation

Authors: Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ronald Parr

Speaker: Ho Wai Shing

Contents

Introduction: the problems in XML path selectivity estimation

XPathLearner: the properties and the details

Experiment Results

Conclusions

Future Work

Introduction

XML is becoming the standard for data exchange

We need to query the structure and text data of XML documents

Selectivity estimation is essential for optimizing evaluation plans

Introduction

Example:

FOR $b IN document("*")//book
WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998"
RETURN $b/title

The path expressions:

//book/publisher = "Morgan Kaufmann"
//book/year = "1998"
//book/title

Introduction

We need a structure to store some statistics of the data

Then calculate the estimated selectivity from these statistics

Problem: estimate the selectivity of (simple, single-value, multi-value) path expressions with limited space

Related Work

Path Trees

Markov Tables

k-RO (in Lore)

Path Trees

Aggregate siblings with the same tag

Tag names only (no data values)

Markov Table

The selectivity of short paths, up to length k, is stored

The selectivity of longer paths is estimated using a Markov model

e.g.,

path              count
//DBLP            1
//author          4
//year            3
//DBLP/book       2
//book/author     3
//book/year       2
//article/author  1
//article/year    1
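The counting behind such a table can be sketched in a few lines of Python. This is only an illustration of the offline, scan-based approach for k = 2: the document `DOC` is hypothetical and was chosen so that its counts agree with the example entries, and `build_markov_table` is a name introduced here, not one from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document, chosen so that the counts match the example table.
DOC = """
<DBLP>
  <book><author/><author/><year/></book>
  <book><author/><year/></book>
  <article><author/><year/></article>
</DBLP>
""".strip()


def build_markov_table(root):
    """Count every path of length 1 (tag) and length 2 (parent/child)."""
    table = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        table[(node.tag,)] += 1
        for child in node:
            table[(node.tag, child.tag)] += 1
            stack.append(child)
    return table


table = build_markov_table(ET.fromstring(DOC))
for path, count in sorted(table.items()):
    print("//" + "/".join(path), count)
```

Building the table this way needs a full pass over the data, which is the cost that an online, feedback-driven method tries to avoid.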

k-RO

Used in the Lore system

Very similar to the Markov table

Data values are also objects

Stored as a graph

Twigs

Can answer "twig" queries: a structural query with a small branch

Based on a suffix tree (for simple paths) plus signatures in each node (for estimating branching)

Problems Faced

Offline: need to scan the whole repository beforehand to gather statistics; infeasible if the data is remote and extremely large

Can handle simple path expressions (SPEs) only, or else the structure is too large

Ignore data values

Problems Faced

Not adaptive to the query workload: much space is wasted on infrequently asked paths

No quick update: needs a periodic rescan of the repository

Objective

XPathLearner:

uses a Markov-based approach,

uses an online algorithm,

is adaptive to the workload,

can answer simple paths, single-value paths (//A/B='3') and multi-value paths (//A='2'/B='3'),

considers data values,

can be easily updated

XPathLearner

Architecture

A More Detailed Example

What to Store?

Markov table (1st order in the discussion)

What to Store?

The table may be large if there are many data values

Solution: only "tag-tag", "tag", and top-k value entries are stored exactly; other entries are stored within buckets

default is 1

What is Actually Stored?

Compressed 1st order Markov table (or, Markov histogram)

tag  count
A    1
B    6
C    7
D    7

tag  tag  count
A    B    6
A    C    3
B    C    4
B    D    1
C    D    6

tag  value  count
D    v3     3

tag  feat  sum  #pairs
B    a     1    1
B    b     1    1
D    a     2    2
D    b     2    2
C    a     1    1
C    b     1    1

Assumption: v1-v4 start with 'a', v5-v8 start with 'b', k = 1
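A rough sketch of how the value entries might be compressed, assuming the k most frequent (tag, value) pairs overall are kept exactly and every other pair is added to a bucket keyed by (tag, first character). The raw counts in `RAW`, the function names `compress` and `lookup`, and the use of the bucket average as the answer for a bucketed value are all assumptions made for illustration. The strings "a1" to "a4" and "b5" to "b8" stand in for v1-v4 and v5-v8.

```python
from collections import defaultdict

# Hypothetical raw (tag, value) counts; "a3" plays the role of v3.
RAW = {
    ("D", "a3"): 3, ("D", "a1"): 1, ("D", "a2"): 1,
    ("D", "b5"): 1, ("D", "b6"): 1,
    ("B", "a1"): 1, ("B", "b5"): 1,
    ("C", "a2"): 1, ("C", "b6"): 1,
}


def first_char(value):
    """Bucket feature: the first character of the value."""
    return value[0]


def compress(value_counts, k, feature=first_char):
    """Keep the k most frequent entries exactly; bucket all the others."""
    ranked = sorted(value_counts.items(), key=lambda item: -item[1])
    exact = dict(ranked[:k])
    buckets = defaultdict(lambda: [0, 0])  # (tag, feat) -> [sum, #pairs]
    for (tag, value), count in ranked[k:]:
        bucket = buckets[(tag, feature(value))]
        bucket[0] += count
        bucket[1] += 1
    return exact, dict(buckets)


def lookup(exact, buckets, tag, value, feature=first_char):
    """Exact count if stored, otherwise the average count of the bucket."""
    if (tag, value) in exact:
        return float(exact[(tag, value)])
    total, pairs = buckets.get((tag, feature(value)), (0, 0))
    return total / pairs if pairs else 0.0


exact, buckets = compress(RAW, k=1)
print(exact)
print(buckets)
print(lookup(exact, buckets, "D", "a1"))
```

With k = 1 only the entry for D and v3 survives exactly, and the six buckets carry the same sum and #pairs figures as the example table.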

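Use this formula:

$\sigma(t_1 t_2 \ldots t_n) = N \cdot P(t_1) \prod_{i=1}^{n-1} P(t_{i+1} \mid t_i)$

$\sigma$: selectivity
$t_1, t_2, \ldots, t_n$: tags
$t_1 t_2 \ldots t_n$: path with these tags
$N$: total number of data items
$P(t_{i+1} \mid t_i)$: probability that a $t_i$ item is followed by a $t_{i+1}$ item (1st order Markov assumption)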

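How to Retrieve Selectivity?

Use this formula (it's what we calculate):

$\sigma(t_1 t_2 \ldots t_n) = f(t_1 t_2) \prod_{i=2}^{n-1} \frac{f(t_i t_{i+1})}{f(t_i)}$

$\sigma$: selectivity
$t_1, t_2, \ldots, t_n$: tags
$t_1 t_2 \ldots t_n$: path with these tags
$f(p)$: frequency of the path $p$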


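Use this formula (if it's multi-valued):

$\sigma(t_1{=}v_1 / t_2{=}v_2 / \ldots / t_n{=}v_n) = f(t_1 t_2) \prod_{i=2}^{n-1} \frac{f(t_i t_{i+1})}{f(t_i)} \prod_{i=1}^{n} \frac{f(t_i, v_i)}{f(t_i)}$

$\sigma$: selectivity
$t_1, t_2, \ldots, t_n$: tags
$t_1 t_2 \ldots t_n$: path with these tags
$f(t, v)$: frequency of the value $v$ in tag $t$ (the factor is left out for tags that carry no value)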


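Retrieval Example

for path //B/C/D, estimated selectivity

= f(BC) / f(C) x f(CD) = 4 / 7 x 6 ≈ 3.4

for path //B/C/D=v3, estimated selectivity

= f(BC) / f(C) x f(CD) x f(D,v3) / f(D)

= 4 / 7 x 6 x 3 / 7 ≈ 1.5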
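The retrieval step can be sketched in Python using the frequencies from the example histogram. The function names are hypothetical, and the value factor f(t, v) / f(t) applied to the last tag is an assumption about how a value predicate enters the first-order estimate; only exactly stored entries are looked up, so bucketed values are outside this sketch.

```python
# Frequencies copied from the example histogram.
TAG = {"A": 1, "B": 6, "C": 7, "D": 7}
PAIR = {("A", "B"): 6, ("A", "C"): 3, ("B", "C"): 4,
        ("B", "D"): 1, ("C", "D"): 6}
VALUE = {("D", "v3"): 3}


def estimate_simple(tags):
    """First-order estimate: f(t1 t2) times each f(t_i t_i+1) / f(t_i)."""
    if len(tags) == 1:
        return float(TAG[tags[0]])
    estimate = float(PAIR[(tags[0], tags[1])])
    for prev, cur in zip(tags[1:], tags[2:]):
        estimate *= PAIR[(prev, cur)] / TAG[prev]
    return estimate


def estimate_single_value(tags, value):
    """Assumed value factor: scale by f(t_n, v) / f(t_n) for the last tag."""
    last = tags[-1]
    return estimate_simple(tags) * VALUE[(last, value)] / TAG[last]


print(estimate_simple(["B", "C", "D"]))              # 4 * 6 / 7
print(estimate_single_value(["B", "C", "D"], "v3"))  # 4 * 6 / 7 * 3 / 7
```

A path that uses a pair missing from `PAIR` raises a `KeyError` here; a fuller version would need a policy for unseen entries.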

How to Update?

get the query feedback, e.g., (BCD, 5)

update the histogram entries that are contained in the query so that future estimates are more accurate

e.g., update B, C, D, BC, CD so that the estimate is nearer to 5 than before

two update approaches: the Heavy-tail Rule, the Delta Rule

Heavy Tail Rule

put more of the correction towards the end (tail) of the path

in the update equation, $f_k()$ refers to the frequency before the update and $f_{k+1}()$ refers to the frequency after the update

suggestion: $w_i = 2^i$

Heavy Tail Rule

updating those one-'tag' entries

safeguards the terms that were set by exact query feedback

Heavy Tail Rule

A reminder of what is stored

tag  count
A    1
B    6
C    7
D    7

tag  tag  count
A    B    6
A    C    3
B    C    4
B    D    1
C    D    6

tag  value  count
D    v3     3

tag  feat  sum  #pairs
B    a     1    1
B    b     1    1
D    a     2    2
D    b     2    2
C    a     1    1
C    b     1    1

Heavy Tail Rule

Example: query feedback = (ACD, 6)

by the table, estimation = f(AC) / f(C) x f(CD) = 3 / 7 x 6 ≈ 3

Heavy Tail Rule

updates: f(AC): 3 → 4, f(C): 7 → 8, f(CD): 6 → 8

new estimation = 4 / 8 x 8 = 4
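The update equation itself did not survive in these slides, so the following Python sketch is a guess that happens to reproduce the numbers of the example. It assumes that the error between feedback and estimate is split over the tag-tag entries of the path in proportion to w_i / W, with the suggested weights read as w_i = 2^i and W their sum, that each share is rounded to an integer, and that a one-tag count is then raised to at least the sum of its incoming tag-tag counts. All function names are hypothetical.

```python
def estimate(tag, pair, path):
    """First-order estimate: f(t1 t2) times each f(t_i t_i+1) / f(t_i)."""
    hops = list(zip(path, path[1:]))
    result = float(pair[hops[0]])
    for prev, cur in hops[1:]:
        result *= pair[(prev, cur)] / tag[prev]
    return result


def heavy_tail_update(tag, pair, path, actual):
    """Spread the estimation error over the hops, weighting the tail more."""
    hops = list(zip(path, path[1:]))
    weights = [2 ** i for i in range(1, len(hops) + 1)]  # assumed w_i = 2^i
    total_weight = sum(weights)
    error = actual - estimate(tag, pair, path)
    for hop, weight in zip(hops, weights):
        # Assumption: each hop receives the share w_i / W of the error.
        pair[hop] = max(0, pair[hop] + round(error * weight / total_weight))
    for t in path[1:]:
        # Assumption: every element has one parent, so f(t) >= sum of f(x t).
        incoming = sum(c for (_, child), c in pair.items() if child == t)
        tag[t] = max(tag[t], incoming)


tag = {"A": 1, "B": 6, "C": 7, "D": 7}
pair = {("A", "B"): 6, ("A", "C"): 3, ("B", "C"): 4,
        ("B", "D"): 1, ("C", "D"): 6}

before = estimate(tag, pair, "ACD")
heavy_tail_update(tag, pair, "ACD", actual=6)
after = estimate(tag, pair, "ACD")
print(before, after)
```

Under these assumptions f(AC) goes from 3 to 4, f(CD) from 6 to 8 and f(C) from 7 to 8, so the estimate for ACD moves from about 2.6 to 4, closer to the feedback value of 6.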

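Delta Rule

first proposed by Rumelhart et al.

basic idea: gradient descent on the squared estimation error, so that every stored frequency used by the estimate moves a small step in the direction that reduces the error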
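As a hedged illustration of the delta rule applied to this histogram, the Python sketch below takes one gradient-descent step on the squared error E = (estimate - feedback)^2 / 2 for a simple path. The exact equations from the slide are not preserved, so the error function, the learning rate `rate`, the positivity floor and the function names are assumptions, not the paper's definitions.

```python
def estimate(tag, pair, path):
    """First-order estimate: f(t1 t2) times each f(t_i t_i+1) / f(t_i)."""
    hops = list(zip(path, path[1:]))
    result = float(pair[hops[0]])
    for prev, cur in hops[1:]:
        result *= pair[(prev, cur)] / tag[prev]
    return result


def delta_update(tag, pair, path, actual, rate=0.1):
    """One gradient step on E = (estimate - actual)^2 / 2."""
    est = estimate(tag, pair, path)
    error = est - actual
    grads = {}
    # d(est)/d f(t_i t_i+1) = est / f(t_i t_i+1) for each hop occurrence.
    for hop in zip(path, path[1:]):
        key = ("pair", hop)
        grads[key] = grads.get(key, 0.0) + est / pair[hop]
    # d(est)/d f(t_i) = -est / f(t_i) for each interior tag occurrence.
    for t in path[1:-1]:
        key = ("tag", t)
        grads[key] = grads.get(key, 0.0) - est / tag[t]
    for (kind, key), grad in grads.items():
        store = pair if kind == "pair" else tag
        # Keep frequencies positive so later estimates stay defined.
        store[key] = max(1e-9, store[key] - rate * error * grad)


tag = {"A": 1, "B": 6, "C": 7, "D": 7}
pair = {("A", "B"): 6, ("A", "C"): 3, ("B", "C"): 4,
        ("B", "D"): 1, ("C", "D"): 6}

before = estimate(tag, pair, "ACD")
delta_update(tag, pair, "ACD", actual=6)
after = estimate(tag, pair, "ACD")
print(before, after)
```

Unlike the heavy-tail sketch, the stored frequencies become fractional here, and the size of each correction depends on the chosen learning rate.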

Experiments

Experiments

Data Set: DBLP (other experiments were done but are not included in the paper)

Metric: average absolute error, average relative error
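The two metrics can be written down directly. This Python sketch assumes the relative error is taken with respect to the true selectivity and that every true selectivity is positive; the function names and the sample numbers are illustrative and are not results from the experiments.

```python
def average_absolute_error(estimates, actuals):
    """Mean of |estimate - actual| over all queries."""
    return sum(abs(e - a) for e, a in zip(estimates, actuals)) / len(actuals)


def average_relative_error(estimates, actuals):
    """Mean of |estimate - actual| / actual; assumes every actual > 0."""
    return sum(abs(e - a) / a for e, a in zip(estimates, actuals)) / len(actuals)


# Made-up estimates and true selectivities for two queries.
estimates = [3.0, 4.0]
actuals = [6.0, 5.0]
print(average_absolute_error(estimates, actuals))
print(average_relative_error(estimates, actuals))
```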

Experiments

Conclusions

XPathLearner is a new method for estimating the selectivity of path expressions

It is online, based on query feedback, and doesn't need a database scan

It uses Markov histograms to store statistics

Future Work

change from a fixed-length Markov table to a variable-length Markov table

choose the paths to be stored more carefully

apply the update method to other areas, e.g., graph-based structures, to answer branching queries, etc.

References

[1] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ronald Parr. XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. VLDB'02.
