XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation
Authors: Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ronald Parr
Speaker: Ho Wai Shing
Contents
  Introduction: the problems in XML path selectivity estimation
  XPathLearner: the properties and the details
  Experiment Results
  Conclusions
  Future Work
Introduction
  XML is becoming the standard for data exchange
  We need to query the structure and text data of XML documents
  Selectivity is essential in optimizing evaluation plans
Introduction Example:
  FOR $b IN document("*")//book
  WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998"
  RETURN $b/title
  The path expressions:
    //book/publisher = "Morgan Kaufmann"
    //book/year = "1998"
    //book/title
Introduction
  We need a structure to store some statistics of the data
  Then calculate the estimated selectivity from these statistics
  Problem: estimate the selectivity of (simple, single-value, multi-value) path expressions with limited space
Related Work
  Path Trees
  Markov Tables
  k-RO (in Lore)
Path Trees
  Aggregate siblings with the same tag
  tag names only (no data values)
  e.g., [figure: example path tree]
Markov Table
  the selectivity of short paths up to length k is stored
  the selectivity of longer paths is estimated using a Markov model
  e.g.,
    path               selectivity
    //DBLP             1
    //DBLP/book        2
    //author           4
    //book/author      3
    //article/author   1
    //year             3
    //article/year     1
    //book/year        2
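The Markov-model step can be written out as follows. This is the usual formulation for a table that stores paths up to length k, stated here as an assumption rather than copied from the slide; f(.) denotes the stored selectivity of a path and t_1 ... t_n are the tags of the longer path (n > k).

```latex
f(t_1 t_2 \cdots t_n) \;\approx\;
  f(t_1 \cdots t_k)\,
  \prod_{i=2}^{n-k+1}
    \frac{f(t_i \cdots t_{i+k-1})}{f(t_i \cdots t_{i+k-2})}
% Special case k = 2, n = 3:
%   f(t_1 t_2 t_3) \approx f(t_1 t_2) \cdot f(t_2 t_3) / f(t_2)
```

For k = 2 each factor is the fraction of t_i elements that have a t_{i+1} child, so the estimate assumes a tag depends only on its immediate parent.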
k-RO
  used in Lore systems
  very similar to Markov table
  data values are also objects
  stored as a graph
Twigs
  can answer "twig" queries: a structural query with a small branch
  based on a suffix tree (for simple paths) + signatures in each node (for estimating branching)
Problems Faced
  Offline
    need to scan the whole repository beforehand to gather statistics
    infeasible if the data is remote and extremely large
  Can handle only simple path expressions (SPEs), or else the structure is too large
  Ignore data values
Problems Faced
  Not adaptive to query workload
    much space is wasted on infrequently asked paths
  No quick update
    needs periodic rescan of the repository
Objective
  XPathLearner:
    uses a Markov-based approach,
    uses an online algorithm,
    is adaptive to workload,
    can answer simple paths, single-value paths (//A/B='3') and multi-value paths (//A='2'/B='3'),
    considers data values,
    can be easily updated
XPathLearner
Architecture
A More Detailed Example
What to Store?
  Markov table (1st order in the discussion)
What to Store?
  may be large if there are many data values
  solution: only "tag-tag", "tag", and top-k value entries are stored exactly; other entries are stored within buckets
  default is 1
What is Actually Stored?
  Compressed 1st order Markov table (or, Markov histogram)
    tag  count
    A    1
    B    6
    C    7
    D    7

    tag  tag  count
    A    B    6
    A    C    3
    B    C    4
    B    D    1
    C    D    6

    tag  value  count
    D    v3     3

    tag  feat  sum  #pairs
    B    a     1    1
    B    b     1    1
    D    a     2    2
    D    b     2    2
    C    a     1    1
    C    b     1    1
  assumption: v1-v4 start with 'a', v5-v8 start with 'b', k = 1
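A minimal sketch of how the compressed table could be held in memory and queried for a tag-value frequency. The dictionary layout, the names `TOP_VALUES`, `BUCKETS`, `FIRST_LETTER` and `value_frequency`, and the choice to answer a bucketed value with the bucket average (sum / #pairs) are assumptions made for illustration; only the numbers come from the slide.

```python
from typing import Dict, Tuple

# Exact entries, copied from the example histogram on the slide.
TAG: Dict[str, float] = {"A": 1, "B": 6, "C": 7, "D": 7}
TAG_TAG: Dict[Tuple[str, str], float] = {
    ("A", "B"): 6, ("A", "C"): 3, ("B", "C"): 4, ("B", "D"): 1, ("C", "D"): 6,
}
# Top-k tag-value entries stored exactly (k = 1 in the example).
TOP_VALUES: Dict[Tuple[str, str], float] = {("D", "v3"): 3}

# Buckets keyed by (tag, feature) -> (sum, #pairs).
BUCKETS: Dict[Tuple[str, str], Tuple[float, int]] = {
    ("B", "a"): (1, 1), ("B", "b"): (1, 1),
    ("D", "a"): (2, 2), ("D", "b"): (2, 2),
    ("C", "a"): (1, 1), ("C", "b"): (1, 1),
}

# Slide assumption: v1-v4 start with 'a', v5-v8 start with 'b'.
FIRST_LETTER: Dict[str, str] = {
    f"v{i}": ("a" if i <= 4 else "b") for i in range(1, 9)
}


def value_frequency(tag: str, value: str) -> float:
    """Return f(tag, value): exact if stored, else the bucket average.

    Using sum / #pairs for bucketed values is an assumption of this sketch.
    """
    if (tag, value) in TOP_VALUES:
        return float(TOP_VALUES[(tag, value)])
    feature = FIRST_LETTER.get(value)
    bucket = BUCKETS.get((tag, feature))
    if bucket is None or bucket[1] == 0:
        return 0.0  # no statistics for this (tag, feature) pair
    total, pairs = bucket
    return total / pairs
```

With this layout, `value_frequency("D", "v3")` is answered from the exact entry, while `value_frequency("D", "v1")` falls back to the (D, a) bucket.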
Use this formula
  [formula not recovered]
  where
    selectivity (symbol not recovered)
    t1, t2, ..., tn: tags
    t1t2...tn: path with these tags
    N: total number of data items
How to Retrieve Selectivity?
  Use this formula (it's what we calculate)
  [formula not recovered]
  where
    selectivity (symbol not recovered)
    t1, t2, ..., tn: tags
    t1t2...tn: path with these tags
    f(p): frequency of the path p
How to Retrieve Selectivity?
  Use this formula (if it's multi-valued)
  [formula not recovered]
  where
    selectivity (symbol not recovered)
    t1, t2, ..., tn: tags
    t1t2...tn: path with these tags
    f(t,v): frequency of the value v in tag t
How to Retrieve Selectivity?
  Retrieval Example
    for path //B/C/D, estimated selectivity = [expression not recovered]
    for path //B/C/D=v3, estimated selectivity = [expression not recovered]
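A sketch of the retrieval step on the example histogram. It assumes the first-order form that appears in the Heavy Tail example, f(AC) / f(C) x f(CD), generalised to f(t1t2) x f(t2t3)/f(t2) x ..., and it assumes a trailing value is handled as one more step of the chain, f(D,v3)/f(D). The function name `estimate` and the dictionary layout are illustrative, and the results are what this sketch computes, not figures taken from the slide.

```python
from typing import Dict, List, Optional, Tuple

# Counts copied from the example histogram.
TAG: Dict[str, float] = {"A": 1, "B": 6, "C": 7, "D": 7}
TAG_TAG: Dict[Tuple[str, str], float] = {
    ("A", "B"): 6, ("A", "C"): 3, ("B", "C"): 4, ("B", "D"): 1, ("C", "D"): 6,
}
TAG_VALUE: Dict[Tuple[str, str], float] = {("D", "v3"): 3}


def estimate(tags: List[str], value: Optional[str] = None) -> float:
    """First-order Markov estimate for the path t1/t2/.../tn (optionally = value)."""
    if len(tags) == 1:
        est = float(TAG[tags[0]])
    else:
        # Start from the first tag-tag entry, then chain conditional fractions.
        est = float(TAG_TAG[(tags[0], tags[1])])
        for i in range(1, len(tags) - 1):
            est *= TAG_TAG[(tags[i], tags[i + 1])] / TAG[tags[i]]
    if value is not None:
        # Assumption: the value behaves like one more step after the last tag.
        last = tags[-1]
        est *= TAG_VALUE[(last, value)] / TAG[last]
    return est


if __name__ == "__main__":
    print(estimate(["B", "C", "D"]))        # 4 x 6/7
    print(estimate(["B", "C", "D"], "v3"))  # 4 x 6/7 x 3/7
```

Under these assumptions //B/C/D evaluates to 4 x 6/7 (about 3.4) and //B/C/D=v3 to 4 x 6/7 x 3/7 (about 1.5). A value that is not stored exactly would need the bucket statistics instead of `TAG_VALUE`.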
How to Update?
  get the query feedback, e.g., (BCD, 5)
  update the histogram entries that are contained in the query so that future estimates are more accurate
  e.g., update B, C, D, BC, CD so that the estimate is nearer to 5 than before
  two update approaches: the Heavy-tail Rule, the Delta Rule
Heavy Tail Rule
  put more correction towards the end (tail) of the path
  equation: [formula not recovered]
  fk() refers to the frequency before update
  fk+1() refers to the frequency after update
  suggestion: wi = 2i
Heavy Tail Rule
  updating those one-'tag' entries
  safeguards the terms that were set by exact query feedback
Heavy Tail Rule
  A reminder of what is stored
    tag  count
    A    1
    B    6
    C    7
    D    7

    tag  tag  count
    A    B    6
    A    C    3
    B    C    4
    B    D    1
    C    D    6

    tag  value  count
    D    v3     3

    tag  feat  sum  #pairs
    B    a     1    1
    B    b     1    1
    D    a     2    2
    D    b     2    2
    C    a     1    1
    C    b     1    1
Heavy Tail Rule
  Example: query feedback = (ACD, 6)
  by the table, estimation = f(AC) / f(C) x f(CD) = 3 / 7 x 6 ≈ 3
Heavy Tail Rule
  updates: [values not recovered]
  new estimation = 4 / 8 x 8 = 4
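The arithmetic of this example can be checked with a few lines. The individual updated frequencies did not survive on the slide; reading the new estimate 4 / 8 x 8 in the same form f(AC) / f(C) x f(CD) suggests f(AC) = 4, f(C) = 8, f(CD) = 8, which is an inference and not a stated result. The helper name `first_order_estimate` is illustrative.

```python
def first_order_estimate(f_ac: float, f_c: float, f_cd: float) -> float:
    """Estimate for path A/C/D in the form used on the slide: f(AC) / f(C) x f(CD)."""
    return f_ac / f_c * f_cd


feedback = 6  # true selectivity reported by the query feedback (ACD, 6)

# Before the update: values from the stored table.
before = first_order_estimate(3, 7, 6)

# After the update: values inferred from "4 / 8 x 8" (an assumption).
after = first_order_estimate(4, 8, 8)

if __name__ == "__main__":
    print(f"before: {before:.2f}, error {abs(feedback - before):.2f}")
    print(f"after:  {after:.2f}, error {abs(feedback - after):.2f}")
```

Under that reading, f(AC) grows by 1 and f(CD) by 2, so the entry nearer the tail of the path receives the larger correction, and the estimate moves from about 2.57 to 4, closer to the feedback value 6.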
Delta Rule
  first proposed by Rumelhart et al.
  basic idea: [formula not recovered]
  where [formula not recovered]
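Since the slide's formulas are not available, the following is a sketch of the textbook delta rule (one gradient-descent step on the squared error E = (estimate - feedback)^2 / 2) applied to the first-order estimate. It is not necessarily the exact update used by XPathLearner. The function names, the dictionary layout and the learning rate 0.1 are assumptions.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def estimate(tags: List[str], tag: Dict[str, float], tag_tag: Dict[Pair, float]) -> float:
    """First-order Markov estimate: f(t1t2) x f(t2t3)/f(t2) x ..."""
    if len(tags) == 1:
        return float(tag[tags[0]])
    est = float(tag_tag[(tags[0], tags[1])])
    for i in range(1, len(tags) - 1):
        est *= tag_tag[(tags[i], tags[i + 1])] / tag[tags[i]]
    return est


def delta_rule_update(tags: List[str], feedback: float, tag: Dict[str, float],
                      tag_tag: Dict[Pair, float], rate: float = 0.1) -> None:
    """One gradient step on E = (estimate - feedback)^2 / 2, updating in place."""
    est = estimate(tags, tag, tag_tag)
    err = est - feedback
    if len(tags) == 1:
        tag[tags[0]] -= rate * err  # d(est)/d f(t1) = 1
        return
    # Compute every gradient from the old values before changing any entry.
    grad_pair: Dict[Pair, float] = {}
    grad_tag: Dict[str, float] = {}
    for i in range(len(tags) - 1):
        pair = (tags[i], tags[i + 1])
        # Numerator term: d(est)/d f(pair) = est / f(pair) per occurrence.
        grad_pair[pair] = grad_pair.get(pair, 0.0) + est / tag_tag[pair]
    for i in range(1, len(tags) - 1):
        t = tags[i]
        # Denominator term: d(est)/d f(t) = -est / f(t) per occurrence.
        grad_tag[t] = grad_tag.get(t, 0.0) - est / tag[t]
    for pair, g in grad_pair.items():
        tag_tag[pair] -= rate * err * g
    for t, g in grad_tag.items():
        tag[t] -= rate * err * g
```

Applied to the feedback (ACD, 6) with the example table, one step raises f(AC) and f(CD) and lowers f(C), so the estimate moves towards 6. The step size trades off how quickly a single feedback is absorbed against how much earlier feedback is disturbed.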
Experiments
Experiments
  Data Set: DBLP (other experiments were done but are not included in the paper)
  Metric: average absolute error, average relative error
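The two metrics can be sketched as below, assuming the usual definitions: the absolute error of a query is |actual - estimated|, the relative error divides that by the actual selectivity, and both are averaged over the query workload. The function names are illustrative and the definitions are an assumption about what the experiments used.

```python
from typing import Sequence


def average_absolute_error(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean of |actual - estimated| over all queries."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def average_relative_error(actual: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean of |actual - estimated| / actual; assumes every actual value is non-zero."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - e) / a for a, e in zip(actual, estimated)) / len(actual)
```

For instance, with the feedback values 6 and 5 from the earlier slides and hypothetical estimates 4 and 5, the average absolute error is 1 and the average relative error is 1/6.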
Conclusions
  XPathLearner is a new method for estimating the selectivity of path expressions
  It is online, based on query feedback, and doesn't need a database scan
  It uses Markov histograms to store statistics
Future Work
  change from a fixed-length Markov table to a variable-length Markov table
  choose the paths to be stored more carefully or wisely
  apply the update method to other areas, e.g., graph-based structures, to answer branching queries, etc.
References
[1] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, Ronald Parr. XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. VLDB'02.