sf-tree: an efficient and flexible structure for estimating selectivity of simple path expressions...
DESCRIPTION
MotivationTRANSCRIPT
![Page 1: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/1.jpg)
SF-Tree:An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee
Ho Wai Shing
![Page 2: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/2.jpg)
Contents Motivation Problem Definition Perfect Binary SF-Tree Variations of SF-Tree Experiments Conclusion
![Page 3: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/3.jpg)
Motivation
![Page 4: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/4.jpg)
Motivation XML is becoming more and more
popular e.g., product catalogs, invoices,
documentations, etc We need to extract useful
information out of them Queries, usually in form of XQuery
or XPath, are important
![Page 5: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/5.jpg)
Motivation
![Page 6: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/6.jpg)
Motivation e.g.,
FOR $i IN document(“*”)//buyer/name $j IN document(“*”)//seller/name WHERE $i/text()=$j/text() RETURN $i
![Page 7: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/7.jpg)
Motivation To evaluate this query, different
evaluation plans are available
FOR $i IN document(“*”)//buyer/name $j IN document(“*”)//seller/name WHERE $i/text()=$j/text() RETURN $i
![Page 8: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/8.jpg)
Motivation To select the best query plan, we
need the selectivity of simple path expressions (SPEs)
To retrieve selectivity counts, we need a structure to store suitable statistics of the original XML document collection
![Page 9: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/9.jpg)
Motivation We may use path indices to store
this SPE-to-selectivity mapping e.g., DataGuide, path tree, 1-index
etc. disadvantage: designed for absolute
path expressions (APEs) and inefficient for simple path expressions (SPEs)
![Page 10: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/10.jpg)
Motivation
![Page 11: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/11.jpg)
Motivation To increase efficiency, we may use
suffix trie to store the mapping advantage: really quick, no need to
scan the whole index tree disadvantage: very large, especially
when the data tree is complex
![Page 12: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/12.jpg)
Motivation
![Page 13: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/13.jpg)
Motivation These structures are not flexible
won’t change with frequently asked queries
can’t make trade-offs between space and speed
We can tolerate some errors in selectivity retrieval and it is good to know a bound on the error
![Page 14: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/14.jpg)
Motivation Our goal: to build a new structure for
estimating the selectivity of SPEs which, is faster than the path indices requires less space than suffix tries provides flexible trade-off between space
and time provides accuracy guarantee on the
estimated selectivity
![Page 15: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/15.jpg)
Problem Definition
![Page 16: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/16.jpg)
Problem Definition XML document
is a text document that logically represents a directed edge-labeled graph called data graph
consists of Elements, Attributes and other information items
![Page 17: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/17.jpg)
Problem Definition
![Page 18: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/18.jpg)
Problem Definition Data Paths
a set of information items (e1, e2, …, en)
each ei is directly nested within ei-1 for
2 i n i.e., a path in a data graph (not
necessarily starts from root)
![Page 19: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/19.jpg)
Problem Definition Simple Path Expression (SPE)
asks about the structure of XML documents
has a form: //t1/t2/…/tn where ti are tags
returns all information items (en) which is the end of a data path (e1, e2, …, en) that satisfies the SPE
![Page 20: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/20.jpg)
Problem Definition Satisfy: a data path (e1, e2, …, en)
satisfies an SPE (//t1/t2/…/tn) if: every ei has a tag ti
Selectivity (or counts, selectivity counts) selectivity of an SPE (//t1/t2/…/tn) is the
number of information items en’s that are returned by the SPE
![Page 21: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/21.jpg)
Problem Definition e.g.,
//buyer/name, //name, //invoice/buyer/name are all SPEs
selectivity counts of //name is 2, //products is 1, //invoice is 1
![Page 22: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/22.jpg)
Problem Definition
![Page 23: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/23.jpg)
Problem Definition The Problem:
given a simple path expression p and a collection D of XML documents,
find the estimated selectivity s1 of p in D while s1 should be as close to the actual selectivity s of p in D as possible
![Page 24: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/24.jpg)
SF-Tree
![Page 25: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/25.jpg)
SF-Tree Basic Idea:
divide all possible SPEs in an XML document collection into distinct groups
each SPE in a group has the same selectivity (or similar selectivity)
the task of finding the selectivity of an SPE now becomes the task of finding which selectivity group this SPE belongs to
![Page 26: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/26.jpg)
SF-Tree e.g.,
![Page 27: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/27.jpg)
SF-Tree e.g.,
![Page 28: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/28.jpg)
SF-Tree In each group, we use a signature
file to store all the paths that belong to this group
This reduces the storage requirement, improves the accessing speed but induces errors to the structure
![Page 29: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/29.jpg)
Signature Files A traditional technique to improve
the efficiency of keyword-based information retrieval
Each keyword is hashed into m bit positions of a signature file F
For each keyword in the document, the corresponding m bits in F are set
![Page 30: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/30.jpg)
Signature Files e.g.,
F has 10 bits, m = 3 //name is hashed to bits 2, 3, 8 //buyer is hashed to bits 2, 4, 9
0 1 1 1 0 0 0 1 1 0 1 2 3 4 5 6 7 8 9 10
![Page 31: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/31.jpg)
Signature Files To check the existence of a query
keyword k in F, k is hashed and we check whether all the m bits are set or not
If any of the m bits are unset,k is not in F
If all m bits are set, k is very likely to be in F
![Page 32: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/32.jpg)
Signature Files e.g., (cont)
if a query keyword //seller is hashed to bits 2, 8, 9,
F will say //seller is in it, but it is not true
bits 2, 8, 9 are in fact set by other keywords
a false drop occurred
![Page 33: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/33.jpg)
Signature Files Previous analysis showed that in
optimal case, false drop rate is (1/2)m
|F | = m |D | / ln(2) where D is the set of all keywords in the document
In our case, we treat an SPE in a group as a keyword and a group as a document
![Page 34: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/34.jpg)
SF-Tree There are many groups
for 16-bit selectivity counts, there are 216 groups if each group has a distinct selectivity count
To get the selectivity, we need to find which group this SPE belongs to
Linear scan of all groups is not feasible slow error prone
![Page 35: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/35.jpg)
SF-Tree organize the groups into a tree form called an SF-Tree (stands for
Signature File Tree) Rationale: create representative
groups over several groups, and if an SPE is not in a representative group, it will never be in any of descendant groups of this representative group.
![Page 36: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/36.jpg)
SF-Tree
![Page 37: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/37.jpg)
SF-Tree
![Page 38: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/38.jpg)
SF-Tree Basic selectivity retrieval algorithm:
1. start from the root, 2. check which child contains the SPE p 3. descend to that child, repeat step 2,
3 until a leaf is reached 4. return the selectivity associated with
that leaf
![Page 39: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/39.jpg)
Perfect Binary SF-Tree A very straightforward way to build
representative nodes is to build the nodes according to the most significant bits of the selectivity count
i.e., all the descendants has the same m.s.b. in their selectivity counts
A perfect binary SF-Tree is thus formed
![Page 40: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/40.jpg)
Perfect Binary SF-Tree advantage:
easy to build easy to analyze
disadvantage: less flexible cannot tackle FAQ not that space efficient
![Page 41: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/41.jpg)
Perfect Binary SF-Tree properties: (proofs are skipped) low error rate:
let d be depth of the SF-Tree, a be the depth difference between the false drop leaf, correct leaf and their least common ancestor
for negative queries, false drop rate is:Rneg(d) < 1/2d(m-1)
for positive queries, max error size = 2a, Rpos(d, a) < 1/2ma-a+1
![Page 42: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/42.jpg)
Perfect Binary SF-Tree high accessing speed:
assuming no false drop, Tneg(d) = 2 signatures Tpos(d) = 2d signatures it doesn’t depend on the complexity
of the XML documents
![Page 43: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/43.jpg)
Perfect Binary SF-Tree each false drop induces at most 2
more signature checks expected number of signatures to be
checked for positive query: C < 2d + (2d+2)(d+2)/2m
![Page 44: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/44.jpg)
Perfect Binary SF-Tree storage:
each SPE must appear at one leaf and all the ancestors of that leaf
each SPE contributes m/ln2 bits to each signature that contains it
space requirement = (m/ln2) v (|v|dv) where v is a leaf node
![Page 45: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/45.jpg)
Perfect Binary SF-Tree properties -- summary:
negligible error rate (if m = 10, d = 16, error rate for a negative query is 1/2144!)
error rate decreases as error size increases for positive queries
error rate and accessing time are independent of the complexity of the XML documents
![Page 46: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/46.jpg)
Variations of SF-Tree
![Page 47: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/47.jpg)
Variations of SF-Tree drawbacks of a perfect binary SF-
Tree: every SPE has the same depth in SF-
Tree, cannot optimize for FAQ too many layers, we can tolerate a
higher error rate for smaller space requirement
![Page 48: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/48.jpg)
Variations of SF-Tree Huffman SF-Tree Shannon-Fano SF-Tree Cropped SF-Tree
![Page 49: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/49.jpg)
Huffman SF-Tree build a Huffman tree on the groups
of selectivity the weight of each group can be:
A. the number of SPEs in that group B. the probability of a random query
will fall into that group
![Page 50: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/50.jpg)
Huffman SF-Tree
![Page 51: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/51.jpg)
Huffman SF-Tree advantage:
optimal in space required over all binary SF-Trees if the weight is (A).
optimal in expected number of signatures to be checked if the weight is (B).
disadvantage: error rate is independent of error size
![Page 52: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/52.jpg)
Shannon-Fano SF-Tree build a Shannon-Fano tree over the
selectivity groups the order of the leaves are not
changed similar depth as in Huffman tree error rate now decreases as error
size increases
![Page 53: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/53.jpg)
Shannon-Fano SF-Tree We observed that the distribution of
SPE over all selectivity is very skewed: most of the SPE has a very small
selectivity the distribution is similar to a GP with ratio
smaller than 1. The use of Huffman or Shannon-Fano
tree much reduces the depth of SF-Tree
![Page 54: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/54.jpg)
Shannon-Fano SF-Tree
0
100
200
300
400
500
600
700
800
900
64 128 192 256 320 384 448 512 576
selectivity
num
ber o
f SPE
s
![Page 55: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/55.jpg)
Cropped SF-Tree Cropped SF-Tree is an easy trade-
off between access time, space and accuracy
two types of cropped SF-Tree: head-cropped SF-Tree (the top h
levels of an SF-Tree is removed) tail-cropped SF-Tree (the bottom t
levels of an SF-Tree is removed)
![Page 56: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/56.jpg)
Cropped SF-Tree
original SF-Tree
![Page 57: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/57.jpg)
Head-cropped SF-Tree height of the SF-Tree decreases space required is smaller accessing time increases because
we have to check all the children of the root in the first step
accuracy decreases a little bit because the maximum possible value of a decreases.
![Page 58: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/58.jpg)
Tail-cropped SF-Tree height of the SF-Tree decreases space required is smaller accessing time is shorter (we can
traverse fewer levels) accuracy is lowered (a leaf now
represents a range of selectivity, i.e., the quantization step size of selectivity is larger)
![Page 59: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/59.jpg)
Experiments
![Page 60: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/60.jpg)
Experiments Data Sets
![Page 61: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/61.jpg)
Experiments Query Workloads
positive SPEs: randomly pick 2000 SPEs from all the SPEs in the XML document collection. Each SPE has equal probability of being picked
negative SPEs: randomly concatenate tags in the XML document collection, and then remove those with non-zero selectivity until 2000 SPEs are generated
![Page 62: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/62.jpg)
Experiments
![Page 63: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/63.jpg)
Comparison to Competitors
![Page 64: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/64.jpg)
Performance of SF-Tree
![Page 65: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/65.jpg)
Performance of SF-Tree
![Page 66: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/66.jpg)
Performance of SF-Tree
![Page 67: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/67.jpg)
Conclusions
![Page 68: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/68.jpg)
Conclusions A new structure, SF-Tree, is
introduced SF-Tree is much quicker and path
indices like path trees and uses much less space than suffix trie
SF-Tree provides means to make a trade-off between space, time and accuracy
![Page 69: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/69.jpg)
Conclusions Although SF-Tree returns only
estimated selectivity, but accuracy is guaranteed Neg queries: false drop rate = 1/2d(m-
1), Pos queries: false drop rate = 1/2ma-a+1
max error size = 1/2a
![Page 70: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/70.jpg)
Discussions SF-Tree is a new approach to
approximate some "string-to-number" structures
It may be applicable to other aspects such as approximating a multi-dimensional histogram
![Page 71: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/71.jpg)
Questions?
![Page 72: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/72.jpg)
References A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating
the selectivity of XML path expressions for internet scale applications. In VLDB, 591--600, 2001.
C. Faloutsos and S. Christodoulakis. Signature files: An access method for documents and its analytical performance evaluation. ACM TOIS, 2(4):267--288, 1984.
R. Goldman and J. Widom. Data Guides: Enabling query formulation and optimization in semistructured databases. In VLDB, pp. 436--445, 1997.
L. Lim, M. wang, S. Padmanabhan, J. S. Vitter, and R.Parr. XPathLearner: an on-line self-tuning markov histogram for XML path selectivity estimation. In VLDB 2002.
T. Milo, D. Suciu. Index structures for path expressions. In ICDT, 1999.
N. Polyzotis and M. Garofalakis. Statistical synopses for graph-structured XML databases. In SIGMOD, 2002.
![Page 73: SF-Tree: An Efficient and Flexible Structure for Estimating Selectivity of Simple Path Expressions with Accuracy Guarantee Ho Wai Shing](https://reader036.vdocuments.us/reader036/viewer/2022062905/5a4d1acc7f8b9ab05996fc3c/html5/thumbnails/73.jpg)
References N. Polyzotis and M. Garofalakis. Structure and value synopses
for XML data graphs. In VLDB, 2002. W3C. Extensible markup language (XML) 1.0.
http://www.w3.org/TR/1998/REC-xml-19980210