Post on 22-Dec-2015
Top-K Query Evaluation with Probabilistic Guarantees
Martin Theobald, Gerhard Weikum, Ralf Schenkel
Presenter: Avinandan Sengupta
Presentation Outline
• Introduction to Top-k query processing
• The threshold algorithm and its variants
• Are we solving the right problem?
• A probabilistic algorithm
• Implementation Details
• Results
• Conclusion
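Data and a Query

Query: Top 10 midcap stocks with low β

| Scrip ID | Earnings Per Share | P/E ratio | β | ... | Average Market Cap (B$) |
|---|---|---|---|---|---|
| SNPS | 1.27 | 17.63 | 0.69 | ... | 3.27 |
| IBM | 12.28 | 13.85 | 0.72 | ... | 200 |
| ... | ... | ... | ... | ... | ... |
| INFY | 2.72 | 19.51 | 1.17 | ... | 30.4 |
| MSFT | 2.70 | 9.32 | 1.03 | ... | 210 |
| GOOG | 27.73 | 19.33 | 1.13 | ... | 173 |

Hypothetical DB of NASDAQ traded stocks. Data collated from Google Finance. Columns are the attributes; rows are the objects.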
Hypothetical Graded Lists (made fit for consumption by Top-k processors)

• P/E ratio (norm), normalized as $PE_j / \max_j PE_j$ (PE of the object divided by the highest PE):
  INFY: 1, GOOG: 0.99, SNPS: 0.90, IBM: 0.70, ..., MSFT: 0.47
• $\beta^{-1}$ (norm), normalized as $\beta_j^{-1} / \max_j(\beta_j^{-1})$:
  SNPS: 1, IBM: 0.96, MSFT: 0.67, GOOG: 0.61, ..., INFY: 0.59
• Average Market Cap (B$), graded by how close the market cap is to the midcap median (midcap median ≅ 4.5B); normalized:
  SNPS: 1, INFY: 0.80, ..., GOOG: 0.05, IBM: 0.07, MSFT: 0.08

Aggregate function, with weights 0.5, 1.0, and 1.0 applied to the normalized grades:
$f = 0.5 \cdot \mathrm{P/E} + 1.0 \cdot \beta^{-1} + 1.0 \cdot \mathrm{MCap}$
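
The normalization and the aggregate function above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute values are the five rows visible on the "Data and a Query" slide (the elided rows are ignored), the market-cap grades are copied as given because the slide does not state the grading formula, and the helper names `normalize_by_max` and `aggregate` are made up for this sketch.

```python
# Attribute values copied from the five visible rows of the data slide.
pe = {"SNPS": 17.63, "IBM": 13.85, "INFY": 19.51, "MSFT": 9.32, "GOOG": 19.33}
beta = {"SNPS": 0.69, "IBM": 0.72, "INFY": 1.17, "MSFT": 1.03, "GOOG": 1.13}
# Market-cap grades taken as given on the slide (grading formula not stated there).
mcap_grade = {"SNPS": 1.0, "INFY": 0.80, "GOOG": 0.05, "IBM": 0.07, "MSFT": 0.08}


def normalize_by_max(values):
    """Divide every value by the largest one, so the best object gets grade 1."""
    top = max(values.values())
    return {name: value / top for name, value in values.items()}


pe_norm = normalize_by_max(pe)
beta_inv_norm = normalize_by_max({name: 1.0 / b for name, b in beta.items()})


def aggregate(stock):
    """f = 0.5 * P/E + 1.0 * beta^-1 + 1.0 * MCap, on the normalized grades."""
    return 0.5 * pe_norm[stock] + 1.0 * beta_inv_norm[stock] + 1.0 * mcap_grade[stock]


# Rank the five visible stocks by aggregate score, best first.
ranking = sorted(pe, key=aggregate, reverse=True)
```

Because the elided rows are missing, the resulting order only ranks the five visible stocks; it is not the full top-k answer for the hypothetical database.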
Top-k Processor
The three graded lists (P/E ratio (norm), β-1 (norm), Average Market Cap) are the input to the Top-k processor, which outputs the Top-k results.
Top-k list: SNPS, X; INFY, Y; ...; GOOG, Z
Fagin’s Threshold Algorithm (TA)
• Access the n lists in parallel using sorted access.
• As an object $o_i$ is seen, perform a random access to the other lists to find the complete score for $o_i$.
• Do the same for all objects in the current row.
• Now compute the threshold τ as the aggregate (here, the sum) of the scores in the current row.
• The algorithm stops once k objects have been found with a score of at least τ.
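
The steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes plain summation as the aggregate (weights could be pre-multiplied into the list scores), treats an object missing from a list as scoring 0 there, and uses two small hypothetical lists `l1` and `l2`.

```python
import heapq


def threshold_algorithm(lists, k):
    """lists: one list per attribute of (object, score) pairs, sorted by score descending.
    Returns the top-k (total score, object) pairs under summation, best first."""
    # Random-access index: lookup[i][obj] is the score of obj in list i.
    lookup = [dict(lst) for lst in lists]
    totals = {}
    for row in range(max(len(lst) for lst in lists)):
        last_seen = []
        for lst in lists:
            if row < len(lst):
                obj, score = lst[row]
                last_seen.append(score)
                if obj not in totals:
                    # Random access to all lists to complete the score of obj.
                    totals[obj] = sum(index.get(obj, 0.0) for index in lookup)
            else:
                last_seen.append(0.0)  # exhausted list contributes nothing more
        tau = sum(last_seen)  # threshold: aggregate of the current row
        top = heapq.nlargest(k, ((s, o) for o, s in totals.items()))
        if len(top) == k and top[-1][0] >= tau:
            return top  # k objects with score at least tau: safe to stop
    return heapq.nlargest(k, ((s, o) for o, s in totals.items()))


# Hypothetical graded lists, for illustration only.
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
l2 = [("b", 0.9), ("a", 0.7), ("d", 0.5), ("c", 0.2)]
top2 = threshold_algorithm([l1, l2], 2)
```

On these lists the algorithm stops after the second row: τ has dropped to 0.8 + 0.7 = 1.5, and both `b` and `a` already have total scores of at least 1.5.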
TA with No Random Access (TA-NRA)
• Access the n lists in parallel.
• For an item a, compute its (B)est score
  $B_a = f\big(f\{\mathrm{score}_j \mid j \in \text{seen-attributes}(a)\},\; f\{\mathrm{high}_i \mid i \notin \text{seen-attributes}(a)\}\big)$
  where $\mathrm{high}_i$ is the last seen score for the i-th attribute,
  and its (W)orst score
  $W_a = f\big(f\{\mathrm{score}_j \mid j \in \text{seen-attributes}(a)\},\; f\{0 \mid i \notin \text{seen-attributes}(a)\}\big)$
• Halt when k distinct objects have been seen and there is no object m outside the Top-k list with $B_m > W_k$, where $W_k$ is the worst score of the k-th entry of the Top-k list. This means that we also maintain a table of all seen objects with their W/B scores.

Running Top-k list (contains the k objects with the largest W values; ties broken with B values):
SNPS, W1, B1
INFY, W2, B2
...
GOOG, Wk, Bk
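
A minimal sketch of this bookkeeping, assuming summation as the aggregate and non-empty input lists. It recomputes all bounds after every row for readability, which a real implementation would avoid; the function name `nra` and the lists `l1`, `l2` are hypothetical. Objects that have not been seen at all are covered by checking that the sum of the `high` values cannot exceed the k-th worst score.

```python
def nra(lists, k):
    """lists: per-attribute lists of (object, score), sorted by score descending.
    Returns up to k tuples (object, worst, best), ordered by worst then best score."""
    n = len(lists)
    seen = {}                            # object -> {list index: score seen there}
    high = [lst[0][1] for lst in lists]  # last score seen in each list
    ranked, bounds = [], {}
    for row in range(max(len(lst) for lst in lists)):
        for i, lst in enumerate(lists):
            if row < len(lst):
                obj, score = lst[row]
                seen.setdefault(obj, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0            # exhausted list: nothing left to gain
        bounds = {}
        for obj, scores in seen.items():
            worst = sum(scores.values())  # unseen attributes count as 0
            best = worst + sum(high[i] for i in range(n) if i not in scores)
            bounds[obj] = (worst, best)
        # Largest W first; ties broken by B (tuple comparison).
        ranked = sorted(bounds, key=bounds.get, reverse=True)
        if len(ranked) >= k:
            w_k = bounds[ranked[k - 1]][0]
            outsiders_ok = all(bounds[o][1] <= w_k for o in ranked[k:])
            # sum(high) is the best score of any object not seen so far.
            if outsiders_ok and sum(high) <= w_k:
                break
    return [(o,) + bounds[o] for o in ranked[:k]]


# Hypothetical graded lists, for illustration only.
l1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
l2 = [("b", 0.9), ("a", 0.7), ("d", 0.5), ("c", 0.2)]
top1 = nra([l1, l2], 1)
```

Note that no random access is used: an object's score in a list is only known once the sorted scan reaches it, which is why the candidate table with W/B bounds is needed.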
Issues with TA and TA-NRA
• High space-time costs
• Overly conservative
Are we solving the right problem?
• Is random access possible in most common scenarios?
  – Web content
  – XML data, hierarchical data sets
• Does the user need an exact top-k query result?
  – Or is she satisfied with an approximation?
How about an approximate solution?
• Can we remove candidates (objects that we think can make it to the top-k list) from consideration early on in the process?
  – Quickly reach a solution
Pictorially...
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
Probabilistic TA-NRA - 1
• Predict the total score of an item for which a partial score is known
• Avoid the overly conservative best-score/worst-score bounds of the original TA-NRA
  – Instead, calculate the probability that the total score of the item exceeds a threshold (making the item interesting for the top-k result)
Probabilistic TA-NRA - 2
• If this probability is sufficiently low (below a threshold ε), drop the item from the candidate list.
• The probabilistic prediction involves computing the convolution of the score distributions of different index lists.
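
As a rough sketch of this pruning test, the code below represents the score distribution of each not-yet-seen list as an equi-width histogram, convolves the histograms, and drops a candidate when the probability of exceeding the current k-th score falls below ε. Everything here is an assumption made for illustration: the bucket width, the example histogram, and the function names `convolve`, `prob_exceeds`, and `keep_candidate` are hypothetical, and the estimators used in the paper may differ.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X and Y, each given as a list of
    bucket probabilities where index b stands for the score b * width."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def prob_exceeds(worst, unseen_hists, threshold, width):
    """P[worst + sum of the unseen scores > threshold]."""
    dist = [1.0]  # distribution of an empty sum: all mass at 0
    for hist in unseen_hists:
        dist = convolve(dist, hist)
    return sum(p for b, p in enumerate(dist) if worst + b * width > threshold)


def keep_candidate(worst, unseen_hists, min_k, epsilon, width=0.1):
    """Keep the candidate only if it still has probability >= epsilon of
    beating min_k, the current k-th best worst score."""
    return prob_exceeds(worst, unseen_hists, min_k, width) >= epsilon


# Hypothetical histogram over the scores 0.0, 0.1, 0.2 of one unseen list.
hist = [0.5, 0.3, 0.2]
p = prob_exceeds(worst=0.5, unseen_hists=[hist, hist], threshold=0.75, width=0.1)
```

With these numbers the candidate needs at least 0.3 more from the two unseen lists, which happens with probability 0.12 + 0.04 = 0.16, so it is kept for ε = 0.1 and dropped for ε = 0.2.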
Score Distribution of Lists - How?
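[Figure: the scores of the β-1 (norm) list (SNPS: 1, IBM: 0.96, MSFT: 0.67, GOOG: 0.61, ..., INFY: 0.59) plotted over the score range 0.59 to 1.0, with median 0.65. The figure labels three steps (1 to 3) that lead to the score distribution through parameter fitting and curve fitting.]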
What it is and What it is not
• Probabilistic guarantees are not about query run-times but about query result quality
• Probabilistic guarantees refer to the approximation of the top-k ranks in a completely scored and exactly ranked result set
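
One common way to quantify this kind of result quality is precision: the fraction of the approximate top-k that also appears in the exact top-k. The sketch below is an illustration only; the function name `precision_at_k` and both result lists are hypothetical, not taken from the paper's experiments.

```python
def precision_at_k(approx_top_k, exact_top_k):
    """Fraction of the exact top-k objects that the approximate result also returned."""
    k = len(exact_top_k)
    return len(set(approx_top_k) & set(exact_top_k)) / k


# Hypothetical results for k = 3.
exact = ["SNPS", "INFY", "IBM"]
approx = ["SNPS", "IBM", "GOOG"]
quality = precision_at_k(approx, exact)
```

Here two of the three exact results were found, so the precision is 2/3; query run-time does not enter the measure at all.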
The Math
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
Figure annotation: set of seen attributes for an object
More Math...
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
What distributions to consider?
• Uniform distribution
  – simplest assumptions
  – convolutions based on moment-generating functions with generalized Chernoff-Hoeffding bounds
• Poisson estimations
  – efficiently evaluated, provides a reasonable fit for tf*idf-based score distributions for Web corpora
• Histograms
  – when the above methods fail
  – involves non-trivial computation (done offline per list)
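
To give a feel for the bound-based route under the uniform assumption, the sketch below uses the plain Hoeffding inequality, swapped in here for the generalized Chernoff-Hoeffding bounds named on the slide (whose exact form is not given in this transcript). It assumes each unseen score is independent and uniform on [0, high_i]; the function name and parameters are hypothetical.

```python
import math


def hoeffding_tail_bound(highs, worst, threshold):
    """Upper bound on P[worst + sum_i X_i >= threshold] for independent X_i in
    [0, highs[i]] with mean highs[i] / 2 (the uniform assumption)."""
    if not highs:
        return 1.0 if worst >= threshold else 0.0
    mean = sum(h / 2.0 for h in highs)
    t = threshold - worst - mean  # required deviation above the expected sum
    if t <= 0:
        return 1.0  # Hoeffding says nothing useful at or below the mean
    # Hoeffding: P[S - E[S] >= t] <= exp(-2 t^2 / sum_i (b_i - a_i)^2)
    return math.exp(-2.0 * t * t / sum(h * h for h in highs))


# Two unseen lists with last seen score 1.0 each; candidate needs 1.5 more.
bound = hoeffding_tail_bound([1.0, 1.0], worst=0.0, threshold=1.5)
```

The bound is cheap to evaluate but loose: for this example it gives exp(-0.25), roughly 0.78, while the exact probability for two uniform scores on [0, 1] is 0.125. Since it only overestimates the probability, pruning with it errs on the side of keeping candidates.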
Solving Convolutions? Difficult
• When the PDF is a uniform distribution, solving the convolution becomes difficult
  – Use alternate techniques other than convolution
  – Off-load computation to available probabilistic engines such as OpenMaple
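
A small worked case illustrates why the exact route gets unwieldy. Assume, purely for illustration, two unseen scores $X_1, X_2$ that are independent and uniform on $[0, 1]$ (the slides do not fix the intervals). Their sum $S = X_1 + X_2$ already has a piecewise density:

```latex
f_S(s) = \int_0^1 f_{X_1}(x)\, f_{X_2}(s - x)\, dx =
\begin{cases}
  s,     & 0 \le s \le 1, \\
  2 - s, & 1 < s \le 2,
\end{cases}
\qquad
P[S > \delta] =
\begin{cases}
  1 - \dfrac{\delta^2}{2},   & 0 \le \delta \le 1, \\[6pt]
  \dfrac{(2 - \delta)^2}{2}, & 1 < \delta \le 2.
\end{cases}
```

For example, $P[S > 1.5] = 0.5^2 / 2 = 0.125$. With m lists the density of the sum is a piecewise polynomial of degree m - 1 with m pieces (the Irwin-Hall distribution), and unequal interval lengths add further case distinctions, which is one plausible reason for preferring bounds or an external engine.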
Queue Management
Source: http://www.mpi-inf.mpg.de/~mtb/pub/imprs-topk-xml_poster.pdf (author’s webpage)
Results
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
Performance as a function of ε
Source: Paper
Precision of probabilistic predictors for tf*idf, Uniform-, and Zipf-distributed scores
Source: Paper
Conclusion
• New algorithms were developed based on probabilistic score predictions
  – Trade off a small amount of top-k result quality for a drastic reduction in sorted accesses
• Intelligent management of priority queues for an efficient implementation was presented
• The aggregation function was assumed to be summation
• Future work: ranked retrieval of XML data and integration into the XXL search engine
Thanks!