ripple joins for online aggregation

24
Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C

Upload: chance

Post on 13-Jan-2016

41 views

Category:

Documents


1 download

DESCRIPTION

Ripple Joins for Online Aggregation. by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C. In Simple words. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Ripple Joins for Online Aggregation

Ripple Joins for Online Aggregation

byPeter J. Haas and Joseph M. Hellersteinpublished in June 1999

presented byNag Prajval B.C

Page 2: Ripple Joins for Online Aggregation

In Simple words

This paper tells how to join a bunch of tables and get the SUM, COUNT, or AVG in GROUP BY clauses showing approximate results immediately and the confidence interval of the results from the first few tuples retrieved updating a GUI display with closer approximation information as the join adds more tuples.

Page 3: Ripple Joins for Online Aggregation

Motivation Traditional Join Algorithms

Nested- Loop Join Merge Join Hash Join

Main Goal: These join algorithms are designed minimize the time to completion of query.

Page 4: Ripple Joins for Online Aggregation

Online System Huge database with large tables.

Select student.honors_code,avg(enroll grade)From enroll, studentWhere enroll.sid=student.sid

Group by student.honors_code;

Page 5: Ripple Joins for Online Aggregation

Traditional Algorithms Traditional algorithms take a lot of time since

they have to process the entire tables or relations

The users have to wait for a long time before the results are returned.

Many times aggregation queries are used to get a big picture of a dataset.

More Appropriate method would be Online Aggregation.

Page 6: Ripple Joins for Online Aggregation

Online Aggregation A running estimate of the final aggregates are

continuously displayed to the user. The proximity of the running estimate to the

final result is also displayed to the user.(confidence interval)

Quick results rather than minimize time for completion.

This required some changes to be made to the traditional algorithms for query processing.

Page 7: Ripple Joins for Online Aggregation

GUI, 1999

Page 8: Ripple Joins for Online Aggregation

Join Algorithms for Online Processing Traditional Nested loop algorithm could be used in an

online fashion. Drawbacks: 1. If R was of non trivial size then time between successive

updates could be excessive. 2. consider a query like this: Select AVG(S.a+ R.b/1000000) From S and R Where S.x=R.x As scan progresses no new information is retrieved ie the

Confidence Intervals do not shrink well.

Page 9: Ripple Joins for Online Aggregation

Ripple Join: OperationAssume ripple join of relations R and S Select a random tuple r from R. Join with previously selected S tuples. Select a random tuple s from S. Join with previously selected R tuples. Join r and s.

Page 10: Ripple Joins for Online Aggregation

Example:

Page 11: Ripple Joins for Online Aggregation

Aspect Ratios

Aspect ratio: How many tuples are retrieved from each base relation per sampling step.

e.g. β1 = 1, β2 = 3, … Square Ripple Join:Samples are drawn from R

and S at the same rate. Rectangular Ripple Join:One relation is sampled

at a higher rate than another to provide shortest possible Confidence Intervals.

Page 12: Ripple Joins for Online Aggregation

Ripple Join AlgorithmFor(max=1 to infinity){ for(i=1 to max-1) if(predicate(R[i],s[max])) output(R[i],S[max]); for(i=1 to max) if(predicate(R[max],s[i])) output(R[max],S[i]);}

Page 13: Ripple Joins for Online Aggregation

Ripple Join Algorithm Generalization of Nested loop join Roles of Inner and Outer Relations continuosly

interchange during processing Cursor at S fixed at max=n whil cursor into R

loops from 1 to n-1 When cursor into R reaches value n,the cursor

into S loops from 1 to n.

Page 14: Ripple Joins for Online Aggregation

Why call this "Ripple Join"?

1. The algorithm seems to ripple out from a corner of the join.

2. Acronym: "Rectangles of Increasing Perimeter Length"

Page 15: Ripple Joins for Online Aggregation

Iterators Ripple joins can be developed using the iterator

model.

2 variants 1.Square Binary Ripple Join Iterator 2.Enhanced Ripple Join Iterator

Page 16: Ripple Joins for Online Aggregation

Square Binary Ripple Join Iterator

Page 17: Ripple Joins for Online Aggregation

Enhanced Ripple Join Iterator Drawbacks of Square Ripple join iterator are

handled 1)Non Unitary aspect ratios

2)Pipelining multiple ripple joins

Page 18: Ripple Joins for Online Aggregation

Ripple Join Variants

Block Ripple Join Index Ripple Join Hash Ripple Join

Page 19: Ripple Joins for Online Aggregation

Performance Goal: Provide efficient and accurate estimation. Join results should be returned in a such a way

that aggregates are updated regularly and confidence intervals shrink rapidly.

Depends on 2 important things: 1.Estimators for SUM,COUNT and AVG 2.Confidence intervals

Page 20: Ripple Joins for Online Aggregation

Estimators for SUM,COUNT and AVG

Consider a query of the form SELECT op(expression) from R,S WHERE predicate;

At end of nth sampling step estimator of SUM(expression) is

Page 21: Ripple Joins for Online Aggregation

Estimators for SUM,COUNT and AVG

Rn and Sn :Set of Tuples that have been read at nth sampling step

R and S :Set of Relations expressionP(r,s) equals expression(r,s) if (r,s)

satisfies where clause and 0 otherwise Count(*): expressionP(r,s) replaced by

oneP(r,s) where oneP(r,s) =1 if (r,s) satisfies where clause and 0 otherwise. AVG(expression)=SUM estimator/COUNT

estimator.

Page 22: Ripple Joins for Online Aggregation

Confidence interval A running confidence interval displays how

close this answer is to the final result. This could be calculated in many ways. The authors present an example calculation

built on extending the Central Limit Theorem.

Page 23: Ripple Joins for Online Aggregation

Ripple Joins Ripple joins are designed to minimize the time

until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval.

Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data.

Page 24: Ripple Joins for Online Aggregation

Ripple Joins Ripple joins also permit the user to dynamically trade off

the two key performance factors of online aggregation: the time between successive updates of the running aggregate, and the amount by which the confidence-interval length decreases at each update.

Ripple joins appear to be among the first database algorithms to use statistical information about the data not just to estimate selectivities and processing costs, but to estimate the quality of the result currently being displayed to the user and to dynamically adjust algorithm behavior accordingly.