ripple joins for online aggregation
DESCRIPTION
Ripple Joins for Online Aggregation. by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C. In Simple words. - PowerPoint PPT PresentationTRANSCRIPT
Ripple Joins for Online Aggregation
byPeter J. Haas and Joseph M. Hellersteinpublished in June 1999
presented byNag Prajval B.C
In Simple words
This paper tells how to join a bunch of tables and get the SUM, COUNT, or AVG in GROUP BY clauses showing approximate results immediately and the confidence interval of the results from the first few tuples retrieved updating a GUI display with closer approximation information as the join adds more tuples.
Motivation Traditional Join Algorithms
Nested- Loop Join Merge Join Hash Join
Main Goal: These join algorithms are designed minimize the time to completion of query.
Online System Huge database with large tables.
Select student.honors_code,avg(enroll grade)From enroll, studentWhere enroll.sid=student.sid
Group by student.honors_code;
Traditional Algorithms Traditional algorithms take a lot of time since
they have to process the entire tables or relations
The users have to wait for a long time before the results are returned.
Many times aggregation queries are used to get a big picture of a dataset.
More Appropriate method would be Online Aggregation.
Online Aggregation A running estimate of the final aggregates are
continuously displayed to the user. The proximity of the running estimate to the
final result is also displayed to the user.(confidence interval)
Quick results rather than minimize time for completion.
This required some changes to be made to the traditional algorithms for query processing.
GUI, 1999
Join Algorithms for Online Processing Traditional Nested loop algorithm could be used in an
online fashion. Drawbacks: 1. If R was of non trivial size then time between successive
updates could be excessive. 2. consider a query like this: Select AVG(S.a+ R.b/1000000) From S and R Where S.x=R.x As scan progresses no new information is retrieved ie the
Confidence Intervals do not shrink well.
Ripple Join: OperationAssume ripple join of relations R and S Select a random tuple r from R. Join with previously selected S tuples. Select a random tuple s from S. Join with previously selected R tuples. Join r and s.
Example:
Aspect Ratios
Aspect ratio: How many tuples are retrieved from each base relation per sampling step.
e.g. β1 = 1, β2 = 3, … Square Ripple Join:Samples are drawn from R
and S at the same rate. Rectangular Ripple Join:One relation is sampled
at a higher rate than another to provide shortest possible Confidence Intervals.
Ripple Join AlgorithmFor(max=1 to infinity){ for(i=1 to max-1) if(predicate(R[i],s[max])) output(R[i],S[max]); for(i=1 to max) if(predicate(R[max],s[i])) output(R[max],S[i]);}
Ripple Join Algorithm Generalization of Nested loop join Roles of Inner and Outer Relations continuosly
interchange during processing Cursor at S fixed at max=n whil cursor into R
loops from 1 to n-1 When cursor into R reaches value n,the cursor
into S loops from 1 to n.
Why call this "Ripple Join"?
1. The algorithm seems to ripple out from a corner of the join.
2. Acronym: "Rectangles of Increasing Perimeter Length"
Iterators Ripple joins can be developed using the iterator
model.
2 variants 1.Square Binary Ripple Join Iterator 2.Enhanced Ripple Join Iterator
Square Binary Ripple Join Iterator
Enhanced Ripple Join Iterator Drawbacks of Square Ripple join iterator are
handled 1)Non Unitary aspect ratios
2)Pipelining multiple ripple joins
Ripple Join Variants
Block Ripple Join Index Ripple Join Hash Ripple Join
Performance Goal: Provide efficient and accurate estimation. Join results should be returned in a such a way
that aggregates are updated regularly and confidence intervals shrink rapidly.
Depends on 2 important things: 1.Estimators for SUM,COUNT and AVG 2.Confidence intervals
Estimators for SUM,COUNT and AVG
Consider a query of the form SELECT op(expression) from R,S WHERE predicate;
At end of nth sampling step estimator of SUM(expression) is
Estimators for SUM,COUNT and AVG
Rn and Sn :Set of Tuples that have been read at nth sampling step
R and S :Set of Relations expressionP(r,s) equals expression(r,s) if (r,s)
satisfies where clause and 0 otherwise Count(*): expressionP(r,s) replaced by
oneP(r,s) where oneP(r,s) =1 if (r,s) satisfies where clause and 0 otherwise. AVG(expression)=SUM estimator/COUNT
estimator.
Confidence interval A running confidence interval displays how
close this answer is to the final result. This could be calculated in many ways. The authors present an example calculation
built on extending the Central Limit Theorem.
Ripple Joins Ripple joins are designed to minimize the time
until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval.
Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data.
Ripple Joins Ripple joins also permit the user to dynamically trade off
the two key performance factors of online aggregation: the time between successive updates of the running aggregate, and the amount by which the confidence-interval length decreases at each update.
Ripple joins appear to be among the first database algorithms to use statistical information about the data not just to estimate selectivities and processing costs, but to estimate the quality of the result currently being displayed to the user and to dynamically adjust algorithm behavior accordingly.