ripple joins for online aggregation

Ripple Joins for Online Aggregation

byPeter J. Haas and Joseph M. Hellersteinpublished in June 1999

presented byNag Prajval B.C

In Simple words

This paper tells how to join a bunch of tables and get the SUM, COUNT, or AVG in GROUP BY clauses showing approximate results immediately and the confidence interval of the results from the first few tuples retrieved updating a GUI display with closer approximation information as the join adds more tuples.

Motivation Traditional Join Algorithms

Nested- Loop Join Merge Join Hash Join

Main Goal: These join algorithms are designed minimize the time to completion of query.

Online System Huge database with large tables.

Select student.honors_code,avg(enroll grade)From enroll, studentWhere enroll.sid=student.sid

Group by student.honors_code;

Traditional Algorithms Traditional algorithms take a lot of time since

they have to process the entire tables or relations

The users have to wait for a long time before the results are returned.

Many times aggregation queries are used to get a big picture of a dataset.

More Appropriate method would be Online Aggregation.

Online Aggregation A running estimate of the final aggregates are

continuously displayed to the user. The proximity of the running estimate to the

final result is also displayed to the user.(confidence interval)

Quick results rather than minimize time for completion.

This required some changes to be made to the traditional algorithms for query processing.

GUI, 1999

Join Algorithms for Online Processing Traditional Nested loop algorithm could be used in an

online fashion. Drawbacks: 1. If R was of non trivial size then time between successive

updates could be excessive. 2. consider a query like this: Select AVG(S.a+ R.b/1000000) From S and R Where S.x=R.x As scan progresses no new information is retrieved ie the

Confidence Intervals do not shrink well.

Ripple Join: OperationAssume ripple join of relations R and S Select a random tuple r from R. Join with previously selected S tuples. Select a random tuple s from S. Join with previously selected R tuples. Join r and s.

Example:

Aspect Ratios

Aspect ratio: How many tuples are retrieved from each base relation per sampling step.

e.g. β1 = 1, β2 = 3, … Square Ripple Join:Samples are drawn from R

and S at the same rate. Rectangular Ripple Join:One relation is sampled

at a higher rate than another to provide shortest possible Confidence Intervals.

Ripple Join AlgorithmFor(max=1 to infinity){ for(i=1 to max-1) if(predicate(R[i],s[max])) output(R[i],S[max]); for(i=1 to max) if(predicate(R[max],s[i])) output(R[max],S[i]);}

Ripple Join Algorithm Generalization of Nested loop join Roles of Inner and Outer Relations continuosly

interchange during processing Cursor at S fixed at max=n whil cursor into R

loops from 1 to n-1 When cursor into R reaches value n,the cursor

into S loops from 1 to n.

Why call this "Ripple Join"?

1. The algorithm seems to ripple out from a corner of the join.

2. Acronym: "Rectangles of Increasing Perimeter Length"

Iterators Ripple joins can be developed using the iterator

model.

2 variants 1.Square Binary Ripple Join Iterator 2.Enhanced Ripple Join Iterator

Square Binary Ripple Join Iterator

Enhanced Ripple Join Iterator Drawbacks of Square Ripple join iterator are

handled 1)Non Unitary aspect ratios

2)Pipelining multiple ripple joins

Ripple Join Variants

Block Ripple Join Index Ripple Join Hash Ripple Join

Performance Goal: Provide efficient and accurate estimation. Join results should be returned in a such a way

that aggregates are updated regularly and confidence intervals shrink rapidly.

Depends on 2 important things: 1.Estimators for SUM,COUNT and AVG 2.Confidence intervals

Estimators for SUM,COUNT and AVG

Consider a query of the form SELECT op(expression) from R,S WHERE predicate;

At end of nth sampling step estimator of SUM(expression) is

Estimators for SUM,COUNT and AVG

Rn and Sn :Set of Tuples that have been read at nth sampling step

R and S :Set of Relations expressionP(r,s) equals expression(r,s) if (r,s)

satisfies where clause and 0 otherwise Count(*): expressionP(r,s) replaced by

oneP(r,s) where oneP(r,s) =1 if (r,s) satisfies where clause and 0 otherwise. AVG(expression)=SUM estimator/COUNT

estimator.

Confidence interval A running confidence interval displays how

close this answer is to the final result. This could be calculated in many ways. The authors present an example calculation

built on extending the Central Limit Theorem.

Ripple Joins Ripple joins are designed to minimize the time

until an acceptably precise estimate of the query result is available, as measured by the length of a confidence interval.

Ripple joins are adaptive, adjusting their behavior during processing in accordance with the statistical properties of the data.

Ripple Joins Ripple joins also permit the user to dynamically trade off

the two key performance factors of online aggregation: the time between successive updates of the running aggregate, and the amount by which the confidence-interval length decreases at each update.

Ripple joins appear to be among the first database algorithms to use statistical information about the data not just to estimate selectivities and processing costs, but to estimate the quality of the result currently being displayed to the user and to dynamically adjust algorithm behavior accordingly.

ripple joins for online aggregation

Documents