Title: Ripple Joins for Online Aggregation
1. Ripple Joins for Online Aggregation
- by Peter J. Haas and Joseph M. Hellerstein
- published in June 1999
- presented by Ronda Hilton
2. Overview
- This paper shows how to join multiple tables
- and compute SUM, COUNT, or AVG aggregates
- in GROUP BY queries,
- showing approximate results immediately,
- along with a confidence interval for those results,
- from the first few tuples retrieved,
- updating a GUI display with ever-closer approximations
  as the join processes more tuples.
3. Ripple joins compared to our previous topics
- General research area: algorithms
- another approximation algorithm
- online processing
- does not maintain a precomputed sample set
- aggregate queries: joins and GROUP BY
- requires random retrieval order
- uses probabilistic calculations to determine the
  quality of the approximate result, not for query
  optimization
- implemented as middleware on top of the DBMS
4. Traditional hash join stores the smaller relation
in memory
- Two relations R and S
- with a common attribute;
- for each distinct value of that attribute,
- match up the tuples that have the same value.
- Example
- select R.roomnumber,
-        COUNT(S.homeroom)
- from Rooms R
- join Student S on
-   R.roomnumber = S.homeroom
- group by R.roomnumber
- For each tuple r in R
- add r under hash(roomnumber) to the hash table in memory
- if the hash table has filled up memory
- for every tuple s in S
- if hash(homeroom) is found in the hash table
- add the matching tuples r and s to the output
- reset the hash table
- Finally, scan S one last time and add the resulting
join tuples to the output.
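A minimal in-memory sketch of the build-and-probe steps above, assuming the build side (Rooms) fits entirely in memory; the table contents are hypothetical:

```python
from collections import defaultdict

def hash_join(rooms, students):
    # Build phase: hash the smaller relation (Rooms) on the join attribute.
    table = defaultdict(list)
    for r in rooms:
        table[r["roomnumber"]].append(r)
    # Probe phase: scan Student, matching each tuple against the hash table.
    out = []
    for s in students:
        for r in table[s["homeroom"]]:
            out.append((r["roomnumber"], s["name"]))
    return out

rooms = [{"roomnumber": 101}, {"roomnumber": 102}]
students = [{"name": "Ann", "homeroom": 101},
            {"name": "Bo", "homeroom": 101},
            {"name": "Cy", "homeroom": 103}]
print(hash_join(rooms, students))  # → [(101, 'Ann'), (101, 'Bo')]
```

The memory-overflow handling described above (probe, reset, repeat) is omitted here to keep the build/probe structure visible.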
5. What's different about ripple join?
- Traditional hash join blocks until the entire
  query output is finished; in its inner loop, an
  entire table is scanned.
- Ripple join reports approximate results after
  each sampling step, and allows user intervention.
- Ripple join expands the sample set incrementally.
6. The most important difference
- The tuples are processed in random order.
7. Pipelining
- In pipelining join algorithms, as the join
  progresses, more and more information gets added
  to the result.
- In ripple joins, each new tuple gets joined with
  all previously-seen tuples of the other
  operand(s).
- The relative rates of the two (or more) operands
  are dynamically adjusted.
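The middle bullet can be sketched as a "square" ripple join: each sampling step draws one new tuple from each input and joins it against everything previously seen on the other side. This is an illustrative sketch, not the paper's interface; the list inputs and generic predicate are assumptions:

```python
import random

def ripple_join(R, S, predicate, steps):
    """Square ripple join sketch: one new tuple per input per step."""
    R, S = list(R), list(S)
    random.shuffle(R)  # ripple join requires random retrieval order
    random.shuffle(S)
    seen_r, seen_s, out = [], [], []
    for n in range(steps):
        r, s = R[n], S[n]
        # New r joins all previously seen S-tuples...
        for s_old in seen_s:
            if predicate(r, s_old):
                out.append((r, s_old))
        seen_r.append(r)
        # ...and new s joins all seen R-tuples, including the new r
        # (the growing "corner" of the square).
        for r_old in seen_r:
            if predicate(r_old, s):
                out.append((r_old, s))
        seen_s.append(s)
    return out

# After n steps, out is exactly the join restricted to the n-by-n sample.
sample_pairs = ripple_join(range(100), range(100), lambda r, s: r == s, 10)
```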
8. Worst-case scenario
- Ripple join reduces to a nested loop join.
9. The relations do not have to be of roughly equal
size.
- Aspect ratio: how many tuples are retrieved from
  each base relation per sampling step, e.g. β1 = 1,
  β2 = 3.
- Ripple join adjusts the aspect ratio according to
  the sizes of the base relations.
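A toy sketch of the sampling pattern only (a cross join with no predicate, to show how the sampled rectangle grows); the aspect-ratio values are hypothetical:

```python
def rectangular_ripple(R, S, beta1, beta2, steps):
    """Per step, retrieve beta1 new tuples from R and beta2 from S."""
    seen_r, seen_s, out = [], [], []
    idx_r = idx_s = 0
    for _ in range(steps):
        new_r, idx_r = R[idx_r:idx_r + beta1], idx_r + beta1
        new_s, idx_s = S[idx_s:idx_s + beta2], idx_s + beta2
        for r in new_r:                         # new R-tuples meet all seen S-tuples
            out.extend((r, s) for s in seen_s)
        seen_r.extend(new_r)
        for s in new_s:                         # new S-tuples meet all seen R-tuples,
            out.extend((r, s) for r in seen_r)  # including the ones just added
        seen_s.extend(new_s)
    return out

# After k steps the output covers a (k*beta1) x (k*beta2) sample rectangle:
pairs = rectangular_ripple(list(range(10)), list(range(10)), 1, 3, 2)
```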
10. Rectangular version
11. What can the end user control?
- how many groups continue to process
- Any one group can be stopped;
- all other groups will then continue to process
  (faster).
- the speed of the query processing
- What happens to make the process faster?
- More tuples are skipped in the aggregation, so
  the approximation will be less accurate, and the
  confidence interval will be wider.
- The end user controls the trade-off between speed
  and accuracy.
12. GUI, 1999
13. Confidence interval
- A running confidence interval displays how close
  this answer is to the final result.
- This could be calculated in many ways.
- The authors present an example calculation built
  on extending the Central Limit Theorem.
14. Central Limit Theorem
- μ̂n is an estimator for the true mean μ
- the average of the n values in the sample
- a random quantity
- CLT: for large n
- (e.g. after joining 30 tuples),
- μ̂n has approximately a normal distribution with
  mean μ and variance σ²/n
15. Random variable Z
- Shift and scale μ̂n to get a "standardized"
  random variable Z:
- Z = (μ̂n - μ) / (σ/√n)
- Z has a standard normal distribution.
- There are many ways to compute the z_p values.
16. "Interval" column on the GUI
- The authors use σ̂n as an estimator for the true
  standard deviation σ
- εn = (z_p · σ̂n) / √n
- This quantity is displayed as the running
  half-width of the confidence interval.
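The running estimate and half-width can be computed from the sample alone. A sketch assuming a fixed z_p of 1.96 for roughly 95% confidence (in general z_p depends on the confidence level the user requests):

```python
import math

def running_ci(samples, z_p=1.96):
    """Running estimate with CLT half-width: eps_n = z_p * sigma_hat_n / sqrt(n).
    z_p = 1.96 targets ~95% confidence (an assumed fixed value)."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample variance sigma_hat_n**2 (n - 1 denominator).
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, z_p * math.sqrt(var / n)

mean, eps = running_ci([1, 2, 3, 4, 5])
print(f"estimate {mean} ± {eps:.3f}")  # → estimate 3.0 ± 1.386
```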
17. Why call this "Ripple Join"?
- The algorithm seems to ripple out from a corner
  of the join.
- Also an acronym: "Rectangles of Increasing
  Perimeter Length"
18. Variants of ripple join
- Block ripple join
- Index ripple join
- Hash ripple join
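Of these, the hash ripple join avoids rescanning previously seen tuples by keeping one hash table per input, so each newly drawn tuple probes the other side's table. A minimal equi-join sketch; the key functions and equal-length inputs are simplifying assumptions:

```python
from collections import defaultdict

def hash_ripple_join(R, S, key_r, key_s):
    """Hash ripple join sketch: one hash table per input, no rescans."""
    hr, hs, out = defaultdict(list), defaultdict(list), []
    for r, s in zip(R, S):  # one tuple from each side per sampling step
        # New r probes S's table (all previously seen S-tuples).
        out.extend((r, s_old) for s_old in hs[key_r(r)])
        hr[key_r(r)].append(r)  # insert r before probing with s,
        # so the (r, s) pair from this step is emitted exactly once.
        out.extend((r_old, s) for r_old in hr[key_s(s)])
        hs[key_s(s)].append(s)
    return out

matches = hash_ripple_join([1, 2, 2], [2, 1, 3], lambda x: x, lambda x: x)
```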
19. Performance
20. Further publications
- Eddies: Continuously Adaptive Query Processing,
  by Ron Avnur and Joseph M. Hellerstein, SIGMOD
  2000, Dallas
- Confidence Bounds for Sampling-Based GROUP BY
  Estimates, by Fei Xu, Christopher Jermaine, and
  Alin Dobra, ACM Trans. Database Syst. 33, 3 (Aug.
  2008)
- Wavelet synopsis for hierarchical range queries
  with workloads, by Sudipto Guha, Hyoungmin Park,
  and Kyuseok Shim, VLDB Journal 17 (2008),
  1079-1099
21. Questions?