Title: Ripple Joins for Online Aggregation
1. Ripple Joins for Online Aggregation
- by Peter J. Haas and Joseph M. Hellerstein
- published in June 1999
- presented by Ronda Hilton
2. Overview
- This paper shows how to join multiple tables
- and compute SUM, COUNT, or AVG aggregates
- in GROUP BY queries,
- showing approximate results immediately,
- along with a confidence interval for those results,
- from the first few tuples retrieved,
- updating a GUI display with ever-closer approximations
  as the join processes more tuples.
3. Ripple joins compared to our previous topics
- General research area: algorithms
- another approximation algorithm
- online processing
- does not maintain a precomputed sample set
- aggregate queries: joins and GROUP BY
- requires random retrieval order
- uses probabilistic calculations to determine the
  quality of the approximate result, not for query
  optimization
- implemented as middleware on top of the DBMS
4. Traditional hash join stores the smaller relation
in memory
- Two relations R and S
- with a common attribute;
- for each distinct value of that attribute,
- match up the tuples that have the same value.
- Example
- select R.roomnumber,
-        COUNT(S.homeroom)
- from Rooms R
- join Student S on
-   R.roomnumber = S.homeroom
- group by R.roomnumber
- For each tuple r in R
- add r under hash(roomnumber) to the hash table in memory
- if the hash table has filled up memory
- for every tuple s in S
- if hash(homeroom) is found in the hash table
- add the matching tuples r and s to the output
- reset the hash table
- Finally, scan S one last time and add the resulting
join tuples to the output.
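A minimal in-memory sketch of the build-and-probe steps above, assuming the build side (Rooms) fits entirely in memory; the table contents are hypothetical:

```python
from collections import defaultdict

def hash_join(rooms, students):
    # Build phase: hash the smaller relation (Rooms) on the join attribute.
    table = defaultdict(list)
    for r in rooms:
        table[r["roomnumber"]].append(r)
    # Probe phase: scan Student, matching each tuple against the hash table.
    out = []
    for s in students:
        for r in table[s["homeroom"]]:
            out.append((r["roomnumber"], s["name"]))
    return out

rooms = [{"roomnumber": 101}, {"roomnumber": 102}]
students = [{"name": "Ann", "homeroom": 101},
            {"name": "Bo", "homeroom": 101},
            {"name": "Cy", "homeroom": 103}]
print(hash_join(rooms, students))  # → [(101, 'Ann'), (101, 'Bo')]
```

The memory-overflow handling described above (probe, reset, repeat) is omitted here to keep the build/probe structure visible.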
5. What's different about ripple join?
- Traditional hash join blocks until the entire
  query output is finished; in its inner loop, an
  entire table is scanned.
- Ripple join reports approximate results after
  each sampling step, and allows user intervention.
- Ripple join expands the sample set incrementally.
6. The most important difference
- The tuples are processed in random order.
7. Pipelining
- In pipelining join algorithms, as the join
  progresses, more and more information gets added
  to the result.
- In ripple joins, each new tuple gets joined with
  all previously-seen tuples of the other
  operand(s).
- The relative rates of the two (or more) operands
  are dynamically adjusted.
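The middle bullet can be sketched as a "square" ripple join: each sampling step draws one new tuple from each input and joins it against everything previously seen on the other side. This is an illustrative sketch, not the paper's interface; the list inputs and generic predicate are assumptions:

```python
import random

def ripple_join(R, S, predicate, steps):
    """Square ripple join sketch: one new tuple per input per step."""
    R, S = list(R), list(S)
    random.shuffle(R)  # ripple join requires random retrieval order
    random.shuffle(S)
    seen_r, seen_s, out = [], [], []
    for n in range(steps):
        r, s = R[n], S[n]
        # New r joins all previously seen S-tuples...
        for s_old in seen_s:
            if predicate(r, s_old):
                out.append((r, s_old))
        seen_r.append(r)
        # ...and new s joins all seen R-tuples, including the new r
        # (the growing "corner" of the square).
        for r_old in seen_r:
            if predicate(r_old, s):
                out.append((r_old, s))
        seen_s.append(s)
    return out

# After n steps, out is exactly the join restricted to the n-by-n sample.
sample_pairs = ripple_join(range(100), range(100), lambda r, s: r == s, 10)
```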
8. Worst-case scenario
- Ripple join reduces to a nested loop join.
9. The relations do not have to be of roughly equal
size.
- Aspect ratio: how many tuples are retrieved from
  each base relation per sampling step, e.g. β1 = 1,
  β2 = 3.
- Ripple join adjusts the aspect ratio according to
  the sizes of the base relations.
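A toy sketch of the sampling pattern only (a cross join with no predicate, to show how the sampled rectangle grows); the aspect-ratio values are hypothetical:

```python
def rectangular_ripple(R, S, beta1, beta2, steps):
    """Per step, retrieve beta1 new tuples from R and beta2 from S."""
    seen_r, seen_s, out = [], [], []
    idx_r = idx_s = 0
    for _ in range(steps):
        new_r, idx_r = R[idx_r:idx_r + beta1], idx_r + beta1
        new_s, idx_s = S[idx_s:idx_s + beta2], idx_s + beta2
        for r in new_r:                         # new R-tuples meet all seen S-tuples
            out.extend((r, s) for s in seen_s)
        seen_r.extend(new_r)
        for s in new_s:                         # new S-tuples meet all seen R-tuples,
            out.extend((r, s) for r in seen_r)  # including the ones just added
        seen_s.extend(new_s)
    return out

# After k steps the output covers a (k*beta1) x (k*beta2) sample rectangle:
pairs = rectangular_ripple(list(range(10)), list(range(10)), 1, 3, 2)
```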
10. Rectangular version
11. What can the end user control?
- how many groups continue to process
- Any one group can be stopped;
- all other groups will then continue to process
  (faster).
- the speed of the query processing
- What happens to make the process faster?
- More tuples are skipped in the aggregation, so
  the approximation will be less accurate, and the
  confidence interval will be wider.
- The end user controls the trade-off between speed
  and accuracy.
12. GUI, 1999
13. Confidence interval
- A running confidence interval displays how close
  this answer is to the final result.
- This could be calculated in many ways.
- The authors present an example calculation built
  on extending the Central Limit Theorem.
14. Central Limit Theorem
- μ̂n is an estimator for the true mean μ
- the average of the n values in the sample
- a random quantity
- CLT: for large n
- (e.g. after joining 30 tuples),
- μ̂n has approximately a normal distribution with
  mean μ and variance σ²/n
15. Random variable Z
- Shift and scale μ̂n to get a "standardized"
  random variable Z:
- Z = (μ̂n - μ) / (σ/√n)
- Z has a standard normal distribution.
- There are many ways to compute the z_p values.
16. "Interval" column on the GUI
- The authors use σ̂n as an estimator for the true
  standard deviation σ
- εn = (z_p · σ̂n) / √n
- This quantity is displayed as the running
  half-width of the confidence interval.
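The running estimate and half-width can be computed from the sample alone. A sketch assuming a fixed z_p of 1.96 for roughly 95% confidence (in general z_p depends on the confidence level the user requests):

```python
import math

def running_ci(samples, z_p=1.96):
    """Running estimate with CLT half-width: eps_n = z_p * sigma_hat_n / sqrt(n).
    z_p = 1.96 targets ~95% confidence (an assumed fixed value)."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample variance sigma_hat_n**2 (n - 1 denominator).
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, z_p * math.sqrt(var / n)

mean, eps = running_ci([1, 2, 3, 4, 5])
print(f"estimate {mean} ± {eps:.3f}")  # → estimate 3.0 ± 1.386
```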
17. Why call this "Ripple Join"?
- The algorithm seems to ripple out from a corner
  of the join.
- Also an acronym: "Rectangles of Increasing
  Perimeter Length"
18. Variants of ripple join
- Block ripple join
- Index ripple join
- Hash ripple join
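Of these, the hash ripple join avoids rescanning previously seen tuples by keeping one hash table per input, so each newly drawn tuple probes the other side's table. A minimal equi-join sketch; the key functions and equal-length inputs are simplifying assumptions:

```python
from collections import defaultdict

def hash_ripple_join(R, S, key_r, key_s):
    """Hash ripple join sketch: one hash table per input, no rescans."""
    hr, hs, out = defaultdict(list), defaultdict(list), []
    for r, s in zip(R, S):  # one tuple from each side per sampling step
        # New r probes S's table (all previously seen S-tuples).
        out.extend((r, s_old) for s_old in hs[key_r(r)])
        hr[key_r(r)].append(r)  # insert r before probing with s,
        # so the (r, s) pair from this step is emitted exactly once.
        out.extend((r_old, s) for r_old in hr[key_s(s)])
        hs[key_s(s)].append(s)
    return out

matches = hash_ripple_join([1, 2, 2], [2, 1, 3], lambda x: x, lambda x: x)
```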
19. Performance
20. Further publications
- Eddies: Continuously Adaptive Query Processing,
  by Ron Avnur and Joseph M. Hellerstein, SIGMOD
  2000, Dallas
- Confidence Bounds for Sampling-Based GROUP BY
  Estimates, by Fei Xu, Christopher Jermaine, and
  Alin Dobra, ACM Trans. Database Syst. 33, 3 (Aug.
  2008)
- Wavelet synopsis for hierarchical range queries
  with workloads, by Sudipto Guha, Hyoungmin Park,
  and Kyuseok Shim, VLDB Journal 17 (2008),
  1079-1099
21. Questions?