Congressional Samples for Approximate Answering of GroupBy Queries

About This Presentation

Slides: 19

Provided by: dbxla

Learn more at: https://crystal.uta.edu


Transcript and Presenter's Notes

Title: Congressional Samples for Approximate Answering of GroupBy Queries

1
Congressional Samples for Approximate Answering
of Group-By Queries

Swarup Acharya
Phillip B. Gibbons
Viswanath Poosala
Presented By Daniel Kuang

2
Outline

Problems with Group-By queries
Congressional sampling
Rewriting
Performance
Conclusion

3
Problems with Group-By Queries

Decision support queries routinely segment the
data into groups.
For example, a group-by query on the U.S. census
database could be used to determine the per
capita income per state. However, there can be a
huge discrepancy in the sizes of different
groups, e.g., the state of California has nearly
70 times the population of Wyoming.
As a result, a uniform random sample of the
relation will contain disproportionately fewer
tuples from the smaller groups, which leads to
poor accuracy for answers on those groups because
accuracy is highly dependent on the number of
sample tuples that belong to that group.
The standard error is inversely proportional to √n
for a uniform sample,
where n is the size of the uniform random sample.
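
The 1/√n relationship above can be made concrete with a small sketch. It assumes the standard error of a group's estimate scales as 1/√n in the number of sample tuples from that group, and that per-tuple variance is the same in both groups. The function name and the tuple counts (100 and 7000) are hypothetical, chosen only to mirror the 70-to-1 California/Wyoming ratio.

```python
import math

def relative_standard_error(n_small: int, n_large: int) -> float:
    """Factor by which the smaller group's standard error exceeds the
    larger group's, assuming SE is proportional to 1/sqrt(n)."""
    return math.sqrt(n_large / n_small)

# Hypothetical uniform sample: the large group contributes 70x as many
# sample tuples as the small group.
ratio = relative_standard_error(n_small=100, n_large=7000)
print(round(ratio, 2))
```

Under these assumptions the small group's standard error is about 8.4 times that of the large group, even though both come from the same uniform sample.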

4
Solution (Congressional Sampling)

Consider the U.S. Congress, which is a hybrid of the
House and the Senate.
The House has representatives from each state in
proportion to its population.
The Senate has an equal number of representatives
from each state.
Apply the House and Senate approaches to
representing the different groups:
House sample: a uniform random sample of the
relation, so each group is represented in
proportion to its size.
Senate sample: an equal number of tuples sampled
from each group.

5
Solution (Congressional Sampling)

Consider a relation R with two grouping
attributes, A and B.
Number of tuples in each group:
(a1, b1): 3000, (a1, b2): 3000, (a1, b3): 1500,
(a2, b3): 2500
Basic Congress (sample size 100)
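
The allocation for this example can be sketched in a few lines. The sketch assumes that Basic Congress gives each group the larger of its House and Senate allocations and then scales all allocations down so they sum to the sample size. That rule is an assumption here, since the slide's own table did not survive in the transcript. The function names are hypothetical.

```python
def house(group_sizes, x):
    """House: space in proportion to each group's size."""
    total = sum(group_sizes.values())
    return {g: x * n / total for g, n in group_sizes.items()}

def senate(group_sizes, x):
    """Senate: equal space for every group."""
    m = len(group_sizes)
    return {g: x / m for g in group_sizes}

def basic_congress(group_sizes, x):
    """Assumed rule: per-group max of House and Senate, scaled to sum to x."""
    h = house(group_sizes, x)
    s = senate(group_sizes, x)
    best = {g: max(h[g], s[g]) for g in group_sizes}
    scale = x / sum(best.values())
    return {g: v * scale for g, v in best.items()}

# Group sizes from the example above; sample size 100.
sizes = {("a1", "b1"): 3000, ("a1", "b2"): 3000,
         ("a1", "b3"): 1500, ("a2", "b3"): 2500}
alloc = basic_congress(sizes, 100)
for g, v in alloc.items():
    print(g, round(v, 1))
```

Under this assumption the House allocation is 30, 30, 15, 25, the Senate allocation is 25 for every group, and the scaled result is roughly 27.3, 27.3, 22.7, 22.7.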

6
Solution (Congressional Sampling)

7
Congressional Sampling

Basic congress sample size allocated to each
group
Congress sample size allocated to each group

8
Rewriting

Query rewriting involves two key steps:
a) scaling up the aggregate expressions, and
b) deriving error bounds on the estimate.
Let the ScaleFactor of a tuple be the inverse of
the sampling rate for its stratum.
How to associate each tuple with its ScaleFactor:
a) store the ScaleFactor (SF) with each tuple
in the sample relation, or
b) use a separate table to store the
ScaleFactors for the groups.
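
The scaling step can be sketched for a SUM aggregate under both storage options. This is an illustration only, not the rewritten SQL from the slides. The sample tuples, group names, ScaleFactor values, and function names are all hypothetical.

```python
from collections import defaultdict

# Option (a): every sample tuple carries its own ScaleFactor.
sample_with_sf = [
    {"group": "g1", "value": 10.0, "sf": 50.0},
    {"group": "g1", "value": 20.0, "sf": 50.0},
    {"group": "g2", "value": 5.0, "sf": 4.0},
]

def estimate_sum_integrated(sample):
    """Scale each value by the ScaleFactor stored on its tuple."""
    est = defaultdict(float)
    for t in sample:
        est[t["group"]] += t["value"] * t["sf"]
    return dict(est)

# Option (b): ScaleFactors live in a separate per-group table.
sf_table = {"g1": 50.0, "g2": 4.0}
sample_plain = [("g1", 10.0), ("g1", 20.0), ("g2", 5.0)]

def estimate_sum_normalized(sample, sf_by_group):
    """Look up each group's ScaleFactor, as a join with the table would."""
    est = defaultdict(float)
    for group, value in sample:
        est[group] += value * sf_by_group[group]
    return dict(est)

print(estimate_sum_integrated(sample_with_sf))
print(estimate_sum_normalized(sample_plain, sf_table))
```

Both options give the same estimates. Option (a) spends extra space on every sample tuple, while option (b) stores each ScaleFactor once per group at the cost of a lookup (a join, in SQL terms) at query time.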

Relation Rel with two example tuples

9
Rewriting (Integrated Rewriting)

10
Normalized Rewriting

11
Key-normalized Rewriting

12
Nested-integrated Rewriting

13
Performance