Congressional Samples for Approximate Answering of Group-By Queries

Slides: 19
Provided by: dbxla
Learn more at: https://crystal.uta.edu
1
Congressional Samples for Approximate Answering
of Group-By Queries
  • Swarup Acharya
  • Phillip B. Gibbons
  • Viswanath Poosala
  • Presented By Daniel Kuang

2
Outline
  • Problems with Group-By queries
  • Congressional sampling
  • Rewriting
  • Performance
  • Conclusion

3
Problems with Group-By Queries
  • Decision support queries routinely segment the
    data into groups.
  • For example, a group-by query on the U.S. census
    database could be used to determine the per
    capita income per state. However, there can be a
    huge discrepancy in the sizes of different
    groups, e.g., the state of California has nearly
    70 times the population of Wyoming.
  • As a result, a uniform random sample of the
    relation will contain disproportionately fewer
    tuples from the smaller groups. This leads to
    poor accuracy for answers on those groups, because
    the accuracy for a group depends heavily on the
    number of sample tuples that belong to it.
  • For a uniform sample, the standard error is
    inversely proportional to √n,
  • where n is the size of the uniform random sample.
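
As a rough illustration of the point above, the following sketch compares two hypothetical groups whose sizes differ by a factor of 70. Only the 70:1 ratio comes from the slide; the absolute group sizes, the 1% sampling rate, and the function names are assumptions chosen for illustration. It takes the standard error to be proportional to 1/√n, as stated above.

```python
import math

# Hypothetical group sizes: only the 70:1 ratio comes from the slide.
group_sizes = {"large": 700_000, "small": 10_000}
sample_percent = 1  # assumed uniform sampling rate, in percent


def expected_sample_sizes(sizes, percent):
    """Expected number of tuples a uniform sample draws from each group."""
    return {g: n * percent / 100 for g, n in sizes.items()}


def error_ratio(n_large, n_small):
    """How many times larger the small group's standard error is,
    assuming standard error is proportional to 1 / sqrt(n)."""
    return math.sqrt(n_large / n_small)


expected = expected_sample_sizes(group_sizes, sample_percent)
ratio = error_ratio(expected["large"], expected["small"])
print(expected)          # tuples per group in the uniform sample
print(round(ratio, 2))   # relative standard error of the small group
```

Under these assumptions the small group receives 70 times fewer sample tuples, so its standard error is larger by a factor of √70, which is a bit more than 8.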

4
Solution (Congressional Sampling)
  • Consider the U.S. Congress, which is a hybrid of
    the House and the Senate.
  • The House has representatives from each state in
    proportion to its population.
  • The Senate has an equal number of representatives
    from each state.
  • Apply the House and Senate scenarios to
    represent the different groups.
  • House sample: a uniform random sample of the
    relation, so each group is represented in
    proportion to its size.
  • Senate sample: an equal number of tuples
    sampled from each group.

5
Solution (Congressional Sampling)
  • Consider a relation R with two grouping
    attributes, A and B.
  • Number of tuples in each group:
  • (a1, b1): 3000, (a1, b2): 3000, (a1, b3): 1500,
    (a2, b3): 2500
  • Basic Congress (sample size 100)
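
The slide's figure did not survive in this transcript, so here is a minimal sketch of the allocation for the example above. It assumes that Basic Congress gives each group the larger of its House and Senate allocations and then scales all allocations down so they sum to the sample size. The function names are hypothetical, and the sketch ignores the case where a group has fewer tuples than its allocation.

```python
def house(group_sizes, sample_size):
    """Allocate sample slots in proportion to group size."""
    total = sum(group_sizes.values())
    return {g: sample_size * n / total for g, n in group_sizes.items()}


def senate(group_sizes, sample_size):
    """Allocate the same number of sample slots to every group."""
    share = sample_size / len(group_sizes)
    return {g: share for g in group_sizes}


def basic_congress(group_sizes, sample_size):
    """Per group, take the larger of the House and Senate allocations,
    then scale down so the total equals sample_size (assumed definition)."""
    h = house(group_sizes, sample_size)
    s = senate(group_sizes, sample_size)
    best = {g: max(h[g], s[g]) for g in group_sizes}
    scale = sample_size / sum(best.values())
    return {g: a * scale for g, a in best.items()}


# Group sizes from the slide's example relation R.
groups = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,
}

house_alloc = house(groups, 100)
senate_alloc = senate(groups, 100)
basic_alloc = basic_congress(groups, 100)
for g in groups:
    print(g, house_alloc[g], senate_alloc[g], round(basic_alloc[g], 1))
```

For this example the House allocation is 30, 30, 15, 25 and the Senate allocation is 25 per group. Taking the per-group maximum gives 30, 30, 25, 25 (110 in total), which scales down to roughly 27.3, 27.3, 22.7, 22.7.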

6
Solution (Congressional Sampling)
7
Congressional Sampling
  • Basic congress sample size allocated to each
    group
  • Congress sample size allocated to each group
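
The allocation formulas on this slide were lost in extraction. The sketch below is one reading of the Congress allocation, stated here as an assumption: for every subset T of the grouping attributes, split the sample equally among the groups formed by T, divide each share among the finer groups in proportion to their sizes, take the per-group maximum over all T, and scale down to the sample size. The empty subset reproduces the House allocation and the full set reproduces the Senate allocation. The function name `congress` is hypothetical.

```python
from itertools import combinations


def congress(group_sizes, sample_size):
    """Assumed Congress allocation over all subsets of grouping attributes.

    group_sizes maps a tuple of attribute values (the finest groups)
    to the number of tuples in that group.
    """
    groups = list(group_sizes)
    num_attrs = len(groups[0])
    best = {g: 0.0 for g in groups}

    for r in range(num_attrs + 1):
        for attrs in combinations(range(num_attrs), r):
            # Sizes of the coarser groups obtained by grouping on attrs only.
            coarse = {}
            for g, n in group_sizes.items():
                key = tuple(g[i] for i in attrs)
                coarse[key] = coarse.get(key, 0) + n
            # Equal share per coarse group, split proportionally inside it.
            share = sample_size / len(coarse)
            for g, n in group_sizes.items():
                key = tuple(g[i] for i in attrs)
                best[g] = max(best[g], share * n / coarse[key])

    # Scale the per-group maxima down to the available sample size.
    scale = sample_size / sum(best.values())
    return {g: a * scale for g, a in best.items()}


# Group sizes from the earlier example relation R (attributes A and B).
groups = {
    ("a1", "b1"): 3000,
    ("a1", "b2"): 3000,
    ("a1", "b3"): 1500,
    ("a2", "b3"): 2500,
}
congress_alloc = congress(groups, 100)
for g, a in congress_alloc.items():
    print(g, round(a, 1))
```

In this example, grouping on A alone gives (a2, b3) a share of 50, because it is the only group with A = a2, so its final allocation is noticeably larger than under Basic Congress.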

8
Rewriting
  • Query rewriting involves two key steps:
  • a) scaling up the aggregate expressions, and
  • b) deriving error bounds on the estimate.
  • The ScaleFactor of a tuple is the inverse of the
    sampling rate for its stratum.
  • How to associate each tuple with its ScaleFactor:
  • a) store the ScaleFactor (SF) with each tuple
    in the sample relation, or
  • b) use a separate table to store the
    ScaleFactors for the groups.
Relation Rel with two example tuples
9
Rewriting (Integrated Rewriting)
10
Normalized Rewriting
11
Key-normalized Rewriting
12
Nested-integrated Rewriting
13
Performance
  • Three queries
  • Grouping on returnflag, linestatus, shipdate
  • Skewed group sizes (z = 1.5)
  • Sample percentage: 7%

14
Performance
15
Performance
16
Performance
17
Performance
Times taken for different sample percentages
(actual query time: 40 sec)
18
Conclusions
  • Congressional samples are effective for group-by
    queries with arbitrary group-bys (including none)