Online Aggregation: Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang

Slides: 18
Provided by: mohanrathi
Learn more at: https://crystal.uta.edu

1
Online Aggregation
Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang
  • Presented by
  • Archana Vijayalakshmanan

2
Contents
  • Introduction
  • Example
  • Advantages
  • Requirements
  • Approaches to building a system
  • System issues
  • Conclusion

3
Online Aggregation Motivation
  • SELECT AVG(grade) FROM ENROLL
  • A fancy interface

4
A Better Approach
  • Don't process in batch! Use online aggregation

5
Example
  • SELECT AVG(grade) FROM ENROLL GROUP BY major

6
Advantages
  • stopping condition set on the fly!
  • statistical techniques are more sophisticated
  • can handle GROUP BY w/o a priori knowledge

7
Requirements
  • Usability
  • Continuous output
  • non-blocking query plans
  • time/precision control
  • fairness/partiality
  • Performance
  • time to accuracy
  • time to completion
  • pacing

8
A Naive Approach
  • SELECT running_avg(final_grade),
  • running_confidence(final_grade),
  • running_interval(final_grade) FROM grades
  • No grouping
  • Can't meet performance and usability needs
  • no guarantee of continuous output
  • no guarantee of fairness (or control over
    partiality)
  • no control over pacing
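
To make the idea of a running average with a running interval concrete, here is a minimal Python sketch. It assumes the values arrive in random order and that every value is known to lie in a fixed range (for grades, say 0 to 4), and it uses Hoeffding's inequality to produce a conservative interval half-width. The function name running_avg_hoeffding and its parameters are hypothetical and are not taken from the slides.

```python
import math


def running_avg_hoeffding(values, lower, upper, confidence=0.95):
    """Yield (n, running_avg, half_width) after each value.

    Assumes the values arrive in random order and every value lies in
    [lower, upper]. The half-width comes from Hoeffding's inequality, so
    it is conservative (usually wider than a CLT-based interval).
    """
    total = 0.0
    n = 0
    log_term = math.log(2.0 / (1.0 - confidence))
    for x in values:
        n += 1
        total += x
        half_width = (upper - lower) * math.sqrt(log_term / (2.0 * n))
        yield n, total / n, half_width
```

Each yielded triple is an estimate the interface could display immediately; the half-width shrinks in proportion to 1/sqrt(n) as more tuples are read.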

9
Random Access to Data
  • Heap Scan
  • OK if clustering is uncorrelated with the aggregation
    and grouping attributes
  • Index Scan
  • can scan an index on attributes uncorrelated with the
    aggregation or grouping attributes
  • Sampling from indices
  • could introduce new sampling access methods
    (e.g. Olken's work)

10
Group By Distinct
  • Can't sort!
  • sorting blocks
  • sorting is unfair
  • Must use hash-based techniques
  • a non-blocking approach, but it does not scale
    gracefully.
  • Hybrid Hashing.
  • Hybrid Cache even better.
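
As a rough illustration of why hashing is non-blocking, the Python sketch below keeps one small running state per group in a dictionary and emits an updated estimate after every input row. It is an in-memory simplification: it does not model the spill-to-disk behavior that hybrid hashing or hybrid cache address, and the function name is hypothetical.

```python
def hash_group_running_avg(rows):
    """Yield (group, count, running_avg) as each (group, value) row arrives."""
    state = {}  # group -> [count, total]
    for group, value in rows:
        entry = state.setdefault(group, [0, 0.0])
        entry[0] += 1
        entry[1] += value
        yield group, entry[0], entry[1] / entry[0]
```

Unlike a sort-based plan, the first output appears after the first row, and groups are updated in whatever order their rows happen to arrive.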

11
Index Striding
  • For fair Group By
  • read tuples in round-robin fashion.
  • (want random tuple from Group 1, random tuple
    from Group 2, ...)
  • each group is updated at appropriate rate.
  • gives info/speed match!
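
The round-robin idea can be sketched in Python as follows. Here each group is represented by an ordinary iterable standing in for a per-group index range scan; that representation and the name index_stride are assumptions made for illustration only.

```python
def index_stride(group_cursors):
    """Yield (group, value) by visiting each group's cursor in round-robin order.

    group_cursors maps a group key to an iterable of that group's values,
    standing in for one index range scan per group.
    """
    active = {group: iter(cursor) for group, cursor in group_cursors.items()}
    while active:
        for group in list(active):
            try:
                value = next(active[group])
            except StopIteration:
                del active[group]  # this group is exhausted
                continue
            yield group, value
```

With equal turns per group, a small group receives updates as often as a large one, instead of in proportion to its share of the table.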

12
Join Algorithms
  • Non-Blocking Joins
  • no sorting!
  • merge join OK, but watch for the sorted output
  • hybrid hash not great
  • symmetric pipeline hash
  • nested loops always works, but can be too slow
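
A symmetric pipeline hash join can be sketched in a few lines of Python. This is an in-memory illustration under the assumption that both inputs arrive as one interleaved stream tagged "L" or "R"; the function and parameter names are hypothetical.

```python
from collections import defaultdict


def symmetric_hash_join(stream, left_key, right_key):
    """Yield joined (left_row, right_row) pairs as tuples arrive.

    stream yields ("L", row) or ("R", row) in any interleaving. Each row
    is inserted into its own side's hash table and immediately probed
    against the other side's table, so no input is consumed in full
    before output starts.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in stream:
        if side == "L":
            key = left_key(row)
            left_table[key].append(row)
            for match in right_table.get(key, []):
                yield row, match
        else:
            key = right_key(row)
            right_table[key].append(row)
            for match in left_table.get(key, []):
                yield match, row
```

The price of never blocking is that both hash tables must be kept, which is why memory is the main concern for this join.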

13
Query Optimization
  • Avoid sorting
  • Blocking sub-operations
  • 2 components in cost function
  • dead time (t_d): time spent doing invisible
    work -- tax this at a high rate!
  • output time (t_o): time spent producing output
  • Prefer plans that maximize user control,
    e.g., index striding
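
One way to picture a two-component cost function is the Python sketch below. The linear form and the default tax of 10 are purely illustrative assumptions; the slides say only that dead time should be taxed at a high rate, not what the actual function is.

```python
def plan_cost(dead_time, output_time, dead_time_tax=10.0):
    """Score a plan; lower is better. dead_time_tax > 1 penalizes invisible work."""
    return dead_time_tax * dead_time + output_time


def pick_plan(plans, dead_time_tax=10.0):
    """Return the name of the cheapest plan from {name: (dead_time, output_time)}."""
    return min(plans, key=lambda name: plan_cost(*plans[name], dead_time_tax))
```

For example, with the made-up timings {"sort_then_aggregate": (8.0, 2.0), "hash_aggregate": (1.0, 12.0)}, the taxed cost favors the hash plan even though its total running time is longer.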

14
Extended Aggregate Functions
  • Basically, aggregate functions must provide
    running estimates
  • SUM, COUNT: straightforward
  • VAR, STD DEV: algorithms
  • return confidence intervals
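
As a sketch of what a running aggregate might maintain, the Python class below tracks COUNT, SUM, and AVG directly and uses Welford's single-pass update for VAR and STD DEV. The interval is a large-sample (CLT-based) approximation that assumes tuples arrive in random order. The class and method names are hypothetical, and Welford's method is a choice made here, not necessarily the algorithm the authors used.

```python
import math
from statistics import NormalDist


class RunningStats:
    """Running COUNT, SUM, AVG, VAR, and STD DEV over values seen so far."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; needs at least two values."""
        return self._m2 / (self.count - 1) if self.count > 1 else math.nan

    def std_dev(self):
        return math.sqrt(self.variance()) if self.count > 1 else math.nan

    def avg_interval(self, confidence=0.95):
        """Large-sample (CLT) interval for the mean as (low, high)."""
        if self.count < 2:
            return (-math.inf, math.inf)
        z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
        half = z * math.sqrt(self.variance() / self.count)
        return (self.mean - half, self.mean + half)
```

Calling update once per tuple keeps every statistic current, so an estimate and its interval can be returned at any moment without rescanning the data.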

15
API
  • Current API uses built-in methods
  • e.g., stopGroup(cursor, groupval)
  • speedUpGroup(cursor, groupval)
  • slowDownGroup(cursor, groupval)
  • setSkipFactor(cursor name, integer)
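
To show how per-group controls like these could interact with round-robin delivery, here is a hypothetical Python stand-in. It is not the system's actual API: the class StridingCursor, its snake_case methods, and the rule that each group gets an integer number of tuples per round are all assumptions made for this sketch.

```python
class StridingCursor:
    """Hypothetical cursor: weighted round-robin over per-group iterators."""

    def __init__(self, group_cursors):
        self._iters = {g: iter(c) for g, c in group_cursors.items()}
        self._weights = {g: 1 for g in self._iters}  # tuples per round

    def stop_group(self, group):
        self._iters.pop(group, None)
        self._weights.pop(group, None)

    def speed_up_group(self, group):
        if group in self._weights:
            self._weights[group] += 1

    def slow_down_group(self, group):
        if group in self._weights:
            self._weights[group] = max(1, self._weights[group] - 1)

    def next_round(self):
        """Return the (group, value) pairs fetched in one round."""
        batch = []
        for group in list(self._iters):
            for _ in range(self._weights[group]):
                try:
                    batch.append((group, next(self._iters[group])))
                except StopIteration:
                    self.stop_group(group)  # exhausted groups drop out
                    break
        return batch
```

Speeding up a group raises its share of each round, so its estimate tightens faster while the other groups still make progress.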

16
Future Work
  • Better UI
  • online data visualization (Tioga DataSplash)
  • data viz graphical aggregate
  • drill down and roll up facilities
  • Nested Queries
  • Control w/o Indices
  • Checkpointing/continuation
  • Tracking online queries
  • Extensions of statistical results

17
References
  • control.cs.berkeley.edu/online/olamd/olamd.PPT