Online Query Processing: A Tutorial (transcript and presenter's notes)

1
Online Query Processing: A Tutorial
  • Peter J. Haas
  • IBM Almaden Research Center
  • Joseph M. Hellerstein
  • UC Berkeley

2
Goals for Today
  • Exposure to online query processing algorithms
    and fundamentals
  • Usage examples
  • Basic sampling techniques and estimators
  • Preferential data delivery
  • Online join algorithms
  • Relation to OLAP, etc.
  • Some thoughts on research directions
  • More resources to appear on the web
  • Annotated bibliography
  • Extended slide set
  • Survey paper

3
Road Map
  • Background and motivation
  • Human-computer interaction
  • Tech trends and prognostications
  • Goals for online processing
  • Examples of online techniques
  • Underlying technology
  • Related work
  • Looking forward

4
Human-Computer Interaction
  • Iterative querying with progressive refinement
  • Real-time interaction (impatience!)
  • Spreadsheets, WYSIWYG editors
  • Modern statistics packages
  • Netscape STOP button
  • Visually-oriented interface
  • Approximate results are usually OK

5
Disk Appetite
  • Greg Papadopoulos, CTO Sun
  • "Moore's Law Ain't Good Enough" (Hot Chips 98)

6
The Latest Commercial Technology
7
Drawbacks of Current Technology
  • Only exact answers are available
  • A losing proposition as data volume grows
  • Hardware improvements not sufficient
  • Interactive systems fail on massive data
  • E.g., spreadsheet programs (64K-row limit)
  • DBMS not interactive
  • No user feedback or control (back to the 60s)
  • Long processing times
  • Fundamental mismatch with preferred modes of HCI
  • OLAP: a partial solution
  • Can't handle ad hoc queries or data sets

8
Goals for Online Processing
  • New greedy performance regime
  • Maximize 1st derivative of the mirth index
  • Mirth defined on-the-fly
  • Therefore need FEEDBACK and CONTROL

[Chart: mirth index (%) vs. time]
9
Road Map
  • Background and Motivation
  • Examples of Online Techniques
  • Aggregation, visualization, cleaning/browsing
  • Underlying technology
  • Related work
  • Looking Forward

10
Online Aggregation
  • SELECT AVG(temp) FROM t GROUP BY site
  • 330K rows in table
  • the exact answer

11
Online Aggregation, cont'd
  • A simple online aggregation interface (after 74
    rows)

12
Online Aggregation, cont'd
  • After 834 rows

13
Example Online Aggregation
Additional features: Speed up, Slow down, Terminate
14
Online Data Visualization
  • In Tioga DataSplash

15
Online Enumeration
  • Potter's Wheel [VLDB 2001]
  • Scalable spreadsheet
  • A fraction of data is materialized in GUI widget
  • Scrolling expresses a preference for data
    delivery in a quantile
  • Permits fuzzy querying
  • Interactive data cleaning
  • Online structure and discrepancy detection
  • Online aggregation

16
Scalable Spreadsheets
17
Visual Transformation Shot
18
(No Transcript)
19
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related work
  • Looking forward

20
Sampling Design Issues
  • Granularity of sample
  • Instance-level (row-level): high I/O cost
  • Block-level (page-level): high variability from
    clustering
  • Type of sample
  • Often simple random sample (SRS)
  • Especially for on-the-fly
  • With/without replacement usually not critical
  • Data structure from which to sample
  • Files or relational tables
  • Indexes (B-trees, etc.)

21
Row-level Sampling Techniques
  • Maintain file in random order
  • Sampling scan
  • Is file initially in random order?
  • Statistical tests needed, e.g., runs test,
    Smirnov test
  • In DB systems: cluster via RAND function
  • Must freshen ordering (online reorg)
  • On-the-fly sampling
  • Via index on random column
  • Else get random page, then row within page
  • Ex: extent-map sampling
  • Problem: variable number of records per page

22
Acceptance/Rejection Sampling
  • Accept a row on page i with probability n_i / n_max
    (n_i = rows on page i, n_max = max rows per page)
  • Commonly used in other settings
  • E.g. sampling from joins
  • E.g. sampling from indexes

[Diagram: original pages vs. modified pages]
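
A minimal sketch of the acceptance/rejection step above, assuming in-memory
pages and a hypothetical n_max of 200 rows per page; rejecting short pages
proportionally is what keeps every row equally likely despite the variable
number of rows per page.

    import random

    N_MAX = 200  # assumed maximum number of rows that fit on one page

    def sample_one_row(pages):
        """Return one uniformly random row from a file of variable-size pages."""
        while True:
            page = random.choice(pages)          # pick a page uniformly at random
            n_i = len(page)                      # rows actually on this page
            if random.random() < n_i / N_MAX:    # accept with probability n_i / n_max
                return random.choice(page)       # then pick a row uniformly within the page
            # otherwise reject and redraw; the rejection step is what makes rows uniform

    # usage: each page is a list of (site, temp) rows
    pages = [[("s1", 20.5)] * 150, [("s2", 18.0)] * 200, [("s3", 25.1)] * 90]
    print(sample_one_row(pages))
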
23
Cost of Row-Level Sampling
  • 100,000 pages
  • 200 rows/page

24
Estimation for Aggregates
  • Point estimates
  • Easy: SUM, COUNT, AVERAGE
  • Hard: MAX, MIN, quantiles, distinct values
  • Confidence intervals: a measure of precision
  • Two cases: single-table and joins

25
Confidence Intervals
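
To make the single-table case concrete, here is a small illustrative sketch
(not from the slides) of a running AVG with a large-sample (CLT) confidence
interval; it ignores finite-population and without-replacement corrections,
and the data are synthetic.

    import math, random

    def online_avg(stream, z=1.96):
        """Yield (running AVG estimate, 95% CI half-width) after each sampled value."""
        n, mean, m2 = 0, 0.0, 0.0            # Welford's running mean and squared deviations
        for x in stream:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
            if n > 1:
                var = m2 / (n - 1)           # sample variance
                yield mean, z * math.sqrt(var / n)   # half-width shrinks like 1/sqrt(n)

    # usage: watch the interval tighten as more rows are sampled
    data = [random.gauss(70.0, 5.0) for _ in range(330_000)]
    for n, (est, half) in enumerate(online_avg(data), start=2):
        if n in (100, 1_000, 10_000, 100_000):
            print(f"after {n} rows: AVG = {est:.2f} +/- {half:.2f}")
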
26
The Good and Bad News
  • Good news: 1/√n magic (n chosen on-the-fly)
  • Bad news: needle-in-a-haystack problem

27
Sampling Deployed in Industry
  • Simulated Bernoulli sampling
  • SQL: SELECT ... WHERE RAND() < 0.01
  • Similar capability in SAS
  • Bernoulli Sampling with pre-specified rate
  • Informix, Oracle 8i, (DB2)
  • Ex: SELECT ... FROM T1 SAMPLE ROW(10), T2
  • Ex: SELECT ... FROM T1 SAMPLE BLOCK(10), T2
  • Not for novices
  • Need to pre-specify precision
  • No feedback/control
  • Recall the multiresolution patterns from the
    example
  • No estimators provided in current systems

28
Precomputation Techniques
  • Two components
  • Data reduction (often expensive)
  • Approximate reconstruction (quick)
  • Pros and cons
  • Efficiency vs flexibility
  • Class of queries that can be handled
  • Degree of precision
  • Ease of implementation
  • How much of the system must be modified?
  • How sophisticated must the developer be?
  • More widely deployed in industry
  • Will give overview later

29
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related technology: precomputation
  • Looking forward

30
Preferential Data Delivery
  • Why needed
  • Speedup/slowdown arrows
  • Spreadsheet scrollbars
  • Pipeline quasi-sort
  • Continuous re-optimization (eddies)
  • Index stride
  • High I/O costs, good for outliers
  • Online Reordering (Juggle)
  • Excellent in most cases, no index required
  • VLDB 99, VLDBJ 00

31
Online Reordering
  • Deliver interesting items first
  • Interesting determined on the fly
  • Exploit rate gap between produce and
    process/consume

32
Online Reordering
  • Deliver interesting items first
  • Interesting determined on the fly
  • Exploit rate gap between produce and
    process/consume

33
Mechanism
  • Two threads: prefetch from input;
    spool/enrich from an auxiliary side disk
  • Juggle data between buffer and side disk
  • Keep the buffer full of interesting items
  • getNext chooses the best item currently in the buffer
  • getNext and enrich/spool decisions are based on
    the reordering policy
  • Side disk management
  • hash index, populated in a way that postpones
    random I/O
  • play both sides of sort/hash duality
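
A single-threaded toy sketch of the juggle idea, with plain dictionaries
standing in for the buffer and the hash-indexed side disk; the real mechanism
uses two threads and spools to disk, but the shape is the same: put() spools
when the buffer is full, and get_next() serves the highest-preference buffered
item and refills the freed slot from the side area.

    from collections import defaultdict, deque

    class Juggle:
        """Toy online-reordering buffer: prefer items from high-preference groups."""
        def __init__(self, buffer_size, preference):
            self.buffer = defaultdict(deque)   # group -> buffered items
            self.side = defaultdict(deque)     # group -> spooled ("side disk") items
            self.size, self.cap = 0, buffer_size
            self.pref = preference             # group -> user preference (e.g. scrollbar)

        def put(self, group, item):
            if self.size < self.cap:
                self.buffer[group].append(item); self.size += 1
            else:
                self.side[group].append(item)  # spool: buffer is full

        def get_next(self):
            # deliver from the buffered group with the highest preference
            group = max((g for g in self.buffer if self.buffer[g]),
                        key=lambda g: self.pref.get(g, 0.0), default=None)
            if group is None:
                return None
            item = self.buffer[group].popleft(); self.size -= 1
            # enrich: refill the freed slot from the side area, best group first
            refill = max((g for g in self.side if self.side[g]),
                         key=lambda g: self.pref.get(g, 0.0), default=None)
            if refill is not None:
                self.buffer[refill].append(self.side[refill].popleft()); self.size += 1
            return group, item

    # usage: items from the preferred group "A" are delivered first
    j = Juggle(buffer_size=2, preference={"A": 2.0, "B": 1.0})
    for g, x in [("B", 1), ("A", 2), ("B", 3), ("A", 4)]:
        j.put(g, x)
    print([j.get_next() for _ in range(4)])   # [('A', 2), ('A', 4), ('B', 1), ('B', 3)]
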

34
Policies
[Plot: QOF vs. time]
  • Good permutation of items t1 … tn to tπ(1) … tπ(n)
  • Quality of feedback for a prefix tπ(1), tπ(2), …, tπ(k):
    QOF(UP(tπ(1)), UP(tπ(2)), …, UP(tπ(k))), UP = user preference
  • Determined by the application
  • Goodness of reordering: dQOF/dt
  • Implication for the juggle mechanism:
  • Process gets the item from the buffer that increases QOF
    the most
  • Juggle tries to maintain the buffer with such items

35
QOF in Online Aggregation
  • Avg weighted confidence interval
  • Preference acts as a weight on the confidence interval
  • QOF = Σ_i UP_i / √n_i, where n_i = number of tuples
    processed from group i
  • Process pulls items from the group with max UP_i / n_i^(3/2)
  • Desired ratio of group i in the buffer: UP_i^(2/3) / Σ_j UP_j^(2/3)
  • Juggle tries to maintain this by enrich/spool
  • Similar derivations for other preferences
  • e.g. explicit rates, explicit ranking, etc.
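
A small sketch of this policy, assuming per-group preferences UP and processed
counts n are maintained elsewhere: the consumer pulls from the group that
maximizes UP_i / n_i^(3/2), and enrichment aims at the buffer ratios
UP_i^(2/3) / Σ_j UP_j^(2/3).

    def next_group(UP, n):
        """Group whose next tuple most increases QOF = Σ_i UP_i / sqrt(n_i)."""
        return max(UP, key=lambda i: UP[i] / (n[i] ** 1.5 if n[i] else 1e-9))

    def desired_buffer_ratios(UP):
        """Target fraction of the buffer for each group: UP_i^(2/3) / Σ_j UP_j^(2/3)."""
        weights = {i: up ** (2.0 / 3.0) for i, up in UP.items()}
        total = sum(weights.values())
        return {i: w / total for i, w in weights.items()}

    # usage with three groups; the user cares most about group "A"
    UP = {"A": 4.0, "B": 1.0, "C": 1.0}
    n = {"A": 120, "B": 300, "C": 310}
    print(next_group(UP, n))            # "A": it has the largest UP_i / n_i^(3/2)
    print(desired_buffer_ratios(UP))    # roughly {"A": 0.56, "B": 0.22, "C": 0.22}
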

36
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation,
    pre-computation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related work
  • Looking forward

37
Pipelined Data Processing
  • Never, ever wait for anything to finish
  • Selection: no problem
  • Grouping: hash, don't sort
  • Sorting: juggle if possible
  • Joins?
  • Sample of joins vs. join of samples

SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
38
Traditional Nested Loops
[Animation: nested-loops join of R and S tuples; all of S is rescanned for each R tuple]
39
Ripple Joins
  • [SIGMOD 99]
  • Designed for online performance goals
  • Completely pipelined
  • Adapt to data characteristics
  • Simplest version (sketched in code below):
  • Read new tuples s from S and r from R
  • Join r and s
  • Join r with old S tuples
  • Join s with old R tuples
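
A minimal sketch of the simplest (square) ripple join just described, with
in-memory lists standing in for the sampled tuple streams; results are emitted
continuously instead of after either input finishes, so a running
AVG(R.a * S.b) could be maintained over the pairs as they appear.

    def ripple_join(R, S, pred):
        """Square ripple join: each step joins the new r and s against all old tuples."""
        seen_r, seen_s = [], []
        for r, s in zip(R, S):              # draw one new tuple from each side
            if pred(r, s):                  # new r against new s
                yield r, s
            for old_s in seen_s:            # new r against old S tuples
                if pred(r, old_s):
                    yield r, old_s
            for old_r in seen_r:            # new s against old R tuples
                if pred(old_r, s):
                    yield old_r, s
            seen_r.append(r)
            seen_s.append(s)

    # usage: R tuples are (a, c), S tuples are (b, c); join on matching c
    R = [(2, 1), (5, 3), (4, 7), (1, 2)]
    S = [(10, 1), (3, 5), (6, 3), (8, 1)]
    for r, s in ripple_join(R, S, lambda r, s: r[1] == s[1]):
        print("R.a * S.b =", r[0] * s[0])
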

40
Basic Ripple Join
41
Block Ripple Joins (Size 2)
42
Rectangular Ripple Join
43
Ripple Joins, cont'd
  • Variants
  • Block: minimizes I/O in alternating nested loops
  • Index: coincides with index nested loops
  • Hash: symmetric hash tables
  • Adaptive aspect ratio
  • User sets animation rate (via slider)
  • System goal: minimize CI length subject to a
    time constraint
  • System solves optimization problem
    (approximately)
  • Samples the higher-variance relation faster (a
    simplified heuristic is sketched below)
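
A toy heuristic in the spirit of the adaptive aspect ratio, not the paper's
actual optimization: within a per-step tuple budget, give more draws to the
relation whose estimated variance contribution is currently larger. The
function and its inputs are illustrative assumptions.

    def per_step_draws(var_contrib, budget):
        """Split a per-step tuple budget across relations, more to higher variance.
        A simplified heuristic, not the exact optimization solved in the paper."""
        total = sum(var_contrib.values()) or 1.0
        raw = {rel: budget * v / total for rel, v in var_contrib.items()}
        return {rel: max(1, round(x)) for rel, x in raw.items()}

    # usage: R currently looks much noisier than S, so it gets most of the budget
    print(per_step_draws({"R": 9.0, "S": 1.0}, budget=10))   # {'R': 9, 'S': 1}
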

44
Ripple Joins, cont'd
  • Prototypes in Informix, IBM DB2
  • Ongoing work on scalability issues
  • Memory compaction technique
  • Parallelism
  • Graceful degradation to out-of-core hashing
  • a la Tukwila, XJoin, but sensitive to statistical
    issues
  • Nested queries
  • Optimization issues
  • A number of API and other systems issues
  • DMKD journal paper on Informix implementation
  • Forthcoming paper on sampling in DB2

45
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Related work
  • Online query processing
  • Precomputation
  • Looking forward

46
Related Work on Online QP
  • Morgenstein's PhD, Berkeley '80
  • Online Association Rules
  • Ng et al.'s CAP, SIGMOD 98
  • Hidber's CARMA, SIGMOD 99
  • Implications for deductive DB semantics
  • Monotone aggregation in LDL, Zaniolo and Wang
  • Online agg with subqueries
  • Tan, et al. VLDB 99
  • Dynamic Pipeline Scheduling
  • Urhan/Franklin VLDB 01
  • Pipelining Hash Joins
  • Raschid, Wilschut/Apers, Tukwila, Xjoin
  • Relation to semi-naive evaluation
  • Anytime Algorithms
  • Zilberstein, Russell, et al.

47
Precomputation Explicit
  • OLAP Data Cubes (drill-down hierarchies)
  • MOLAP, ROLAP, HOLAP
  • Semantic hierarchies
  • APPROXIMATE (Vrbsky, et al.)
  • Query Relaxation, e.g. CoBase
  • Multiresolution Data Models
    (Silberschatz/Reed/Fussell)
  • More general materialized views
  • See Gupta/Mumick's text

48
Precomputation Stat. Summaries
  • Histograms
  • Originally for aggregation queries, many flavors
  • Extended to enumeration queries recently
  • Multi-dimensional histograms
  • Parametric estimation
  • Wavelets and Fractals
  • Discrete cosine transform
  • Regression
  • Curve fitting and splines
  • Singular-Value Decomposition (aka LSI, PCA)
  • Indexes: hierarchical histograms
  • Ranking and pseudo-ranking
  • Aoki's use of GiSTs as estimators for ADTs
  • Data Mining
  • Clustering, classification, other
    multidimensional models

49
Precomputed Samples
  • Materialized sample views
  • Olken's original work
  • Chaudhuri et al.'s join samples
  • Statistical inferences complicated over
    recycled samples?
  • Barbará's quasi-cubes
  • AQUA join synopses on universal relation
  • Maintenance issues
  • AQUA's backing samples
  • Can use fancier/more efficient sampling
    techniques
  • Stratified sampling or AQUA's congressional
    samples
  • Haas and Swami's AFV statistics
  • Combine precomputed outliers with on-the-fly
    samples

50
Stratified Sampling
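
Slide 50's figure is not reproduced in the transcript; as a generic
illustration of the idea (standard technique, not tied to any system named
above), the sketch below estimates an overall mean by sampling each stratum
separately and weighting by stratum size, which tightens the estimate when
strata are internally homogeneous.

    import random
    from statistics import mean

    def stratified_mean(strata, n_per_stratum):
        """Estimate the overall mean from per-stratum simple random samples,
        weighting each stratum's sample mean by its share of the population."""
        total = sum(len(v) for v in strata.values())
        return sum((len(v) / total) * mean(random.sample(v, min(n_per_stratum, len(v))))
                   for v in strata.values())

    # usage: three strata of different sizes and means
    strata = {"small":  [random.gauss(10, 1) for _ in range(1000)],
              "medium": [random.gauss(50, 2) for _ in range(3000)],
              "large":  [random.gauss(90, 3) for _ in range(6000)]}
    print(round(stratified_mean(strata, 50), 2))   # near 0.1*10 + 0.3*50 + 0.6*90 = 70
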
51
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Related Work
  • Looking forward
  • Adaptive systems
  • Human-centered systems

52
Looking Forward Adaptive Systems
[Diagram: cycle of Observe Environment → Make Decision → Act]
  • Observation/Decision ⇒ Modeling/Prediction
  • Usually statistical
  • Already critically important in today's systems
  • And imagine how important in ubiquitous computing!

53
A DBMS Tradition
  • One instance: System R optimization
  • Observe: Runstats
  • Decide: Query Optimization
  • Act: Query Processing
  • A powerful aspect of our technologies
  • Data independence, declarative languages
  • Yet quite coarse-grained
  • Runstats: once per day/week
  • Actions: only per-query
  • Disk resource management: index and matview
    selection
  • Memory resource management: buffers and sort/hash
    space
  • Concurrency management: admission control

54
Built-in adaptivity
  • Info systems should have adaptivity as a basic
    goal
  • Not just best-case performance
  • Needs to pervade system
  • Core architectural work to be done here
  • E.g. pipelining required for multi-operator
    adaptivity
  • Observe more than one thing at a time
  • E.g. adaptive operators (a la ripple join)
  • E.g. adaptive optimization architectures (a la
    Eddies)
  • E.g. unify query processing with database design
  • Adaptivity should be built-in, not bolted-on
  • Wizards to turn existing knobs
  • Less helpful
  • Certainly less elegant
  • Might be technically more difficult!

55
Looking Forward: Human-Centered Systems
  • Annual plea for UI work in DB Directions
    Workshops
  • UIs perceived as soft, hard to measure/publish
  • Yet people use our systems
  • And arguably we are trying to make them better
    for people
  • Problem: our performance metrics
  • Mirth index vs. wall-clock time
  • One can find reasonable hard metrics for mirth
  • Many of these metrics may be statistical
  • Also consider a "woe index", e.g. in
    maintainability
  • Most of these indices have to do with user time
  • Not, e.g., resource utilization
  • Good UI work need not require good UIs!
  • Can attack new metrics directly
  • We don't have to go back to art school

56
Lessons Learned
  • Dream about UIs, work on systems
  • User needs drive systems design!
  • Systems and statistics intertwine
  • All 3 go together naturally
  • User desires and behavior: 2 more things to
    model and predict
  • Performance metrics need to reflect key user
    needs
  • What unlike things must meet and mate
  • -- Art, Herman Melville

57
More?
  • Annotated bibliography and slides soon
  • http://control.cs.berkeley.edu/sigmod01/