Title: Online Query Processing: A Tutorial
1 Online Query Processing: A Tutorial
- Peter J. Haas
- IBM Almaden Research Center
- Joseph M. Hellerstein
- UC Berkeley
2 Goals for Today
- Exposure to online query processing algorithms and fundamentals
- Usage examples
- Basic sampling techniques and estimators
- Preferential data delivery
- Online join algorithms
- Relation to OLAP, etc.
- Some thoughts on research directions
- More resources to appear on the web
- Annotated bibliography
- Extended slide set
- Survey paper
3 Road Map
- Background and motivation
- Human-computer interaction
- Tech trends and prognostications
- Goals for online processing
- Examples of online techniques
- Underlying technology
- Related work
- Looking forward
4 Human-Computer Interaction
- Iterative querying with progressive refinement
- Real-time interaction (impatience!)
- Spreadsheets, WYSIWYG editors
- Modern statistics packages
- Netscape STOP button
- Visually-oriented interface
- Approximate results are usually OK
5 Disk Appetite
- Greg Papadopoulos, CTO, Sun
- "Moore's Law Ain't Good Enough" (Hot Chips '98)
6 The Latest Commercial Technology
7 Drawbacks of Current Technology
- Only exact answers are available
- A losing proposition as data volume grows
- Hardware improvements not sufficient
- Interactive systems fail on massive data
- E.g., spreadsheet programs (64K-row limit)
- DBMS not interactive
- No user feedback or control (back to the '60s)
- Long processing times
- Fundamental mismatch with preferred modes of HCI
- OLAP a partial solution
- Can't handle ad hoc queries or data sets
8 Goals for Online Processing
- New greedy performance regime
- Maximize 1st derivative of the mirth index
- Mirth defined on-the-fly
- Therefore need FEEDBACK and CONTROL
9 Road Map
- Background and Motivation
- Examples of Online Techniques
- Aggregation, visualization, cleaning/browsing
- Underlying technology
- Related work
- Looking Forward
10 Online Aggregation
- SELECT AVG(temp) FROM t GROUP BY site
- 330K rows in table
- the exact answer
11 Online Aggregation, cont'd
- A simple online aggregation interface (after 74 rows)
12 Online Aggregation, cont'd
13 Example Online Aggregation
- Additional features: speed up, slow down, terminate
14 Online Data Visualization
15 Online Enumeration
- Potter's Wheel, VLDB 2001
- Scalable spreadsheet
- A fraction of data is materialized in GUI widget
- Scrolling = preference for data delivery in a quantile
- Permits fuzzy querying
- Interactive data cleaning
- Online structure and discrepancy detection
- Online aggregation
16 Scalable Spreadsheets
17 Visual Transformation Shot
19 Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Building blocks: sampling, estimation
- Preferential data delivery
- Pipelined adaptive processing algorithms
- Related work
- Looking forward
20 Sampling Design Issues
- Granularity of sample
- Instance-level (row-level): high I/O cost
- Block-level (page-level): high variability from clustering
- Type of sample
- Often simple random sample (SRS)
- Especially for on-the-fly
- With/without replacement usually not critical
- Data structure from which to sample
- Files or relational tables
- Indexes (B+-trees, etc.)
21 Row-level Sampling Techniques
- Maintain file in random order
- Sampling = scan
- Is file initially in random order?
- Statistical tests needed, e.g., runs test, Smirnov test
- In DB systems: cluster via RAND function
- Must freshen ordering (online reorg)
- On-the-fly sampling
- Via index on random column
- Else get random page, then row within page
- Ex: extent-map sampling
- Problem: variable number of records on page
22 Acceptance/Rejection Sampling
- Accept row on page i with probability n_i / n_max
- Commonly used in other settings
- E.g., sampling from joins
- E.g., sampling from indexes
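The acceptance/rejection rule can be sketched in a few lines of Python. This is a toy model only: pages are in-memory lists and `ar_sample_row` / `n_max` are illustrative names, not the I/O-level implementation.

```python
import random

def ar_sample_row(pages, n_max):
    """Draw one row uniformly at random via acceptance/rejection:
    pick a page uniformly, accept it with probability n_i / n_max
    (n_i = rows on that page), then pick a row on the accepted page."""
    while True:
        page = random.choice(pages)               # uniform over pages
        if random.random() < len(page) / n_max:   # accept w.p. n_i / n_max
            return random.choice(page)            # uniform over rows on page
```

Rejection corrects for variable page occupancy: without it, rows on sparse pages would be over-represented.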
23 Cost of Row-Level Sampling
- 100,000 pages
- 200 rows/page
24 Estimation for Aggregates
- Point estimates
- Easy: SUM, COUNT, AVERAGE
- Hard: MAX, MIN, quantiles, distinct values
- Confidence intervals: a measure of precision
- Two cases: single-table and joins
25 Confidence Intervals
26 The Good and Bad News
- Good news: 1/n^(1/2) magic (n chosen on-the-fly)
- Bad news: needle-in-a-haystack problem
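The 1/n^(1/2) magic is the usual CLT-based interval; a minimal sketch, assuming a uniform random sample and z = 1.96 for roughly 95% coverage (`running_ci` is an illustrative name):

```python
import math
import statistics

def running_ci(sample, z=1.96):
    """Approximate 95% confidence interval for the mean of the full
    data set, computed from a uniform random sample of n rows."""
    n = len(sample)
    mean = statistics.fmean(sample)
    half = z * statistics.stdev(sample) / math.sqrt(n)  # shrinks as 1/sqrt(n)
    return mean - half, mean + half
```

Quadrupling n roughly halves the interval, which is why early feedback tightens quickly and then slows down.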
27 Sampling Deployed in Industry
- Simulated Bernoulli sampling
- SQL: SELECT ... WHERE RAND() < 0.01
- Similar capability in SAS
- Bernoulli sampling with pre-specified rate
- Informix, Oracle 8i, (DB2)
- Ex: SELECT * FROM T1 SAMPLE ROW(10), T2
- Ex: SELECT * FROM T1 SAMPLE BLOCK(10), T2
- Not for novices
- Need to pre-specify precision
- No feedback/control
- Recall the multiresolution patterns from example
- No estimators provided in current systems
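The simulated-Bernoulli trick, sketched in Python together with the kind of scale-up estimator that current systems omit. Both helper names are illustrative, not any vendor's API.

```python
import random

def bernoulli_sample(rows, p):
    """Simulated Bernoulli sampling: keep each row independently with
    probability p (the SELECT ... WHERE RAND() < p trick). Note that
    the sample size itself is random: Binomial(n, p)."""
    return [r for r in rows if random.random() < p]

def scale_up_sum(sample, p):
    """Unbiased (Horvitz-Thompson-style) estimate of the full-table
    SUM: each sampled row stands in for 1/p rows."""
    return sum(sample) / p
```

Without such an estimator (and an attached confidence interval), the user is left to interpret the raw sample alone.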
28 Precomputation Techniques
- Two components
- Data reduction (often expensive)
- Approximate reconstruction (quick)
- Pros and cons
- Efficiency vs flexibility
- Class of queries that can be handled
- Degree of precision
- Ease of implementation
- How much of system must be modified
- How sophisticated must developer be?
- More widely deployed in industry
- Will give overview later
29 Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Building blocks: sampling, estimation
- Preferential data delivery
- Pipelined adaptive processing algorithms
- Related technology: precomputation
- Looking forward
30 Preferential Data Delivery
- Why needed
- Speedup/slowdown arrows
- Spreadsheet scrollbars
- Pipeline quasi-sort
- Continuous re-optimization (eddies)
- Index stride
- High I/O costs, good for outliers
- Online Reordering (Juggle)
- Excellent in most cases, no index required
- VLDB '99, VLDBJ '00
31 Online Reordering
- Deliver interesting items first
- Interesting determined on the fly
- Exploit rate gap between produce and process/consume
33 Mechanism
- Two threads: prefetch from input; spool/enrich from auxiliary side disk
- Juggle data between buffer and side disk
- Keep buffer full of interesting items
- getNext chooses best item currently on buffer
- getNext, enrich/spool decisions based on reordering policy
- Side disk management
- Hash index, populated in a way that postpones random I/O
- Play both sides of sort/hash duality
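The juggle mechanism can be sketched as a toy class. Assumptions are loud here: the side disk is simulated by an in-memory dict, preferences are static per group, everything is synchronous (no threads), and all names (`Juggle`, `put`, `get_next`) are illustrative.

```python
from collections import defaultdict

class Juggle:
    """Toy reordering buffer: keep the most interesting items in a
    small in-memory buffer, spilling the rest to a side store (a
    stand-in for the hash-indexed side disk)."""

    def __init__(self, preference, buffer_size):
        self.pref = preference           # group -> user preference
        self.cap = buffer_size
        self.buf = []                    # in-memory (group, item) pairs
        self.side = defaultdict(list)    # simulated side disk, keyed by group

    def put(self, group, item):
        self.buf.append((group, item))
        if len(self.buf) > self.cap:     # spool the least interesting item
            worst = min(range(len(self.buf)),
                        key=lambda i: self.pref[self.buf[i][0]])
            g, it = self.buf.pop(worst)
            self.side[g].append(it)

    def get_next(self):
        """Deliver the most interesting buffered item, enriching the
        buffer from the side store's best group when it runs dry."""
        if not self.buf:
            for g in sorted(self.side, key=self.pref.get, reverse=True):
                if self.side[g]:
                    self.buf.append((g, self.side[g].pop()))
                    break
        if not self.buf:
            return None
        best = max(range(len(self.buf)),
                   key=lambda i: self.pref[self.buf[i][0]])
        return self.buf.pop(best)
```

In the real mechanism, produce and consume run in separate threads and the side disk layout postpones random I/O; the sketch only shows the buffer/spool/enrich logic.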
34 Policies
- Good permutation of items t_1 ... t_n to t_π(1) ... t_π(n)
- Quality of feedback for a prefix t_π(1), t_π(2), ..., t_π(k): QOF(UP(t_π(1)), UP(t_π(2)), ..., UP(t_π(k))), UP = user preference
- Determined by application
- Goodness of reordering: dQOF/dt
- Implication for juggle mechanism
- Process gets item from buffer that increases QOF the most
- Juggle tries to maintain buffer with such items
35 QOF in Online Aggregation
- Avg weighted confidence interval
- Preference acts as weight on confidence interval
- QOF = −Σ_i UP_i / n_i^(1/2), n_i = number of tuples processed from group i
- Process pulls items from group with max UP_i / n_i^(3/2)
- Desired ratio of group i in buffer: UP_i^(2/3) / Σ_j UP_j^(2/3)
- Juggle tries to maintain this by enrich/spool
- Similar derivations for other preferences
- E.g., explicit rates, explicit ranking, etc.
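Taking QOF = −Σ_i UP_i / n_i^(1/2) as above, the pull rule and the target buffer ratios can be sketched directly (illustrative helper names; `up` maps groups to preferences, `n` to tuples processed so far):

```python
def next_group(up, n):
    """Pull from the group whose next tuple raises QOF the most,
    i.e. the group maximizing UP_i / n_i^(3/2)."""
    return max(up, key=lambda g: up[g] / n[g] ** 1.5)

def desired_buffer_ratios(up):
    """Target fraction of the buffer per group:
    UP_i^(2/3) / sum_j UP_j^(2/3)."""
    w = {g: p ** (2 / 3) for g, p in up.items()}
    total = sum(w.values())
    return {g: wg / total for g, wg in w.items()}
```

Note the sublinear exponents: an 8x preference for a group earns it only a 4x share of the buffer, since its confidence interval also shrinks as it is served.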
36 Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Building blocks: sampling, estimation, pre-computation
- Preferential data delivery
- Pipelined adaptive processing algorithms
- Related work
- Looking forward
37 Pipelined Data Processing
- Never, ever wait for anything to finish
- Selection: no problem
- Grouping: hash, don't sort
- Sorting: juggle if possible
- Joins?
- Sample of joins vs. join of samples
SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
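Hash-based grouping without sorting can be sketched as a generator that emits running group averages after every tuple, a toy stand-in for the online aggregation operator (names are illustrative):

```python
from collections import defaultdict

def online_group_avg(stream):
    """Hash, don't sort: keep running per-group sums and counts and
    yield a snapshot of all group averages after each input tuple,
    so there is no blocking sort step before output."""
    count = defaultdict(int)
    total = defaultdict(float)
    for group, value in stream:
        count[group] += 1
        total[group] += value
        yield {g: total[g] / count[g] for g in count}
```

A real interface would refresh estimates (and confidence intervals) every few hundred tuples rather than on every one.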
38 Traditional Nested Loops
39 Ripple Joins
- Designed for online performance goals
- Completely pipelined
- Adapt to data characteristics
- Simplest version:
- Read new tuples s from S and r from R
- Join r and s
- Join r with old S tuples
- Join s with old R tuples
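The simplest (square) version above can be sketched as a Python generator, assuming both inputs are same-length in-memory sequences; real ripple joins read from disk and adapt the aspect ratio:

```python
def ripple_join(R, S, predicate):
    """Square ripple join: at each step, fetch one new tuple from each
    input, join the new pair, then join each new tuple against all
    previously seen tuples of the other input, streaming out matches."""
    seen_r, seen_s = [], []
    for r, s in zip(R, S):
        if predicate(r, s):              # new r with new s
            yield (r, s)
        for old_s in seen_s:             # new r with old S tuples
            if predicate(r, old_s):
                yield (r, old_s)
        for old_r in seen_r:             # new s with old R tuples
            if predicate(old_r, s):
                yield (old_r, s)
        seen_r.append(r)
        seen_s.append(s)
```

After k steps the k x k prefix of the cross product has been covered, so aggregate estimates can be updated continuously instead of waiting for either input to finish.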
40 Basic Ripple Join
41 Block Ripple Joins (Size 2)
42 Rectangular Ripple Join
43 Ripple Joins, cont'd
- Variants:
- Block: minimizes I/O in alternating nested loops
- Index: coincides with index-nested loop
- Hash: symmetric hash tables
- Adaptive aspect ratio
- User sets animation rate (via slider)
- System goal:
- Minimize CI length
- Subject to time constraint
- System solves optimization problem (approximately)
- Samples from higher-variance relation faster
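The last point can be caricatured in a few lines. This is only a crude stand-in for the actual optimization the system solves: it fetches the next tuple from whichever input currently shows higher sample variance in its aggregate column (the helper name is hypothetical).

```python
import statistics

def pick_next_input(r_vals, s_vals):
    """Crude aspect-ratio heuristic: sample faster from the input whose
    aggregate column shows higher running sample variance (each list
    needs at least two values)."""
    if statistics.variance(r_vals) >= statistics.variance(s_vals):
        return 'R'
    return 'S'
```

The real system balances variance reduction against the per-tuple cost of each input, under the user's animation-rate constraint.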
44 Ripple Joins, cont'd
- Prototypes in Informix, IBM DB2
- Ongoing work on scalability issues
- Memory compaction technique
- Parallelism
- Graceful degradation to out-of-core hashing
- A la Tukwila, XJoin, but sensitive to statistical issues
- Nested queries
- Optimization issues
- A number of API and other systems issues
- DMKD journal paper on Informix implementation
- Forthcoming paper on sampling in DB2
45 Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Related work
- Online query processing
- Precomputation
- Looking forward
46 Related Work on Online QP
- Morgenstein's PhD, Berkeley '80
- Online Association Rules
- Ng et al.'s CAP, SIGMOD '98
- Hidber's CARMA, SIGMOD '99
- Implications for deductive DB semantics
- Monotone aggregation in LDL, Zaniolo and Wang
- Online agg with subqueries
- Tan et al., VLDB '99
- Dynamic Pipeline Scheduling
- Urhan/Franklin, VLDB '01
- Pipelining Hash Joins
- Raschid, Wilschut/Apers, Tukwila, XJoin
- Relation to semi-naive evaluation
- Anytime Algorithms
- Zilberstein, Russell, et al.
47 Precomputation: Explicit
- OLAP Data Cubes (drill-down hierarchies)
- MOLAP, ROLAP, HOLAP
- Semantic hierarchies
- APPROXIMATE (Vrbsky, et al.)
- Query Relaxation, e.g., CoBase
- Multiresolution Data Models (Silberschatz/Reed/Fussell)
- More general materialized views
- See Gupta/Mumick's text
48 Precomputation: Stat. Summaries
- Histograms
- Originally for aggregation queries, many flavors
- Extended to enumeration queries recently
- Multi-dimensional histograms
- Parametric estimation
- Wavelets and fractals
- Discrete cosine transform
- Regression
- Curve fitting and splines
- Singular-Value Decomposition (aka LSI, PCA)
- Indexes: hierarchical histograms
- Ranking and pseudo-ranking
- Aoki's use of GiSTs as estimators for ADTs
- Data Mining
- Clustering, classification, other multidimensional models
49 Precomputed Samples
- Materialized sample views
- Olken's original work
- Chaudhuri et al.: join samples
- Statistical inferences complicated over recycled samples?
- Barbará's quasi-cubes
- AQUA join synopses on universal relation
- Maintenance issues
- AQUA's backing samples
- Can use fancier/more efficient sampling techniques
- Stratified sampling or AQUA's congressional samples
- Haas and Swami AFV statistics
- Combine precomputed outliers with on-the-fly samples
50 Stratified Sampling
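A minimal sketch of stratified sampling with proportional allocation, assuming strata are given as in-memory lists (the helper name and the at-least-one-row rule are illustrative choices, not a specific system's policy):

```python
import random

def stratified_sample(strata, total_n):
    """Draw an independent SRS from each stratum, sized by the
    stratum's share of the population (at least one row each), so
    small groups are never missed entirely -- unlike a plain SRS,
    which may leave a rare group with zero sampled rows."""
    pop = sum(len(rows) for rows in strata.values())
    out = {}
    for name, rows in strata.items():
        k = max(1, round(total_n * len(rows) / pop))
        out[name] = random.sample(rows, min(k, len(rows)))
    return out
```

This is the idea behind congressional samples: guarantee every group in a GROUP BY query some representation.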
51 Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Related Work
- Looking forward
- Adaptive systems
- Human-centered systems
52 Looking Forward: Adaptive Systems
- Feedback loop: observe environment, make decision, act
- Observation/Decision = Modeling/Prediction
- Usually statistical
- Already critically important in today's systems
- And imagine how important in ubiquitous computing!
53 A DBMS Tradition
- One instance: System R optimization
- Observe: Runstats
- Decide: query optimization
- Act: query processing
- A powerful aspect of our technologies
- Data independence, declarative languages
- Yet quite coarse-grained
- Runstats once per day/week
- Actions only per-query
- Disk resource management: index and matview selection
- Memory resource management: buffers and sort/hash space
- Concurrency management: admission control
54 Built-in Adaptivity
- Info systems should have adaptivity as a basic goal
- Not just best-case performance
- Needs to pervade system
- Core architectural work to be done here
- E.g., pipelining required for multi-operator adaptivity
- Observe more than one thing at a time
- E.g., adaptive operators (a la ripple join)
- E.g., adaptive optimization architectures (a la Eddies)
- E.g., unify query processing with database design
- Adaptivity should be built-in, not bolted-on
- Wizards to turn existing knobs
- Less helpful
- Certainly less elegant
- Might be technically more difficult!
55 Looking Forward: Human-Centered Systems
- Annual plea for UI work in DB Directions Workshops
- UIs perceived as soft, hard to measure/publish
- Yet people use our systems
- And arguably we are trying to make them better for people
- Problem: our performance metrics
- Mirth index vs. wall-clock time
- One can find reasonable hard metrics for mirth
- Many of these metrics may be statistical
- Also consider a woe index, e.g., in maintainability
- Most of these indices have to do with user time
- Not, e.g., resource utilization
- Good UI work need not require good UIs!
- Can attack new metrics directly
- We don't have to go back to art school
56 Lessons Learned
- Dream about UIs, work on systems
- User needs drive systems design!
- Systems and statistics intertwine
- All 3 go together naturally
- User desires and behavior: 2 more things to model, predict
- Performance metrics need to reflect key user needs
- "What unlike things must meet and mate ..."
- -- "Art", Herman Melville
57 More?
- Annotated bibliography slides soon
- http://control.cs.berkeley.edu/sigmod01/