HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries

About This Presentation

Title:

HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries

Description:

HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries. Xiaohui Yu ... minimal overhead at selectivity estimation time. Cons: ... – PowerPoint PPT presentation

Number of Views:37

Avg rating:3.0/5.0

Slides: 32

Provided by: xiaoh1

Category:

more less

Transcript and Presenter's Notes

Title: HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries

1
HASE A Hybrid Approach to Selectivity Estimation
for Conjunctive Queries

Xiaohui Yu
University of Toronto
xhyu_at_cs.toronto.edu
Joint work with Nick Koudas (University of
Toronto) and Calisto Zuzarte (IBM Toronto Lab)

2
Outline

Background
Motivation
Related Work
HASE
Estimator
Algorithms
Bounds
Experiments
Conclusions

3
Query Optimization

Execution plans differ in costs
Difference can be huge (1 sec vs. 1 hour)
Which Plan to Choose??
Query Optimization
Estimate the costs of different plans
Choose the plan with the least cost
Cost Estimation
Factors run-time environments, data properties,

4
Selectivity

Important factor in costing selectivity
Fraction of records satisfying the predicate (s)
E.g., 100 out of 10,000 records having salary gt
3000 s 100/10000 0.01
Selectivity can make a big difference

Plan 2 Index scan
cost
Cost s const2
Plan 1 Table scan
Cost const1
Selectivity (s)
s 0.01
5
Related Work

Two streams
Synopsis-based
Sampling-based
Synopses
Capture the characteristics of data
Obtained off-line, used on-line
E.g., Histograms

6
Histograms
1700
1000
1500
800
2500
3500
5000
6000
Salary
Q Selectivity of salarygt3000?
7
Synopses pros and cons

Pros
Built offline can be used many times
minimal overhead at selectivity estimation time
Cons
Difficult to capture all useful information in a
limited space
Correlation between attributes

8
Sampling
Number of records in the table 10,000 Sample
size 100 Number of records having age gt 50 and
salary gt 5500 12 Selectivity estimate 12/100
0.12 True selectivity 0.09
9
Sampling pros and cons

The good
Provides correlation info through the sample
The bad
Cost, cost
Accurate results require a large portion of the
data to be accessed
Random access is much slower than sequential
access

10
Summary
Take the best of both worlds?
Capture correlation reduce sampling rate
11
Outline

Background
Motivation
Related Work
Our approach HASE
Estimator
Algorithms
Bounds
Experiments
Conclusions

12
HASE

Hybrid approach to selectivity estimation

Goal Consistent utilization of both sources of
information

Benefits
Correlation is captured (sampling)
Sample size can be significantly smaller
(histograms)

13
Problem setting
Data
Table of size N
Conjuncts of predicates Q P1P2P3
(agegt50)(salarygt5500)(hire_dategt01-01-05) P1
P2
P3
Query

Selectivities of individual predicates (obtained
from synopses) s1 0.1, s2 0.2, s3 0.05
A Sample S of n records Inclusion probability
of record j ?j For simple random sampling
(SRS) ?j n/N

Available info
Goal
Estimate the selectivity s of the query Q
14
Example
Table R with 10,000 records Query Q P1P2 on
two attributes Suppose 500 records satisfy both
predicates True Selectivity s 500/10000 0.05
15
Histogram-based estimate
Assuming independence between attributes Selectivi
ty estimate
Based on the histograms, s1 0.6, s2
0.3 Relative error 0.18 0.05 /0.05
260
16
Sampling-based estimate
Sample weight of j dj 1/ ?j
Indicator variable
Selectivity Estimate (HT estimator)
Take a SRS of size 100 ? dj 10000/100 100 9
records satisfy Q Estimate 9100/10000
0.09 Relative error 0.05 0.09 / 0.05
80
17
A new estimator
Original weights
Known selectivities (through histograms) s1, s2,

New weights
Calibration estimator
wj (1) reproduce known selectivities of
individual predicates (2) as close to dj as
possible
18
Consistency with known selectivities
Observed frequencies from sample
100 sample records from 10,000 records in the
table ? dj 100
s1 0.6
19
Calibration estimator
Why do we want wj to be as close as dj as
possible?
dj have the property of producing unbiased
estimates
Keep wj as close to dj as possible
wj remain nearly unbiased
20
Constrained optimization problem
Distance function D(x) (x wj /dj )
(As close to dj as possible)
w.r.t. wj
Minimize
Subject to
(reproduce known selectivities)
Yes 1
j satisfies Pi?
No 0
21
An algorithm based on Newtons method
Method of Lagrange multipliers Minimize

w.r.t.
where
Can be solved using Newtons method via an
iterative procedure. ? ? wj ?
22
An alternative algorithm
23
Example
Observed frequencies from sample
24
Distance measures

Requirements on the distance function(1) D is
positive and strictly convex(2) D(1) D(1)
0(3) D(1) 1
Linear function
only one iteration required ? fast!
wj lt 0 possible ? negative estimates
Multiplicative function
Converges after a few iterations (typically two)
wj gt 0 always

25
Error bounds

Probabilistic bounds

Pr ( both j and l are in the sample )
26
Experiments

Synthetic data
Skew Zipfian distribution (z0,1,2,3)
Correlation corr. coef. between attributes 0,
1
Real data
Census-Income data from UCI KDD Archive
Population surveys by the US Census Bureau.
200,000 records, 40 attributes
Queries
Range queries attributelt constant
Equality queries attribute constant

27
Effect of correlation
28
Effect of data skew
29
Effect of sample rate
30
Effect of number of attributes
31
Conclusions
Selectivity Estimation
Synopsis-based estimation
Sampling-based estimation
HASE