Title: HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries
1HASE A Hybrid Approach to Selectivity Estimation
for Conjunctive Queries
- Xiaohui Yu
- University of Toronto
- xhyu_at_cs.toronto.edu
- Joint work with Nick Koudas (University of
Toronto) and Calisto Zuzarte (IBM Toronto Lab)
2Outline
- Background
- Motivation
- Related Work
- HASE
- Estimator
- Algorithms
- Bounds
- Experiments
- Conclusions
3Query Optimization
- Execution plans differ in costs
- Difference can be huge (1 sec vs. 1 hour)
- Which Plan to Choose??
- Query Optimization
- Estimate the costs of different plans
- Choose the plan with the least cost
- Cost Estimation
- Factors run-time environments, data properties,
4Selectivity
- Important factor in costing selectivity
- Fraction of records satisfying the predicate (s)
- E.g., 100 out of 10,000 records having salary gt
3000 s 100/10000 0.01 - Selectivity can make a big difference
Plan 2 Index scan
cost
Cost s const2
Plan 1 Table scan
Cost const1
Selectivity (s)
s 0.01
5Related Work
- Two streams
- Synopsis-based
- Sampling-based
- Synopses
- Capture the characteristics of data
- Obtained off-line, used on-line
- E.g., Histograms
6Histograms
1700
1000
1500
800
2500
3500
5000
6000
Salary
Q Selectivity of salarygt3000?
7Synopses pros and cons
- Pros
- Built offline can be used many times
- minimal overhead at selectivity estimation time
- Cons
- Difficult to capture all useful information in a
limited space - Correlation between attributes
8Sampling
Number of records in the table 10,000 Sample
size 100 Number of records having age gt 50 and
salary gt 5500 12 Selectivity estimate 12/100
0.12 True selectivity 0.09
9Sampling pros and cons
- The good
- Provides correlation info through the sample
- The bad
- Cost, cost
- Accurate results require a large portion of the
data to be accessed - Random access is much slower than sequential
access
10Summary
Take the best of both worlds?
Capture correlation reduce sampling rate
11Outline
- Background
- Motivation
- Related Work
- Our approach HASE
- Estimator
- Algorithms
- Bounds
- Experiments
- Conclusions
12HASE
- Hybrid approach to selectivity estimation
Goal Consistent utilization of both sources of
information
- Benefits
- Correlation is captured (sampling)
- Sample size can be significantly smaller
(histograms)
13Problem setting
Data
Table of size N
Conjuncts of predicates Q P1P2P3
(agegt50)(salarygt5500)(hire_dategt01-01-05) P1
P2
P3
Query
- Selectivities of individual predicates (obtained
from synopses) s1 0.1, s2 0.2, s3 0.05 - A Sample S of n records Inclusion probability
of record j ?j For simple random sampling
(SRS) ?j n/N
Available info
Goal
Estimate the selectivity s of the query Q
14Example
Table R with 10,000 records Query Q P1P2 on
two attributes Suppose 500 records satisfy both
predicates True Selectivity s 500/10000 0.05
15Histogram-based estimate
Assuming independence between attributes Selectivi
ty estimate
Based on the histograms, s1 0.6, s2
0.3 Relative error 0.18 0.05 /0.05
260
16Sampling-based estimate
Sample weight of j dj 1/ ?j
Indicator variable
Selectivity Estimate (HT estimator)
Take a SRS of size 100 ? dj 10000/100 100 9
records satisfy Q Estimate 9100/10000
0.09 Relative error 0.05 0.09 / 0.05
80
17A new estimator
Original weights
Known selectivities (through histograms) s1, s2,
New weights
Calibration estimator
wj (1) reproduce known selectivities of
individual predicates (2) as close to dj as
possible
18Consistency with known selectivities
Observed frequencies from sample
100 sample records from 10,000 records in the
table ? dj 100
s1 0.6
19Calibration estimator
Why do we want wj to be as close as dj as
possible?
dj have the property of producing unbiased
estimates
Keep wj as close to dj as possible
wj remain nearly unbiased
20Constrained optimization problem
Distance function D(x) (x wj /dj )
(As close to dj as possible)
w.r.t. wj
Minimize
Subject to
(reproduce known selectivities)
Yes 1
j satisfies Pi?
No 0
21An algorithm based on Newtons method
Method of Lagrange multipliers Minimize
w.r.t.
where
Can be solved using Newtons method via an
iterative procedure. ? ? wj ?
22An alternative algorithm
23Example
Observed frequencies from sample
24Distance measures
- Requirements on the distance function(1) D is
positive and strictly convex(2) D(1) D(1)
0(3) D(1) 1 - Linear function
- only one iteration required ? fast!
- wj lt 0 possible ? negative estimates
- Multiplicative function
- Converges after a few iterations (typically two)
- wj gt 0 always
25Error bounds
Pr ( both j and l are in the sample )
26Experiments
- Synthetic data
- Skew Zipfian distribution (z0,1,2,3)
- Correlation corr. coef. between attributes 0,
1 - Real data
- Census-Income data from UCI KDD Archive
- Population surveys by the US Census Bureau.
- 200,000 records, 40 attributes
- Queries
- Range queries attributelt constant
- Equality queries attribute constant
27Effect of correlation
28Effect of data skew
29Effect of sample rate
30Effect of number of attributes
31Conclusions
Selectivity Estimation
Synopsis-based estimation
Sampling-based estimation
HASE
- The calibrated estimator
- Algorithms
- Probabilistic bounds on errors
- Experimental results
- Benefits
- Capturing correlation (sampling)
- Sample size can be significantly smaller
(histograms)