Title: Query Optimization In Compressed Database Systems
1Query Optimization In Compressed Database
Systems
- Zhiyuan Chen and Johannes Gehrke
- Cornell University
- Flip Korn
- ATT Labs
2Why Compression?
- CPU speed outpaces Disk speed exponentially!
- x10 / decade (bandwidth), x100 / decade (latency)
- Trade CPU for I/O improve query performance
- Save bandwidth for sequential I/O
- Improve buffer pool hit ratio
- - Pay decompression cost
- Environment
- Decision support queries
- Lossless compression
3Issues
- Database compression methods
- Efficient query processing
4Database Compression Methods
- General-purpose compression
- Only compression ratio matters
- Large decompression unit (whole file)
- Database compression
- Both compression ratio and decompression cost
matter - Small decompression unit (attribute or tuple)
Our setting allow to decompress a single
attribute
5Efficient Query Processing
- Compared to uncompressed DB
- When to decompress
- Assumption no compression in query processing
- Our story
- Different strategies of when to decompress
- None of them is always optimal
- Combined optimization problem Query plan
decompression placement - Solutions
- Experiments
6Different Decompression Strategies
Eager
R.A S.B
Mem
Disk
R
S
7Which Strategy Is Optimal?
- Lazy vs. eager
- Lazy is always better
- Transient vs. Lazy
- Transient more I/O savings
- Lazy lower decompression cost
- In practice
- Numerical attributes transient is always better
- String attributes no clear winner
- Expensive to decompress
- High I/O savings if compressed
8An Example With TPCH Data
- Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
- From Supplier, Customer
- Where S_ADDRESS C_ADDRESS
- Order by S_NAME, C_NAME
Sort(S_N, C_N)
S_A C_A
Supplier
Customer
9Transient vs. Lazy
Lazy sort (7s)
1 attribute compressed
Lazy BNL (2s)
An optimization problem!
10Interactions With Traditional Optimization
Algorithm run System R, then decide when to
decompress.
Transient sort (3s)
3 attributes compressed
Lazy BNL (2s)
Optimal plan returned by System R is no longer
optimal!
11Compression Aware Optimization
- Given a query and a compressed DB Find the
optimal query plan - New operators
- Explicit decompression operators
- Transient versions of existing relational
operators - Search space O (nm) factor over old search space
- n is the depth of the plan
- m is the number of attributes
- Each attribute explicitly decompressed at most
once - For each attribute, n places to decompress
explicitly
12Dynamic Programming - OPT
- Extend system R optimizer
- Bottom up, one minimal plan per interesting
property - What attributes remain compressed as a new
property
Lazy BNL (2s)Property S_A, C_A uncompressed
Transient SM join (2.5s)Property all compressed
Customer
Supplier
Customer
Supplier
Blowup reduced from nm to 2m
13Min-K Heuristic Algorithm
- Store plans for k rather than 2m properties
- The k properties whose plans are cheapest
- Storage blowup reduced from 2m to k
- Time still exponential blowup in the worst case
Join on S_A, C_A
Stored plans
Lazy S_A, C_ATransient S_A, C_ALazy S_A,
transient C_ATransient S_A, Lazy C_A
S_A,
C_A,
14Min-K Heuristics (2)
- If transient decompression is bad for one join
attribute, often so for the other - BNL join both S_A and C_A decompressed N2 times
Stored plans
Join on S_A, C_A
Lazy S_A, C_A Transient S_A, C_A
S_A,
C_A,
15Experiments
- Setup
- Modify Predator query engine optimizer
- Algorithms
- Uncompressed, Eager, Lazy, Transient-Only,Two-Ste
p, OPT, Min-1, Min-2 - 100 MB TPCH data
- 50 compression ratio
- Pentium III 550 Mhz, vary buffer pool size
16Experimental Setup (2)
- Randomly add join conditions on string attributes
- Divide queries into workloads
- Number of string join conditions, number of join
tables - Metrics for algorithm X
- Average relative-cost
- Average(cost of plan returned by X / cost of opt
plan) - Average blowup factor
- Average( plans searched by X / plans by
System R)
17Average Relative Cost
Queries with 3-4 join tables, buffer pool 10 of
compressed DB
18Distribution of Query Performance
Percentage of Good plans (cost within twice of
OPT) for all queries
19Optimization Cost
Queries with 3-4 join tables
20Related Work
- How to compress
- RothHorn93, IyerWilhite94, Goldstein98
- How to query
- GraefeShapiro91, Westmann00, Greer99
- Query optimization
- Compressed MOLAP aggregates Li99
- Compressed Bitmap indicesAmer-YahiaJohnson00
- Expensive predicates
- ChaudhuriShim99, Hellerstein93
21Conclusions Future Work
- Novel optimization problem
- Search for regular query plan when to
decompress - Separate search sub-optimal
- OPT and Min-K heuristic
- Up to an order improvement in experiments
- Future work
- Caching decompressed values
- Updates
22Search Space
Sort(S_A)
- 3 extended plans (3 is depth)
-
- nm blow up over old space
- n depth of plan
- m number of attributes
S_A C_A
S_A,
23Relative-Cost - Varying Buffer Pool Size
Queries with 3- 4 join tables, 2 additional
string joins
24Relative Performance (2)
Queries with more than 5 join tables