Query Optimization In Compressed Database Systems - PowerPoint PPT Presentation

About This Presentation
Title:

Query Optimization In Compressed Database Systems

Description:

Query Optimization In Compressed Database Systems Zhiyuan Chen and Johannes Gehrke Cornell University Flip Korn AT&T Labs Why Compression? CPU speed outpaces Disk ... – PowerPoint PPT presentation

Number of Views:150
Avg rating:3.0/5.0
Slides: 25
Provided by: zhiyua2
Category:

less

Transcript and Presenter's Notes

Title: Query Optimization In Compressed Database Systems


1
Query Optimization In Compressed Database
Systems
  • Zhiyuan Chen and Johannes Gehrke
  • Cornell University
  • Flip Korn
  • ATT Labs

2
Why Compression?
  • CPU speed outpaces Disk speed exponentially!
  • x10 / decade (bandwidth), x100 / decade (latency)
  • Trade CPU for I/O improve query performance
  • Save bandwidth for sequential I/O
  • Improve buffer pool hit ratio
  • - Pay decompression cost
  • Environment
  • Decision support queries
  • Lossless compression

3
Issues
  • Database compression methods
  • Efficient query processing

4
Database Compression Methods
  • General-purpose compression
  • Only compression ratio matters
  • Large decompression unit (whole file)
  • Database compression
  • Both compression ratio and decompression cost
    matter
  • Small decompression unit (attribute or tuple)

Our setting allow to decompress a single
attribute
5
Efficient Query Processing
  • Compared to uncompressed DB
  • When to decompress
  • Assumption no compression in query processing
  • Our story
  • Different strategies of when to decompress
  • None of them is always optimal
  • Combined optimization problem Query plan
    decompression placement
  • Solutions
  • Experiments

6
Different Decompression Strategies
Eager
R.A S.B
Mem
Disk
R
S
7
Which Strategy Is Optimal?
  • Lazy vs. eager
  • Lazy is always better
  • Transient vs. Lazy
  • Transient more I/O savings
  • Lazy lower decompression cost
  • In practice
  • Numerical attributes transient is always better
  • String attributes no clear winner
  • Expensive to decompress
  • High I/O savings if compressed

8
An Example With TPCH Data
  • Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
  • From Supplier, Customer
  • Where S_ADDRESS C_ADDRESS
  • Order by S_NAME, C_NAME

Sort(S_N, C_N)
S_A C_A
Supplier
Customer
9
Transient vs. Lazy
Lazy sort (7s)
1 attribute compressed
Lazy BNL (2s)
An optimization problem!
10
Interactions With Traditional Optimization
Algorithm run System R, then decide when to
decompress.
Transient sort (3s)
3 attributes compressed
Lazy BNL (2s)
Optimal plan returned by System R is no longer
optimal!
11
Compression Aware Optimization
  • Given a query and a compressed DB Find the
    optimal query plan
  • New operators
  • Explicit decompression operators
  • Transient versions of existing relational
    operators
  • Search space O (nm) factor over old search space
  • n is the depth of the plan
  • m is the number of attributes
  • Each attribute explicitly decompressed at most
    once
  • For each attribute, n places to decompress
    explicitly

12
Dynamic Programming - OPT
  • Extend system R optimizer
  • Bottom up, one minimal plan per interesting
    property
  • What attributes remain compressed as a new
    property

Lazy BNL (2s)Property S_A, C_A uncompressed
Transient SM join (2.5s)Property all compressed
Customer
Supplier
Customer
Supplier
Blowup reduced from nm to 2m
13
Min-K Heuristic Algorithm
  • Store plans for k rather than 2m properties
  • The k properties whose plans are cheapest
  • Storage blowup reduced from 2m to k
  • Time still exponential blowup in the worst case

Join on S_A, C_A
Stored plans
Lazy S_A, C_ATransient S_A, C_ALazy S_A,
transient C_ATransient S_A, Lazy C_A
S_A,
C_A,
14
Min-K Heuristics (2)
  • If transient decompression is bad for one join
    attribute, often so for the other
  • BNL join both S_A and C_A decompressed N2 times
  • Only consider two cases

Stored plans
Join on S_A, C_A
Lazy S_A, C_A Transient S_A, C_A
S_A,
C_A,
  • Time blowup is 2k

15
Experiments
  • Setup
  • Modify Predator query engine optimizer
  • Algorithms
  • Uncompressed, Eager, Lazy, Transient-Only,Two-Ste
    p, OPT, Min-1, Min-2
  • 100 MB TPCH data
  • 50 compression ratio
  • Pentium III 550 Mhz, vary buffer pool size

16
Experimental Setup (2)
  • Randomly add join conditions on string attributes
  • Divide queries into workloads
  • Number of string join conditions, number of join
    tables
  • Metrics for algorithm X
  • Average relative-cost
  • Average(cost of plan returned by X / cost of opt
    plan)
  • Average blowup factor
  • Average( plans searched by X / plans by
    System R)

17
Average Relative Cost
Queries with 3-4 join tables, buffer pool 10 of
compressed DB
18
Distribution of Query Performance
Percentage of Good plans (cost within twice of
OPT) for all queries
19
Optimization Cost
Queries with 3-4 join tables
20
Related Work
  • How to compress
  • RothHorn93, IyerWilhite94, Goldstein98
  • How to query
  • GraefeShapiro91, Westmann00, Greer99
  • Query optimization
  • Compressed MOLAP aggregates Li99
  • Compressed Bitmap indicesAmer-YahiaJohnson00
  • Expensive predicates
  • ChaudhuriShim99, Hellerstein93

21
Conclusions Future Work
  • Novel optimization problem
  • Search for regular query plan when to
    decompress
  • Separate search sub-optimal
  • OPT and Min-K heuristic
  • Up to an order improvement in experiments
  • Future work
  • Caching decompressed values
  • Updates

22
Search Space
Sort(S_A)
  • 3 extended plans (3 is depth)
  • nm blow up over old space
  • n depth of plan
  • m number of attributes

S_A C_A
S_A,
23
Relative-Cost - Varying Buffer Pool Size
Queries with 3- 4 join tables, 2 additional
string joins
24
Relative Performance (2)
Queries with more than 5 join tables
Write a Comment
User Comments (0)
About PowerShow.com