Strategies for processing ad hoc queries - PowerPoint PPT Presentation

About This Presentation

Title:

Strategies for processing ad hoc queries

Description:

We have shown that BBC is CPU-bound rather than I/O-bound as assumed in the past. WAH is much more (10X) CPU-efficient than BBC ... – PowerPoint PPT presentation

Number of Views:176

Avg rating:3.0/5.0

Slides: 25

Provided by: joh51

Learn more at: http://web.cs.wpi.edu

Category:

more less

Transcript and Presenter's Notes

Title: Strategies for processing ad hoc queries

1
Strategies for ProcessingAd Hoc Querieson Large
Data Warehouses
Presented by Fan Wu Instructor Prof. Elke
Rundensteiner April 8, 2004
2
Outline

Motivation for designing software
Many large scientific data warehouses need to
process ad hoc queries
Lack of efficient indices
Issues to discuss
Vertical partitioning
Bitmap index
Compression how to store the bitmaps
Persistent storage where to store the bitmaps

3
Example High-Energy Physics Experiment STAR

Current data size
20 million collision events
each event 10 KB in size
Production data rate
100 million records / year
1 TB per year
Scientists may query any of the 500 or so
attributes
Each query may involve conditions on 5 8
attributes
Energy gt 100 Particles gt 500
Near real-time evaluation desired

4
Many Scientific Applications Involve Large
Datasets

Sloan Digital Sky Survey http//www.sdss.org
Earth Observing System http//eos.nasa.gov
Large Hadron Collider http//lhc.web.cern.ch
Genomes to life http//doegenomestolife.org
Combustion http//scidac.psc.edu
PCMDI http//www-pcmdi.llnl.gov

5
Searching and Indexing Requirements

Some common features of the large scientific
datasets
Read-mostly data warehouses
Large high-dimensional data millions or billions
of records, each record with tens or hundreds of
attributes
Many queries are high-dimensional partial range
queries
Most users desire to modify queries interactively
Existing database software not specialized for
these tasks slow
Need new special purpose software
BMI bitmap index, CERN
IBIS independent bitmap index and search, LBNL

6
Issues to Be Discussed

Organization of the primary data, i.e., the user
data
Viewing the primary data as a 2-D table
Horizontal partition used in transactional
systems
Vertical partition good for partial range
queries
Indexing strategies
Tree based schemes not effective for dimensions
gt 10
Bitmap index well suited for partial range
queries
Storage scheme for the index data
BMI Store bitmaps as objects in an
object-oriented database (ODBMS)
IBIS Store bitmaps as simple files

7
Horizontal vs. Vertical Partitioning

Horizontal partitioning
Data elements of a record are stored
consecutively
Good for accessing one record at a time
Used in relational DBMS systems where records are
frequently updated
Typically 6070 of bytes of each page is used

Vertical partitioning
All records of an attribute are stored
consecutively
Good for accessing multiple records by attribute
selection
Suitable for data warehousing systems where
records are rarely modified
May use 100 of bytes of each page

8
Performance Advantage of Vertical Partitioning

Experiment with 2.2 million records of STAR data
(10 attributes only)
The figure on the right shows the time to search
without an index
Query box size is the relative volume of the
hypercube formed by range conditions
The disk system supports about 20 MB/s sustained
reading
For answering a query like A gt 5, the time used
by a relational DBMS is proportional to number of
attributes in the table
500 attributes, 500 times slower

Vertical partitioning is effective for partial
range queries
9
Brief Overview of Index Data Structures

One dimensional index data structures
Total order for one-dimension
Hash-based Optimized for exact match queries,
e.g. E 106
Tree-based Optimized for range queries, e.g. E lt
106
Most widely used B-tree (1972)
Multidimensional index data structures
No total order for all dimensions
Hash-based Grid-File, Bang-File,
Tree based R-Trees, Pyramid-Tree,
Bitmap Indices Effective for data warehousing
environments
Linearize to introduce total order, then use
one-dimensional indices

10
Basic Bitmap Index
a) List of attributes b) Bitmap Index (equality
encoding)
Bit Slice E2 encodesattributes with value 2
a) List of 12 attributes with 10 distinct
attribute values, i.e attribute cardinality 10
b) For each distinct attribute value, one bit
slice is created, i.e bitmap index consists of 10
bitmaps (E0 to E9)
11
Pros and Cons of Bitmap Indices

Pros
Easy to build and to maintain
Easy to identify records that satisfy a complex
multi-attribute predicate (multi-dimensional
ad-hoc queries)
Very space efficient for attributes with low
cardinality (number of distinct attribute values,
e.g. Yes, No)
Cons
Space inefficient for attributes with high
cardinality
An effective strategy Bitmap Compression
Other strategies binning, encoding

12
Bitmap Compression

Advantages
Less disk space for storing indices
Indices can be read from disk faster
More indices can be cached in memory
Possible problems
Increases the complexity of the software
If bitmaps must be decompressed before performing
Boolean operations, the decompression overhead
might outweigh the advantages of compression
Use compression schemes that work directly on
compressed data

13
Various Bitmap Compression Algorithms

Run Length Encoding (RLE)
one-sided (asymmetric) vs. two-sided (symmetric)
Gzip (Lempel-Ziv, LZ)
verbatim (uncompressed) bitmap is compressed via
zlib
ExpGol
Variable bit length encoding (RLE-bitmap is
compressed)
Byte-Aligned Bitmap Compression (BBC)
Variable byte length encoding (Oracle patent)
One-sided vs. two-sided (BBC1 vs. BBC2)
Word-Aligned Hybrid (WAH)
Fixed word based encoding

14
Relative Strength of Different Compression Schemes
15
WAH Compression Bitmap Index Implementations

Compression Schemes
Designed for reducing the CPU-complexity of
logical operations when compared to BBC, 10 X
speedup
However, lower compression factor, i.e. the sizes
of the WAH-compressed bitmaps are some 40-60
larger than BBC-compressed bitmaps
Storage scheme
BMI Bitmap Index implementation on top of ODBMS
(CERN)
IBIS Bitmap Index implementation based on plain
files (LBL)

16
Test Setup

Real application data (STAR) 2.2 million
records
Synthetic dataset I 100 million records
Synthetic dataset II 5 million records
Only the performance of the bitwise logical
operation AND is reported
Other logical operations such as OR, XOR, etc.
show similar relative differences
Most of the benchmarks were executed on three
different machines with various CPU and I/O
subsystems

17
In Memory Logical OperationAND
On dm, 450MHz UltraSPARC
WAH is always the fastest, 2X 20X
18
Search Time (Including File IO)
On dm, 20MB/s IO
On tin, 2MB/s IO
To answer the queries read two bitmaps from
files, perform one logical AND Unless using a
very slow disk, it is worth-while to use WAH
compression
19
With BBC, Searching Operation Spends Little Time
in IO
On dm, 20MB/s IO
On tin, 2MB/s IO

The percentage of time spent in IO on different
bitmaps
This percentage is expected to be high, but it is
actually low with BBC
WAH reduce CPU time, and searching is again IO
bound

20
Sizes of Compressed Bitmaps
BBC-s simplified (LBL)BBC-f full (ATT CERN)
The total size of a bitmap index compressed with
WAH is typically 40-60 larger than that
compressed with BBC
21
Sizes of Compressed Bitmaps

The figure on the right plot the maximum size of
the bitmap index against the attribute
cardinality of an attribute with 100 million
(108) records
In the worst case, the size of the compressed
bitmap index is about 400 million words, 4 times
the size of the primary data
For most high-cardinality attributes, the
compressed bitmap index size is smaller than that
of a typical B-tree index( 3X primary data)

B-tree
The compressed bitmap index sizes are usually
smaller than B-tree
22
Query PerformanceIBIS vs. RDBMS

Accessing bitmaps in files (IBIS) has about the
same efficiency as accessing bitmaps within an
RDBMS
The DBMS tested uses a BBC compressed bitmap
index similar to BBC compressed index
Used real application data

Size(MB) Create(sec) Query(sec)
IBIS WAH 166 91 0.7
IBIS BBC-s 117 116 2.9
RDBMS 123(247) 2890 3.1
WAH compressed index is 4X more efficient than
BBC compressed index
23
Query PerformanceFile (IBIS) vs. ODBMS (BMI)

Figures on the left time needed to process
5-dimensional queries on tin
Queries on synthetic data
IBIS with WAH uses the least amount of time
ODBMS overhead 4X
Due to file system caching, IBIS is 10X faster
on files that have been accessed before (warm
files)

b) warm files
a) cold files
24
Conclusions

We have shown that BBC is CPU-bound rather than
I/O-bound as assumed in the past
WAH is much more (10X) CPU-efficient than BBC
Building bitmap indices on top of ODBMS
introduces about 4X overhead when compared to
using plain files
Building bitmap indices inside DBMS (as in many
commercial systems) shows higher efficiency
Processing multi-dimensional range queries is
efficient with WAH compressed bitmap indices
Read-only data should be vertically partitioned