Strategies for processing ad hoc queries - PowerPoint PPT Presentation

About This Presentation
Title:

Strategies for processing ad hoc queries

Description:

We have shown that BBC is CPU-bound rather than I/O-bound as assumed in the past. WAH is much more (10X) CPU-efficient than BBC ... – PowerPoint PPT presentation

Number of Views:176
Avg rating:3.0/5.0
Slides: 25
Provided by: joh51
Learn more at: http://web.cs.wpi.edu
Category:

less

Transcript and Presenter's Notes

Title: Strategies for processing ad hoc queries


1
Strategies for ProcessingAd Hoc Querieson Large
Data Warehouses
Presented by Fan Wu Instructor Prof. Elke
Rundensteiner April 8, 2004
2
Outline
  • Motivation for designing software
  • Many large scientific data warehouses need to
    process ad hoc queries
  • Lack of efficient indices
  • Issues to discuss
  • Vertical partitioning
  • Bitmap index
  • Compression how to store the bitmaps
  • Persistent storage where to store the bitmaps

3
Example High-Energy Physics Experiment STAR
  • Current data size
  • 20 million collision events
  • each event 10 KB in size
  • Production data rate
  • 100 million records / year
  • 1 TB per year
  • Scientists may query any of the 500 or so
    attributes
  • Each query may involve conditions on 5 8
    attributes
  • Energy gt 100 Particles gt 500
  • Near real-time evaluation desired

4
Many Scientific Applications Involve Large
Datasets
  • Sloan Digital Sky Survey http//www.sdss.org
  • Earth Observing System http//eos.nasa.gov
  • Large Hadron Collider http//lhc.web.cern.ch
  • Genomes to life http//doegenomestolife.org
  • Combustion http//scidac.psc.edu
  • PCMDI http//www-pcmdi.llnl.gov

5
Searching and Indexing Requirements
  • Some common features of the large scientific
    datasets
  • Read-mostly data warehouses
  • Large high-dimensional data millions or billions
    of records, each record with tens or hundreds of
    attributes
  • Many queries are high-dimensional partial range
    queries
  • Most users desire to modify queries interactively
  • Existing database software not specialized for
    these tasks slow
  • Need new special purpose software
  • BMI bitmap index, CERN
  • IBIS independent bitmap index and search, LBNL

6
Issues to Be Discussed
  • Organization of the primary data, i.e., the user
    data
  • Viewing the primary data as a 2-D table
  • Horizontal partition used in transactional
    systems
  • Vertical partition good for partial range
    queries
  • Indexing strategies
  • Tree based schemes not effective for dimensions
    gt 10
  • Bitmap index well suited for partial range
    queries
  • Storage scheme for the index data
  • BMI Store bitmaps as objects in an
    object-oriented database (ODBMS)
  • IBIS Store bitmaps as simple files

7
Horizontal vs. Vertical Partitioning
  • Horizontal partitioning
  • Data elements of a record are stored
    consecutively
  • Good for accessing one record at a time
  • Used in relational DBMS systems where records are
    frequently updated
  • Typically 6070 of bytes of each page is used
  • Vertical partitioning
  • All records of an attribute are stored
    consecutively
  • Good for accessing multiple records by attribute
    selection
  • Suitable for data warehousing systems where
    records are rarely modified
  • May use 100 of bytes of each page

8
Performance Advantage of Vertical Partitioning
  • Experiment with 2.2 million records of STAR data
    (10 attributes only)
  • The figure on the right shows the time to search
    without an index
  • Query box size is the relative volume of the
    hypercube formed by range conditions
  • The disk system supports about 20 MB/s sustained
    reading
  • For answering a query like A gt 5, the time used
    by a relational DBMS is proportional to number of
    attributes in the table
  • 500 attributes, 500 times slower

Vertical partitioning is effective for partial
range queries
9
Brief Overview of Index Data Structures
  • One dimensional index data structures
  • Total order for one-dimension
  • Hash-based Optimized for exact match queries,
    e.g. E 106
  • Tree-based Optimized for range queries, e.g. E lt
    106
  • Most widely used B-tree (1972)
  • Multidimensional index data structures
  • No total order for all dimensions
  • Hash-based Grid-File, Bang-File,
  • Tree based R-Trees, Pyramid-Tree,
  • Bitmap Indices Effective for data warehousing
    environments
  • Linearize to introduce total order, then use
    one-dimensional indices

10
Basic Bitmap Index
a) List of attributes b) Bitmap Index (equality
encoding)
Bit Slice E2 encodesattributes with value 2
a) List of 12 attributes with 10 distinct
attribute values, i.e attribute cardinality 10
b) For each distinct attribute value, one bit
slice is created, i.e bitmap index consists of 10
bitmaps (E0 to E9)
11
Pros and Cons of Bitmap Indices
  • Pros
  • Easy to build and to maintain
  • Easy to identify records that satisfy a complex
    multi-attribute predicate (multi-dimensional
    ad-hoc queries)
  • Very space efficient for attributes with low
    cardinality (number of distinct attribute values,
    e.g. Yes, No)
  • Cons
  • Space inefficient for attributes with high
    cardinality
  • An effective strategy Bitmap Compression
  • Other strategies binning, encoding

12
Bitmap Compression
  • Advantages
  • Less disk space for storing indices
  • Indices can be read from disk faster
  • More indices can be cached in memory
  • Possible problems
  • Increases the complexity of the software
  • If bitmaps must be decompressed before performing
    Boolean operations, the decompression overhead
    might outweigh the advantages of compression
  • Use compression schemes that work directly on
    compressed data

13
Various Bitmap Compression Algorithms
  • Run Length Encoding (RLE)
  • one-sided (asymmetric) vs. two-sided (symmetric)
  • Gzip (Lempel-Ziv, LZ)
  • verbatim (uncompressed) bitmap is compressed via
    zlib
  • ExpGol
  • Variable bit length encoding (RLE-bitmap is
    compressed)
  • Byte-Aligned Bitmap Compression (BBC)
  • Variable byte length encoding (Oracle patent)
  • One-sided vs. two-sided (BBC1 vs. BBC2)
  • Word-Aligned Hybrid (WAH)
  • Fixed word based encoding

14
Relative Strength of Different Compression Schemes
15
WAH Compression Bitmap Index Implementations
  • Compression Schemes
  • Designed for reducing the CPU-complexity of
    logical operations when compared to BBC, 10 X
    speedup
  • However, lower compression factor, i.e. the sizes
    of the WAH-compressed bitmaps are some 40-60
    larger than BBC-compressed bitmaps
  • Storage scheme
  • BMI Bitmap Index implementation on top of ODBMS
    (CERN)
  • IBIS Bitmap Index implementation based on plain
    files (LBL)

16
Test Setup
  • Real application data (STAR) 2.2 million
    records
  • Synthetic dataset I 100 million records
  • Synthetic dataset II 5 million records
  • Only the performance of the bitwise logical
    operation AND is reported
  • Other logical operations such as OR, XOR, etc.
    show similar relative differences
  • Most of the benchmarks were executed on three
    different machines with various CPU and I/O
    subsystems

17
In Memory Logical OperationAND
On dm, 450MHz UltraSPARC
WAH is always the fastest, 2X 20X
18
Search Time (Including File IO)
On dm, 20MB/s IO
On tin, 2MB/s IO
To answer the queries read two bitmaps from
files, perform one logical AND Unless using a
very slow disk, it is worth-while to use WAH
compression
19
With BBC, Searching Operation Spends Little Time
in IO
On dm, 20MB/s IO
On tin, 2MB/s IO
  • The percentage of time spent in IO on different
    bitmaps
  • This percentage is expected to be high, but it is
    actually low with BBC
  • WAH reduce CPU time, and searching is again IO
    bound

20
Sizes of Compressed Bitmaps
BBC-s simplified (LBL)BBC-f full (ATT CERN)
The total size of a bitmap index compressed with
WAH is typically 40-60 larger than that
compressed with BBC
21
Sizes of Compressed Bitmaps
  • The figure on the right plot the maximum size of
    the bitmap index against the attribute
    cardinality of an attribute with 100 million
    (108) records
  • In the worst case, the size of the compressed
    bitmap index is about 400 million words, 4 times
    the size of the primary data
  • For most high-cardinality attributes, the
    compressed bitmap index size is smaller than that
    of a typical B-tree index( 3X primary data)

B-tree
The compressed bitmap index sizes are usually
smaller than B-tree
22
Query PerformanceIBIS vs. RDBMS
  • Accessing bitmaps in files (IBIS) has about the
    same efficiency as accessing bitmaps within an
    RDBMS
  • The DBMS tested uses a BBC compressed bitmap
    index similar to BBC compressed index
  • Used real application data

Size(MB) Create(sec) Query(sec)
IBIS WAH 166 91 0.7
IBIS BBC-s 117 116 2.9
RDBMS 123(247) 2890 3.1
WAH compressed index is 4X more efficient than
BBC compressed index
23
Query PerformanceFile (IBIS) vs. ODBMS (BMI)
  • Figures on the left time needed to process
    5-dimensional queries on tin
  • Queries on synthetic data
  • IBIS with WAH uses the least amount of time
  • ODBMS overhead 4X
  • Due to file system caching, IBIS is 10X faster
    on files that have been accessed before (warm
    files)

b) warm files
a) cold files
24
Conclusions
  • We have shown that BBC is CPU-bound rather than
    I/O-bound as assumed in the past
  • WAH is much more (10X) CPU-efficient than BBC
  • Building bitmap indices on top of ODBMS
    introduces about 4X overhead when compared to
    using plain files
  • Building bitmap indices inside DBMS (as in many
    commercial systems) shows higher efficiency
  • Processing multi-dimensional range queries is
    efficient with WAH compressed bitmap indices
  • Read-only data should be vertically partitioned
Write a Comment
User Comments (0)
About PowerShow.com