Efficient Bitmap Indexing Techniques for Very Large Datasets - PowerPoint PPT Presentation

Slides: 20
Provided by: joh5150
Learn more at: https://sdm.lbl.gov
Transcript and Presenter's Notes

1
Efficient Bitmap Indexing Techniques for Very
Large Datasets
  • Kesheng John Wu
  • Ekow Otoo
  • Arie Shoshani

2
Problem Statement
  • Main objective: map logical requests to
    qualified objects
  • A logical request:
  • 20001015 < eventTime AND 200 < energy < 300
  • Objects:
  • Set of object ids
  • Set of files containing the objects
  • Offsets within the files, ...
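
As a point of reference, the mapping from a logical request to qualified objects can be sketched as a plain scan. This is only an illustrative baseline, not the indexing method of the talk; the records, file names, and offsets below are invented for the example.

```python
# Hypothetical tag records: (object id, eventTime, energy).
# All values are invented for illustration.
records = [
    (0, 20000827.011759, 150.0),
    (1, 20001016.120000, 250.0),
    (2, 20001101.093000, 310.0),
    (3, 20001120.180500, 299.5),
]

# Hypothetical catalog: object id -> (file name, byte offset within the file).
locations = {
    0: ("events_a.dat", 0),
    1: ("events_a.dat", 4096),
    2: ("events_b.dat", 0),
    3: ("events_b.dat", 4096),
}

def qualified_objects(records, min_time, energy_lo, energy_hi):
    """Ids of objects with min_time < eventTime AND energy_lo < energy < energy_hi."""
    return [oid for oid, event_time, energy in records
            if min_time < event_time and energy_lo < energy < energy_hi]

ids = qualified_objects(records, 20001015, 200, 300)
files = sorted({locations[oid][0] for oid in ids})
print(ids)    # [1, 3]
print(files)  # ['events_a.dat', 'events_b.dat']
```

A scan like this reads every record; the bitmap index described in the following slides aims to produce the same id set without touching all of the data.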

3
Application: STAR
OID  dst     hist    mEventNumber  mEventTime       mRunNumber  NLb
0    159625  159627  2635          20000827.011759  1239029     1341
1    159625  159627  2636          20000827.011759  1239029     1470
2    159625  159627  2637          20000827.011759  1239029     1663

OID  n_clus_tpc_in13  numberOfPrimaryTracks  ChargedParticles_Means1  PrimaryVertexX  qxb2    zdc2Energy
0    909              1228                   266                      .56             -26.40  48
1    1243             1415                   317                      .46             -29.08  53
2    1285             1533                   281                      .53             -6.754  8

A portion of the STAR tag dataset: 3 events with
12 attributes, from millions of events with 502
attributes.
4
Application: Combustion
  • Direct numerical simulation of auto-ignition
    process (solution of complex partial differential
    equations)
  • A dozen or more variables are computed at each
    time step and each grid point
  • Number of grid points: 2D 600 x 600 >>> 3D 1000 x
    1000 x 1000
  • Time steps: 100 >>> 1000s
  • Data size: 1 GB >>> 10 TB
  • Task: identify features and track them across
    time steps
  • E.g., find the flame front across time
  • Find 600 < temp < 700 for 1 billion points per time
    step, and discover overlap between time steps
  • Use compressed bitmaps to accelerate both feature
    extraction and feature tracking
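
The feature extraction and tracking idea can be sketched with one bitmap per time step and a bitwise AND between consecutive steps. This is a minimal sketch under stated assumptions: the bitmaps are uncompressed Python integers, the six-point grid and its temperatures are invented, and `region_bitmap` is a hypothetical helper name.

```python
def region_bitmap(temps, lo, hi):
    """Bitmap (Python int) with bit i set iff lo < temps[i] < hi."""
    bits = 0
    for i, t in enumerate(temps):
        if lo < t < hi:
            bits |= 1 << i
    return bits

# Hypothetical temperatures on a 6-point grid at two consecutive time steps.
step0 = [550, 620, 680, 710, 640, 500]
step1 = [560, 590, 650, 690, 720, 610]

front0 = region_bitmap(step0, 600, 700)   # grid points 1, 2, 4
front1 = region_bitmap(step1, 600, 700)   # grid points 2, 3, 5
overlap = front0 & front1                 # points in the region at both steps

points = [i for i in range(len(step0)) if overlap >> i & 1]
print(points)  # [2]
```

The overlap between time steps costs a single AND over the two bitmaps, independent of how the bitmaps were produced.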

5
Building a Bitmap Index
  • Partition each property into bins (binning)
  • e.g. for 0 < NLb < 4000, 20 equal-size bins: [0,
    200), [200, 400), ...
  • Generate a bit vector for each bin (encoding)
  • Bit i of bit vector j is 1 iff the NLb value of
    object i is in bin j
  • Compress each bit vector
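
The binning and encoding steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: bit vectors are held as uncompressed Python integers, the compression step is omitted, and `build_bitmap_index` is a hypothetical name. The three NLb values are the ones shown in the STAR sample.

```python
def build_bitmap_index(values, lo, hi, n_bins):
    """Equal-width binning over [lo, hi); returns one bitmap (int) per bin.

    Bit i of bitmaps[j] is 1 iff values[i] falls in bin j.
    """
    width = (hi - lo) / n_bins
    bitmaps = [0] * n_bins
    for i, v in enumerate(values):
        if not lo <= v < hi:
            raise ValueError(f"value {v} outside [{lo}, {hi})")
        # Clamp guards against floating-point rounding at the upper edge.
        j = min(int((v - lo) // width), n_bins - 1)
        bitmaps[j] |= 1 << i
    return bitmaps

nlb = [1341, 1470, 1663]                      # NLb of objects 0, 1, 2
index = build_bitmap_index(nlb, 0, 4000, 20)  # bins [0,200), [200,400), ...

non_empty = [j for j, b in enumerate(index) if b]
print(non_empty)       # [6, 7, 8]
print(bin(index[6]))   # 0b1   -> object 0 is in bin [1200, 1400)
```

Each object sets exactly one bit across all bin bitmaps, which is why the individual bit vectors are sparse and compress well.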

6
Advantages of Bitmap Index
  • Bitmap index: a specialized index that takes
    advantage of
  • Read-mostly data: data produced from scientific
    experiments can be appended in large groups
  • Fast operations
  • Predicate queries can be performed with bitwise
    logical operations
  • Predicate ops: =, <, >, <=, >=, range, ...
  • Logical ops: AND, OR, XOR, NOT
  • They are well supported by hardware
  • Easy to compress, potentially small index size
  • Each individual bitmap is small and frequently
    used ones can be cached in memory
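
A short sketch of how predicate queries reduce to bitwise logical operations. The per-bin bitmaps below are invented, and the helper names (`any_bin`, `negate`, `object_ids`) are hypothetical; bitmaps are uncompressed Python integers in which bit i refers to object i.

```python
from functools import reduce
from operator import or_

N = 8  # number of objects; bit i of every bitmap refers to object i

# Hypothetical per-bin bitmaps for two attributes (one bitmap per bin).
energy_bins = [0b00000011, 0b00001100, 0b00110000, 0b11000000]
time_bins = [0b01010101, 0b10101010]

def any_bin(bitmaps, first, last):
    """OR together bins first..last (inclusive): objects in that value range."""
    return reduce(or_, bitmaps[first:last + 1], 0)

def negate(bitmap, n=N):
    """NOT, restricted to the n valid object positions."""
    return ~bitmap & ((1 << n) - 1)

def object_ids(bitmap):
    """List the positions of the set bits."""
    return [i for i in range(N) if bitmap >> i & 1]

# Range predicate on energy (bins 1..2) AND equality predicate on time (bin 1).
hits = any_bin(energy_bins, 1, 2) & time_bins[1]
print(object_ids(hits))                    # [3, 5]

# Negated predicate: energy NOT in bin 0.
print(object_ids(negate(energy_bins[0])))  # [2, 3, 4, 5, 6, 7]
```

Python integers are unbounded, so NOT has to be masked to the number of objects; a fixed-width word-based implementation would handle the tail of the last word in a similar way.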

7
Operation-efficient Compression Methods
  • Best known: byte-aligned bitmap code (BBC)
  • Uses run-length encoding (next slide)
  • Byte alignment, optimized for space efficiency
  • Decoding on bit level, not optimal for operations
  • Used in Oracle
  • We developed a new word-aligned scheme: WAH
  • Uses run-length encoding
  • Word alignment
  • Designed for minimal decoding to gain speed

8
Operation-efficient Compression Methods
Based on variations of Run Length Compression
Uncompressed: 0000000000001111000000000......0000001000000001111111100000000....000000
Compressed: 12, 4, 1000, 1, 8, 1000
Store very short sequences as-is
Advantage: Can perform AND, OR, COUNT
operations on compressed data
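
To illustrate operating on compressed data, here is a minimal sketch using plain run-length encoding with (bit, run length) pairs. It is neither BBC nor WAH, both of which add byte or word alignment and store short sequences literally; all function names are hypothetical.

```python
def rle_encode(bits):
    """Encode a '0'/'1' string as a list of [bit, run_length] pairs."""
    runs = []
    for ch in bits:
        b = int(ch)
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs

def rle_and(a, b):
    """AND two run lists of equal total length without expanding them."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)          # longest stretch where both runs are constant
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step            # merge with the previous output run
        else:
            out.append([bit, step])
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

def rle_count(runs):
    """Number of 1 bits, computed from run lengths only."""
    return sum(length for bit, length in runs if bit)

x = rle_encode("0" * 12 + "1" * 4 + "0" * 9)
y = rle_encode("0" * 14 + "1" * 6 + "0" * 5)
z = rle_and(x, y)
print(x)             # [[0, 12], [1, 4], [0, 9]]
print(z)             # [[0, 14], [1, 2], [0, 9]]
print(rle_count(z))  # 2
```

The work done by `rle_and` is proportional to the number of runs rather than the number of bits, which is the property the slide highlights.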
9
Trade-off of Compression Schemes
[Figure: trade-off of compression schemes; labels include "uncompressed" and "WAH"]
10
Information About the Test Machines
  • Hardware and system
  • Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  • 4 GB RAM
  • VERITAS Volume Manager (striped disk)
  • Real application data from STAR
  • Over 2 million objects, 12 attributes
  • Synthetic data
  • 100 million objects, 10 attributes
  • Terms
  • Compression ratio: ratio of compressed bitmap
    size to uncompressed bitmap size
  • Times reported are wall-clock times in seconds

11
Logical Operation Time (Synthetic Data): 10X
improvement
12
Logical Operation Time (STAR Data): Also 10X
improvement
13
Encoding Schemes: Main Idea
[Figure: 12 bins (1 to 12) shown under interval encoding, range encoding, and equality encoding]
Interval and range encodings operate on 2 bins only!
14
Total Effect of Compression and Encoding Schemes
  • Bottom line on queries:
  • Compression scheme determines efficiency of
    logical operations
  • Encoding scheme determines number of operations
  • Range and interval: only one logical operation
    over 2 bitmaps
  • Equality: many operations, depending on number of
    bins
  • But, space may be a consideration
  • What is the trade-off?
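
The difference in operation counts can be sketched for equality and range encoding. Interval encoding is not shown here. The sketch assumes uncompressed Python integers as bitmaps, 12 bins numbered 0 to 11, and invented bin assignments; all function names are hypothetical.

```python
def equality_encode(bin_of, n_bins):
    """E[j] has bit i set iff object i is in bin j."""
    E = [0] * n_bins
    for i, j in enumerate(bin_of):
        E[j] |= 1 << i
    return E

def range_encode(bin_of, n_bins):
    """R[j] has bit i set iff object i is in some bin <= j."""
    R, acc = [], 0
    for e in equality_encode(bin_of, n_bins):
        acc |= e
        R.append(acc)
    return R

def query_equality(E, lo, hi):
    """Objects in bins lo..hi: ORs hi - lo + 1 bitmaps (hi - lo operations)."""
    result, ops = E[lo], 0
    for j in range(lo + 1, hi + 1):
        result |= E[j]
        ops += 1
    return result, ops

def query_range(R, lo, hi):
    """Objects in bins lo..hi: at most one operation over two bitmaps."""
    if lo == 0:
        return R[hi], 0
    # R[lo - 1] is a subset of R[hi], so XOR removes exactly those objects.
    return R[hi] ^ R[lo - 1], 1

bin_of = [0, 3, 7, 11, 5, 3, 9, 1]   # hypothetical bin number of each object
E = equality_encode(bin_of, 12)
R = range_encode(bin_of, 12)

print(query_equality(E, 3, 9))  # (118, 6)
print(query_range(R, 3, 9))     # (118, 1)
```

Both encodings return the same bitmap (objects 1, 2, 4, 5, 6), but the equality-encoded index needs six ORs where the range-encoded index needs one XOR. The price is that range-encoded bitmaps are denser, which is the space side of the trade-off.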

15
Interval Encoding Is Better Overall (WAH
Compression)
Points on the graphs represent 10, 20, 30, 50, and
100 bins.
Average time for random range queries
16
Timing Results
Method                              Index (x data)  Time (sec)  Speed
ORACLE Scan                         0               6           0.1
ORACLE B-tree                       3.6             0.95        0.6
Native vertical partition Scan      0               0.57        1
Native vertical partition 20 bins   0.18            0.11        5
Native vertical partition 50 bins   0.43            0.07        8
Native vertical partition 100 bins  0.90            0.05        11
17
Summary
  • Compressed bitmap indices are effective for range
    queries
  • Better compression scheme
  • 50% more space, but 12 times faster!
  • Among the different encoding schemes
  • Interval encoding is the overall winner

18
Future Work
  • Support NULL values and categorical values
  • On-line update: add new data and update index
    without interrupting request processing
  • Recovery mechanism for robustness
  • Potential new applications: climate,
    astrophysics, biology (microarrays)
  • Study non-uniform binning strategies
  • Study more encoding schemes
  • Integrate with a conventional database system to
    better handle metadata and to provide a more
    versatile front-end

19
How Many Bins for Continuous Domains?
More bins: fewer objects in edge bins
Searching edge bins: skip-scan over attribute
vertical partition
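
The effect of bin count on edge bins can be sketched as below. This is an illustrative sketch only: lists of object ids stand in for the per-bin bitmaps, the temperature-like values are invented, and `build_bins` and `range_query` are hypothetical names. Objects in bins that lie entirely inside the query range qualify directly; objects in edge bins are candidates whose raw values must be read.

```python
def build_bins(values, start, width, n_bins):
    """Group object ids by equal-width bin [start + j*width, start + (j+1)*width)."""
    bins = [[] for _ in range(n_bins)]
    for i, v in enumerate(values):
        bins[int((v - start) // width)].append(i)
    return bins

def range_query(values, bins, start, width, q_lo, q_hi):
    """Return (sorted ids with q_lo < value < q_hi, number of raw values read)."""
    hits, raw_reads = [], 0
    for j, ids in enumerate(bins):
        b_lo, b_hi = start + j * width, start + (j + 1) * width
        if b_hi <= q_lo or b_lo >= q_hi:
            continue                  # bin cannot contain a match
        if q_lo < b_lo and b_hi <= q_hi:
            hits.extend(ids)          # interior bin: every object qualifies
        else:
            for i in ids:             # edge bin: check candidates against raw data
                raw_reads += 1
                if q_lo < values[i] < q_hi:
                    hits.append(i)
    return sorted(hits), raw_reads

values = [612.0, 655.5, 699.0, 701.2, 640.0, 598.0, 688.8, 605.0]

coarse = build_bins(values, 500, 100, 3)   # 3 bins of width 100
fine = build_bins(values, 500, 10, 30)     # 30 bins of width 10

print(range_query(values, coarse, 500, 100, 610, 690))  # ([0, 1, 4, 6], 6)
print(range_query(values, fine, 500, 10, 610, 690))     # ([0, 1, 4, 6], 1)
```

With 3 bins the whole answer comes from one edge bin and six raw values are read; with 30 bins only one raw value is read, at the cost of maintaining ten times as many bitmaps.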