Title: Efficient Bitmap Indexing Techniques for Very Large Datasets
1 Efficient Bitmap Indexing Techniques for Very Large Datasets
- Kesheng John Wu
- Ekow Otoo
- Arie Shoshani
2 Problem Statement
- Main objective: map logical requests to qualified objects
- A logical request
- 20001015 < eventTime AND 200 < energy < 300
- Objects
- Set of object ids
- Set of files containing the objects
- Offsets within the files,
3 Application: STAR
OID  dst     hist    mEventNumber  mEventTime       mRunNumber  NLb
0    159625  159627  2635          20000827.011759  1239029     1341
1    159625  159627  2636          20000827.011759  1239029     1470
2    159625  159627  2637          20000827.011759  1239029     1663
OID  n_clus_tpc_in13  numberOfPrimaryTracks  ChargedParticles_Means1  PrimaryVertexX  qxb2    zdc2Energy
0    909              1228                   266                      .56             -26.40  48
1    1243             1415                   317                      .46             -29.08  53
2    1285             1533                   281                      .53             -6.754  8
A portion of the STAR tag dataset: 3 events with 12 attributes, taken from millions of events with 502 attributes.
4 Application: Combustion
- Direct numerical simulation of auto-ignition process (solution of complex partial differential equations)
- A dozen or more variables are computed at each time step and each grid point
- Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
- Time steps: 100 >>> 1000s
- Data size: 1 GB >>> 10 TB
- Task: identify features and track them across time steps
- E.g. find flame front across time
- Find 600 < temp < 700 for 1 billion points per time step, and discover overlap between time steps
- Use compressed bitmaps to accelerate both feature extraction and feature tracking
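The feature-tracking idea above can be sketched in a few lines: one bitmap per time step marks the grid points with 600 < temp < 700, and a bitwise AND of two such bitmaps gives the overlap between time steps. This is a minimal illustration, not the authors' code. The temperature values and the helper names `region_bitmap` and `set_bits` are hypothetical, and the bitmaps are built here by a direct scan rather than from a compressed index.

```python
# Hypothetical toy data: temperature at 8 grid points for two time steps.
temp_t0 = [550, 610, 650, 690, 720, 640, 590, 605]
temp_t1 = [560, 590, 660, 695, 680, 710, 601, 500]

def region_bitmap(values, low, high):
    """Return an int whose bit i is 1 iff low < values[i] < high."""
    bits = 0
    for i, v in enumerate(values):
        if low < v < high:
            bits |= 1 << i
    return bits

def set_bits(bitmap):
    """List the positions of the 1 bits, lowest first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

front_t0 = region_bitmap(temp_t0, 600, 700)  # feature at time step 0
front_t1 = region_bitmap(temp_t1, 600, 700)  # feature at time step 1
overlap = front_t0 & front_t1                # grid points in the feature at both steps

print(set_bits(front_t0), set_bits(front_t1), set_bits(overlap))
```

Python integers stand in for bit vectors here; with a real index, each per-step bitmap would come from the index instead of a scan.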
5 Building a Bitmap Index
- Partition each property into bins (binning)
- e.g. for 0 < NLb < 4000, 20 equal-size bins: [0, 200), [200, 400), ...
- Generate a bit vector for each bin (encoding)
- Bit i of bit vector j is 1 iff NLb[i] is in bin j
- Compress each bit vector
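The binning and encoding steps above can be sketched as follows, using the NLb values of the three STAR events from slide 3 and the 20 equal-size bins over [0, 4000). The function name `build_bitmap_index` is hypothetical, Python integers stand in for bit vectors, and the compression step is left out. Clamping out-of-range values to the first or last bin is an assumption of this sketch.

```python
def build_bitmap_index(values, low, high, num_bins):
    """Equality-encoded index: bitmaps[j] has bit i set iff values[i] falls in bin j."""
    width = (high - low) / num_bins
    bitmaps = [0] * num_bins
    for i, v in enumerate(values):
        j = int((v - low) // width)
        j = min(max(j, 0), num_bins - 1)  # clamp out-of-range values (assumption)
        bitmaps[j] |= 1 << i
    return bitmaps

nlb = [1341, 1470, 1663]                   # NLb of OID 0, 1, 2 in the STAR sample
index = build_bitmap_index(nlb, 0, 4000, 20)

# Bin 6 is [1200, 1400), bin 7 is [1400, 1600), bin 8 is [1600, 1800).
for j in (6, 7, 8):
    print(j, format(index[j], "03b"))
```

Each object sets exactly one bit across all bin bitmaps, which is why the individual bitmaps are sparse and tend to compress well.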
6 Advantages of Bitmap Index
- Bitmap index: a specialized index that takes advantage of
- Read-mostly data: data produced from scientific experiments can be appended in large groups
- Fast operations
- Predicate queries can be performed with bitwise logical operations
- Predicate ops: =, <, >, <=, >=, range, ...
- Logical ops: AND, OR, XOR, NOT
- They are well supported by hardware
- Easy to compress, potentially small index size
- Each individual bitmap is small, and frequently used ones can be cached in memory
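To illustrate how predicates reduce to bitwise logical operations, the sketch below answers a range predicate with OR, a conjunction with AND, and a negation with NOT. The bitmaps, attribute names, and helper functions are hypothetical. NOT is masked to the number of rows because Python integers are unbounded.

```python
NUM_ROWS = 8
ALL_ROWS = (1 << NUM_ROWS) - 1  # mask with one bit per object

# Hypothetical equality-encoded bitmaps: bit i is set iff object i falls in that bin.
energy_bins = [0b00000011, 0b00001100, 0b00110000, 0b11000000]
time_bins = [0b01010101, 0b10101010]

def bins_or(bitmaps, first, last):
    """Range predicate over whole bins: OR the bitmaps of bins first..last."""
    result = 0
    for j in range(first, last + 1):
        result |= bitmaps[j]
    return result

def rows(bitmap):
    """Object ids whose bit is set."""
    return [i for i in range(NUM_ROWS) if (bitmap >> i) & 1]

energy_mid = bins_or(energy_bins, 1, 2)  # range predicate -> OR
late = time_bins[1]                      # equality predicate -> a single bitmap
both = energy_mid & late                 # conjunction -> AND
not_mid = ALL_ROWS & ~energy_mid         # negation -> NOT, masked to NUM_ROWS

print(rows(both), rows(not_mid))
```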
7 Operation-efficient Compression Methods
- Best known: byte-aligned bitmap code (BBC)
- Uses run-length encoding (next slide)
- Byte alignment, optimized for space efficiency
- Decoding on bit level, not optimal for operations
- Used in Oracle
- We developed a new word-aligned scheme: WAH (word-aligned hybrid code)
- Uses run-length encoding
- Word alignment
- Designed for minimal decoding to gain speed
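A word-aligned scheme of this kind can be sketched as below. The layout follows the commonly published description of WAH with 32-bit words: a literal word holds 31 bits verbatim, and a fill word stores a fill bit plus a count of identical 31-bit groups. The 32-bit word size, the function name `wah_encode`, and the zero-padding of the last group are assumptions of this sketch, which is not the authors' implementation.

```python
GROUP = 31  # payload bits per 32-bit word (word size is an assumption)

def wah_encode(bits):
    """Encode a list of 0/1 values into 32-bit WAH-style words.

    Literal word: top bit 0, low 31 bits hold one group verbatim.
    Fill word:    top bit 1, next bit is the fill value, low 30 bits count groups.
    The last group is zero-padded here; a full implementation would track
    the true bit length separately.
    """
    words = []
    for start in range(0, len(bits), GROUP):
        chunk = bits[start:start + GROUP]
        chunk = chunk + [0] * (GROUP - len(chunk))
        value = int("".join(map(str, chunk)), 2)
        if value == 0 or value == (1 << GROUP) - 1:
            fill_bit = 1 if value else 0
            header = (1 << 31) | (fill_bit << 30)
            same_fill = bool(words) and (words[-1] >> 30) == (header >> 30)
            if same_fill and (words[-1] & 0x3FFFFFFF) < 0x3FFFFFFF:
                words[-1] += 1            # extend the previous fill of the same kind
            else:
                words.append(header | 1)  # start a new fill covering one group
        else:
            words.append(value)           # mixed group: store as a literal word
    return words

# 62 zeros, one mixed group, then 31 ones: four groups become three words.
bits = [0] * 62 + [1, 0, 1] + [0] * 28 + [1] * 31
print([hex(w) for w in wah_encode(bits)])
```

Because every word is either a whole literal group or a whole run of groups, logical operations can work word by word without bit-level decoding, which is the design goal stated above.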
8 Operation-efficient Compression Methods
Based on variations of Run Length Compression
Uncompressed: 0000000000001111000000000......0000001000000001111111100000000....000000
Compressed: 12, 4, 1000, 1, 8, 1000
Store very short sequences as-is
Advantage: Can perform AND, OR, COUNT operations on compressed data
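The claim that AND and COUNT can run directly on compressed data can be illustrated with plain run-length encoding. The sketch below is neither BBC nor WAH. It stores each bitmap as a list of (bit, run_length) pairs and walks two run lists in step, never expanding them. All function names are hypothetical.

```python
def rle_encode(bits):
    """Run-length encode a 0/1 sequence as a list of (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def rle_and(a, b):
    """AND two run lists of equal total length without expanding them."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)        # longest stretch where both runs are constant
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

def rle_count(runs):
    """COUNT: number of 1 bits, computed from run lengths only."""
    return sum(length for bit, length in runs if bit == 1)

x = rle_encode([0] * 12 + [1] * 4 + [0] * 1000)
y = rle_encode([0] * 14 + [1] * 1002)
print(x, y, rle_and(x, y), rle_count(rle_and(x, y)))
```

The cost of `rle_and` depends on the number of runs, not the number of bits, so long fills such as the 1000-bit run are handled in a single step.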
9 Trade-off of Compression Schemes
[Figure: trade-off of compression schemes; labels in the plot include "uncompressed" and "WAH"]
10 Information About the Test Machines
- Hardware and system
- Sun Enterprise 450 (UltraSPARC II, 400 MHz)
- 4 GB RAM
- VERITAS Volume Manager (striped disk)
- Real application data from STAR
- Above 2 million objects, 12 attributes
- Synthetic data
- 100 million objects, 10 attributes
- Terms
- Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
- Times reported are wall-clock times in seconds
11 Logical Operation Time (Synthetic Data): 10X improvement
12 Logical Operation Time (STAR Data): Also 10X improvement
13 Encoding Schemes: Main Idea
[Figure: interval encoding, range encoding, and equality encoding illustrated over 12 bins, numbered 1 to 12]
Interval and range encoding operate on 2 bitmaps only!
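The difference between the three encodings can be sketched as below for 12 bins (numbered from 0 in the code). Equality encoding keeps one bitmap per bin, range encoding keeps bitmap j for bins 0..j, and interval encoding keeps bitmaps over sliding windows of about half the bins. The window layout used here (width m = ceil(n/2), with n - m + 1 bitmaps) is a simplification assumed for illustration, and published definitions may differ in detail. All names and the sample data are hypothetical.

```python
def bins_mask(rows_bins, lo, hi):
    """Bitmap of rows whose bin number lies in [lo, hi] (built by direct scan)."""
    mask = 0
    for i, b in enumerate(rows_bins):
        if lo <= b <= hi:
            mask |= 1 << i
    return mask

def build(rows_bins, n):
    m = (n + 1) // 2  # window width for the interval encoding (assumption)
    equality = [bins_mask(rows_bins, j, j) for j in range(n)]
    range_ = [bins_mask(rows_bins, 0, j) for j in range(n)]
    interval = [bins_mask(rows_bins, j, j + m - 1) for j in range(n - m + 1)]
    return equality, range_, interval, m

def query_equality(eq, lo, hi):
    result = 0
    for j in range(lo, hi + 1):  # one bitmap per bin in the queried range
        result |= eq[j]
    return result

def query_range(rng, lo, hi):
    # rng[hi] covers bins 0..hi; clear bins 0..lo-1 using rng[lo-1]
    return rng[hi] if lo == 0 else rng[hi] & ~rng[lo - 1]

def query_interval(iv, m, n, lo, hi):
    # Every case touches at most two window bitmaps.
    if hi - lo + 1 >= m:
        return iv[lo] | iv[hi - m + 1]       # two windows cover the range
    if hi < m - 1:
        return iv[lo] & ~iv[hi + 1]          # range near the low end
    if lo > n - m:
        return iv[hi - m + 1] & ~iv[lo - m]  # range near the high end
    return iv[hi - m + 1] & iv[lo]           # overlap of two windows

N_BINS = 12
rows_bins = [0, 3, 5, 5, 7, 11, 2, 9]  # hypothetical bin number of each row
eq, rng, iv, m = build(rows_bins, N_BINS)
answer = query_interval(iv, m, N_BINS, 2, 6)  # rows whose bin is in 2..6
print(bin(answer), len(eq), len(rng), len(iv))
```

In this sketch a query over bins 2..6 needs five equality bitmaps but only two range or interval bitmaps, and the interval encoding stores 7 bitmaps instead of 12.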
14 Total Effect of Compression and Encoding Schemes
- Bottom line on queries
- Compression scheme determines efficiency of logical operations
- Encoding scheme determines number of operations
- Range, interval: only one logical operation over 2 bitmaps
- Equality: many operations, depending on number of bins
- But space may be a consideration
- What is the trade-off?
15 Interval Encoding Is Better Overall (WAH Compression)
Points on the graphs represent 10, 20, 30, 50, and 100 bins.
Average time for random range queries
16 Timing Results
Method                               Index size (x data size)  Time (sec)  Speed (vs. native scan)
ORACLE, scan                         0                         6           0.1
ORACLE, B-tree                       3.6                       0.95        0.6
Native vertical partition, scan      0                         0.57        1
Native vertical partition, 20 bins   0.18                      0.11        5
Native vertical partition, 50 bins   0.43                      0.07        8
Native vertical partition, 100 bins  0.90                      0.05        11
17 Summary
- Compressed bitmap indices are effective for range queries
- Better compression scheme
- 50% more space, but 12 times faster!
- Among the different encoding schemes
- The interval encoding is the overall winner
18 Future Work
- Support NULL values and categorical values
- On-line update: add new data and update index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology (microarrays)
- Study non-uniform binning strategies
- Study more encoding schemes
- Integrate with conventional database system to better handle metadata and to provide a more versatile front-end
19 How Many Bins for Continuous Domains?
More bins: fewer objects in edge bins
Searching edge bins: skip-scan over the attribute's vertical partition
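The role of edge bins can be sketched as follows. Bins that lie entirely inside the query range are answered from their bitmaps alone, while the partly covered edge bins require checking the raw attribute values of the objects they contain. The data, bin boundaries, and function names are hypothetical, and values are assumed to lie within [edges[0], edges[-1]). The `checked` counter shows how many raw values had to be read.

```python
import bisect

def build(values, edges):
    """bitmaps[j] has bit i set iff edges[j] <= values[i] < edges[j + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        bitmaps[bisect.bisect_right(edges, v) - 1] |= 1 << i
    return bitmaps

def range_query(values, bitmaps, edges, low, high):
    """Rows with low <= v < high, plus the number of raw values inspected."""
    hits = 0
    checked = 0
    for j, bitmap in enumerate(bitmaps):
        b_lo, b_hi = edges[j], edges[j + 1]
        if low <= b_lo and b_hi <= high:
            hits |= bitmap               # bin fully inside the range: all rows qualify
        elif b_lo < high and low < b_hi:
            i, rest = 0, bitmap          # edge bin: check its rows against raw values
            while rest:
                if rest & 1:
                    checked += 1
                    if low <= values[i] < high:
                        hits |= 1 << i
                rest >>= 1
                i += 1
    return hits, checked

values = [5, 12, 18, 23, 27, 31, 38]
coarse = list(range(0, 41, 10))          # 4 bins of width 10
fine = list(range(0, 41, 5))             # 8 bins of width 5

hits4, checked4 = range_query(values, build(values, coarse), coarse, 15, 35)
hits8, checked8 = range_query(values, build(values, fine), fine, 15, 35)
print(bin(hits4), checked4, bin(hits8), checked8)
```

Both bin counts return the same rows, but the finer binning leaves fewer objects in edge bins and so reads fewer raw values, at the price of more bitmaps to store.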