Title: A Quantization Based Framework for Scientific Data Management
1A Quantization Based Framework for Scientific
Data Management
- Guadalupe Canahuate
- The Ohio State University
- Adviser: Hakan Ferhatosmanoglu
2HEP Example
Online System
Data Files
Workstations
3Bitmap Indices
- Data is quantized into b categories using b bits
- Each tuple is encoded based on which category its attribute belongs to
- Bitmap Equality Encoded (BEE)
- (Projection Index, Binning)
- suited for point queries
- Bitmap Range Encoded (BRE)
- suited for range queries
- Fast Bit Operations over bit vectors (AND, OR,
NOT, XOR)
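A minimal sketch of the two encodings, assuming each bitmap is stored as a Python integer with one bit per row. The helper names and the sample data are hypothetical, not taken from the slides.

```python
def build_equality_bitmaps(values, num_bins):
    """Equality encoding: bitmap j has bit r set when row r falls in bin j."""
    bitmaps = [0] * num_bins
    for row, bin_id in enumerate(values):
        bitmaps[bin_id] |= 1 << row
    return bitmaps


def build_range_bitmaps(values, num_bins):
    """Range encoding: bitmap j has bit r set when row r's bin is <= j."""
    bitmaps = [0] * num_bins
    for row, bin_id in enumerate(values):
        for j in range(bin_id, num_bins):
            bitmaps[j] |= 1 << row
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


bins = [2, 0, 1, 2, 0, 1]            # bin of one attribute for six rows
eq = build_equality_bitmaps(bins, 3)
rg = build_range_bitmaps(bins, 3)

point_query = eq[1]                  # attribute falls in bin 1
range_query_eq = eq[0] | eq[1]       # bins 0..1 needs one OR per bin
range_query_rg = rg[1]               # bins 0..1 is a single bitmap
```

The range query touches one bitmap under range encoding but one bitmap per bin under equality encoding, which is why each encoding suits a different query type.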
4Bitmap Compression
- Run-length encoders
- Runs of 0s → (0, run-count)
- Byte-aligned Bitmap Code (BBC), Oracle '94
- Word-Aligned Hybrid Code (WAH), LBNL '02
- No need to decompress for query processing
- Full scan of the bitmap is needed
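A sketch of why run-length encoded bitmaps can be combined without decompression. This uses plain (bit, run-count) pairs as an assumption; it is not the BBC or WAH format, which pack runs into bytes or words.

```python
def rle_encode(bits):
    """Collapse a bit sequence into [bit, run_count] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs


def rle_and(a, b):
    """AND two run-length encoded bitmaps of equal length, run by run."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0
    left_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)       # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step
        else:
            out.append([bit, step])
        left_a -= step
        left_b -= step
        if left_a == 0:                  # advance to the next run of a
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:                  # advance to the next run of b
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out


x = [0, 0, 0, 0, 0, 0, 1, 1]
y = [0, 0, 1, 1, 1, 1, 1, 0]
result = rle_and(rle_encode(x), rle_encode(y))
```

The loop still walks every run from the start, which matches the point above that a full scan of the bitmap is needed.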
5Applications
- Data warehouses
- Scientific data
- High-energy physics: simulations are continuously run, and notable events are stored with all the details.
- Climate modeling: sensor data.
- Astro-physics: telescopes devoted to observations.
- Visualization applications
Supernova Observatory
6Visualization Applications
- Millions of different readings ordered by their geographic location
- Users ask range queries over some of the readings for a given area
- The answers are highlighted on the screen
- Several degrees of resolution make approximate answers acceptable
7Our Goal
- Build a quantization framework
- Handle the intrinsic characteristics of scientific data
- Missing data
- Large Data Volume
- Efficiently support current queries
- Subsetting
- Support new types of queries over the quantized data
- Nearest Neighbor Queries
8Framework Overview
9Handling the characteristics of the data
Missing Data Handler Query Semantic Component
Guadalupe Canahuate, Michael Gibas, Hakan Ferhatosmanoglu, "Indexing Incomplete Databases", International Conference on Extending Data Base Technology (EDBT), Munich, Germany, March 2006, pp. 884-901
10Motivation
- Incomplete databases are common in all research and industry domains
- Missing values can be filled in by using statistical analysis
- However, in many cases, missingness is not ignorable
- We want to be able to distinguish between the real values and the absence of such values.
11Dealing with Missing Data
- NULL values need to be translated into indexable values
- Distinguished value outside the attribute domain
- Map missing to 0 or -1
- A high percentage of missing data results in highly skewed data
- Traditional Hierarchical Multidimensional Indexes break down
- Query needs to be translated into an exponential number of subqueries
- R-Trees: 2d, 10% missing, 23 times worse
12Query Semantics
- Missing is a match
- In a patient database, retrieve the records whose first name is John and middle name is Paul.
- Missing is not a match
- In a census database, retrieve the interviewees that answered C in question 4 and A in question 8
13Quantization
- Avoid decomposing the query into an exponential number of subqueries
- Map missing values to a distinguished value in the index
- Performance does not degrade with the skewness introduced by the mapping function
14Bitmaps EE - Handling Missing Data
Equality Encoded
- No degradation in compression
- Need to be able to identify the missing values
- Construct a bitmap for missing values
- NOT operation over non-missing bitmaps includes
missing records
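A small sketch of the equality encoded case, assuming integer bit vectors and a hypothetical six-row attribute. It shows the extra missing bitmap, both query semantics, and why a NOT over a non-missing bitmap picks up the missing records.

```python
NUM_ROWS = 6
ALL_ROWS = (1 << NUM_ROWS) - 1       # mask with one bit per row

# One attribute with three categories; None marks a missing value.
values = [0, None, 2, 1, None, 0]

category = [0, 0, 0]                 # equality encoded bitmaps
missing = 0                          # extra bitmap for missing values
for row, v in enumerate(values):
    if v is None:
        missing |= 1 << row
    else:
        category[v] |= 1 << row


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [r for r in range(NUM_ROWS) if (bitmap >> r) & 1]


# Query: attribute == 0
not_a_match = category[0]            # missing rows excluded
is_a_match = category[0] | missing   # missing rows included

# Query: attribute != 0. A plain NOT also turns the missing rows on,
# so they are masked out when missing is not a match.
negated = ~category[0] & ALL_ROWS
negated_not_a_match = negated & ~missing
```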
15Bitmaps EE - Querying Missing Data
16Bitmaps RE - Handling Missing Data
Range Encoded
- Missing should be the smallest value
- Largest value would require more operations to produce the final result
- For a missing record, the corresponding bit is 1 in all bitmaps
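A sketch of the range encoded case under the same assumptions (integer bit vectors, hypothetical data). Missing is mapped to code 0, the smallest value, so the bitmap for 0 doubles as the missing bitmap and a range query needs no extra operation to drop missing rows.

```python
NUM_ROWS = 6
NUM_CATEGORIES = 3
values = [1, None, 3, 2, None, 1]    # categories 1..3; None is missing

# R[j] has bit r set when the code of row r is <= j; missing has code 0.
R = [0] * (NUM_CATEGORIES + 1)
for row, v in enumerate(values):
    code = 0 if v is None else v
    for j in range(code, NUM_CATEGORIES + 1):
        R[j] |= 1 << row             # a missing row is set in all bitmaps


def range_query(lo, hi, missing_is_match):
    """Rows with lo <= value <= hi, for 1 <= lo <= hi <= NUM_CATEGORIES."""
    result = R[hi] & ~R[lo - 1]      # missing rows cancel out here
    if missing_is_match:
        result |= R[0]               # R[0] is exactly the missing rows
    return [r for r in range(NUM_ROWS) if (result >> r) & 1]
```

Had missing been mapped to the largest value instead, every range query would need an additional step to remove or add the missing rows.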
17Bitmaps RE - Querying Missing Data
18Experimental Results
- Over real and synthetic datasets
- With missing data, query execution time remains
linear in the number of dimensions queried
19Handling the characteristics of the data
Large Volume of Data Improved Compression
Follow up work from A. Pinar, T. Tao, and H.
Ferhatosmanoglu. Compressing bitmap indices by
data reorganization. ICDE 2005.
20Tuple Reordering Problem
- NP-Complete
- Most TSP heuristics are ineffective
- Used a 2-switch greedy heuristic where edge weights are computed on the fly
- Repeatedly seeks a pair of vertices that decrease the solution cost when they switch positions
- Restricted to only those pairs within a given distance
- Need more efficiency and scalability
21Solution
- Minimize the Hamming distance of adjacent numbers
- Gray code: a space-filling curve for Hamming space
- A scalable, in-place algorithm to gray-sort the tuples
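A sketch of Gray-code ordering, assuming each tuple is a short tuple of bits. It relies on Python's built-in sort with a Gray-rank key rather than the in-place algorithm mentioned above, so it only illustrates the effect on adjacent Hamming distances.

```python
def gray_rank(bits):
    """Position of a bit tuple in the reflected Gray code sequence."""
    rank = 0
    acc = 0
    for bit in bits:                 # most significant bit first
        acc ^= bit                   # prefix XOR inverts the Gray transform
        rank = (rank << 1) | acc
    return rank


def adjacent_hamming(rows):
    """Total number of bit flips between consecutive rows."""
    return sum(
        sum(x != y for x, y in zip(a, b))
        for a, b in zip(rows, rows[1:])
    )


# All 3-bit tuples in lexicographic order.
rows = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
        (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
gray_sorted = sorted(rows, key=gray_rank)
```

Fewer bit flips between neighboring rows means longer runs in each bitmap column, which is what the run-length encoders exploit.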
23 Reorder (A, start, end, b)
  i ← start
  j ← end
  while i < j do
    Decrement j until S(j, Cols[b]) = 0
    Increment i until S(i, Cols[b]) = 1
    if i < j then
      Swap the ith and jth tuples on the table
    end if
  end while
  if b < no_of_bits then
    Cols[b+1] ← nextCol()
    Reorder (A, start, j, b+1)
    Reorder (A, j+1, end, b+1)
    Reverse (j+1, end)
  end if
Input Bitmaps → Column Ordering → Row Ordering → Run Length Encoding → Compressed Bitmaps
24Column Ordering Criteria
- Using Bitmap Data
- Number of Set Bits
- Static: Most 1s, Asc/Desc
- Dynamic: Most 1s in the Longest Run
- Compressibility
- Static
- Dynamic: Weighted Sum of Compressibility over the Gray code segments
25Experimental Results
- 2-10x improvement in compression over already compressed bitmaps
- 4-7x improvement in query execution time
Chart legend: WAH, Gray Code, Gray Code Col Order
26Column Ordering Criteria
- Using Query Access Patterns
- Most frequently accessed columns first
- Performance
- QHF following a Zipf distribution
- Query Execution Time
- Original bitmaps: 9.5 seconds
- Reordered bitmaps: 0.063 seconds
- Speedup: 150 times
27Effective support for existing queries
Direct Access over Bitmap Indices
Tan Apaydin, Guadalupe Canahuate, Hakan Ferhatosmanoglu, Ali Saman Tosun, "Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps", International Conference on Very Large Data Bases, Seoul, Korea, September 2006, pp. 846-857
28Background
- Bitmaps need to be compressed
- Row numbers no longer correspond to the bit position in the bitmap
- Queries over a few particular rows
- As expensive as queries asking for all the rows
- Commonly, users are only interested in a small subset of the dataset at a time.
- For example
- A query over the experiments of the last 7 days
- Spatial queries over objects in a specific geographical area
29Our Goal
- Enable direct access over any subset of the bitmap
- Achieve effective compression
- Maintain bitwise operations for query execution
- Trade-off accuracy vs. efficiency
- No false negatives
30The approach
- Our solution is inspired by Bloom Filters
- A $2^m$-bit array indexed using k independent hash functions
- A data object is inserted by setting the k positions in the array corresponding to the hash values of the object
- False positives can happen, but false negatives cannot
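A minimal Bloom filter sketch for reference. The way the k hash values are derived here (SHA-256 over a salted key) is an assumption made for the example, not something specified on the slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.size = 1 << m           # 2^m bits
        self.k = k
        self.bits = 0                # bit array held in one integer

    def _positions(self, item):
        # Assumption: k hash values obtained by salting SHA-256 with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True for every inserted item; may also be True for others.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
inserted = [f"obj{i}" for i in range(50)]
for obj in inserted:
    bf.add(obj)
```

Every inserted object tests positive because its k bits were set on insertion; an object that was never inserted tests positive only if other objects happened to set all of its k bits.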
31Approximate Bitmaps (AB)
- A bloom filter-like structure
- Only the set bits are inserted into the AB
- Three levels of encoding
- Per table, per attribute, per bitmap column
- Parameters
- The hash string mapping function, F
- The k hash functions, H1(x), ..., Hk(x)
- The size of the AB, $n = as = 2^m$
- Precision in terms of a and k, $1 - (1 - e^{-k/a})^k$
32AB Example
- A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories.
- Bitmap Table Size = 72 bits
- Number of set bits = 24
- F(i,j) = concatenate(i,j) = x
- H1(x) = x mod 32
- m = 5
- AB Size = $2^5$ = 32 bits
33AB Example - Insertion
- Initially all bits in the AB are zero
- To insert set bit in (1,1)
34AB Example - Insertion
- To insert set bit in (1,1)
- x = 11
- H(11) = 11 mod 32 = 11
- AB(11) = 1
35AB Example - Insertion
- To insert set bit in (5,4)
- x = 54
- H(54) = 54 mod 32 = 22
- AB(22) = 1
36AB Example - Insertion
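The two insertions from the example can be replayed as follows. This is a sketch that stores the AB as a Python list; the function names F and H1 follow the example's parameters.

```python
M = 5
AB_SIZE = 1 << M                     # 2^5 = 32 bits


def F(i, j):
    """Hash string mapping: concatenate row and column numbers."""
    return int(f"{i}{j}")


def H1(x):
    """Single hash function of the example."""
    return x % AB_SIZE


ab = [0] * AB_SIZE                   # initially all bits are zero
for cell in [(1, 1), (5, 4)]:        # the two set bits inserted above
    ab[H1(F(*cell))] = 1
```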
37AB Example - Analysis
- Estimated Precision
- a = AB Size / Set Bits
- a = 32/24 = 1.33
- k = 1
- FP = $1 - e^{-k/a}$
- P = 1 - FP
- P = $1 - (1 - e^{-1/1.33})$
- P = 47%
- The underlined positions are false positives
- Only 8 out of the 48 zeros are set in the AB
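The estimate can be checked numerically. The function below is a sketch that uses the general form of the precision expression in terms of a and k; its name is hypothetical.

```python
import math


def estimated_precision(ab_size, set_bits, k):
    """Precision 1 - FP, with FP = (1 - e^(-k/a))^k and a = ab_size / set_bits."""
    a = ab_size / set_bits
    false_positive_rate = (1 - math.exp(-k / a)) ** k
    return 1 - false_positive_rate


p = estimated_precision(ab_size=32, set_bits=24, k=1)
observed = 1 - 8 / 48                # 8 of the 48 zeros come back as set
```

For this example the observed precision is higher than the estimate.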
38AB Example - Retrieval
- Consider this query, asking for 4 rows
- Row 4
- (4,7): H(47) = 15
- AB(15) = 0
- (4,8): H(48) = 16
- AB(16) = 1
- Row 5
- (5,7): H(57) = 25
- AB(25) = 1
- Stop
- This is a range query over 4 rows, where the third attribute falls into C1 or C2
39AB Example - Retrieval
- Consider this query, asking for 4 rows
- Row 6
- (6,7): H(67) = 3
- AB(3) = 1
- Stop
- Approx Query Answer
- 1,1,1,0
- Exact Answer
- 0,1,1,0
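A sketch of the retrieval step over a full table. The bitmap table of the example is not reproduced in the text, so the category assignments below are hypothetical and the answers will not match the slide. The query shape (rows 4 to 7, columns 7 and 8) and the hash function are the ones used above.

```python
AB_SIZE = 32                         # 2^5 bits, as in the example


def position(i, j):
    """H(F(i, j)): concatenate row and column, then take mod AB_SIZE."""
    return int(f"{i}{j}") % AB_SIZE


# Hypothetical data: category (1..3) of each of the 3 attributes, rows 1..8.
categories = {1: (1, 2, 3), 2: (2, 2, 1), 3: (3, 1, 1), 4: (1, 3, 3),
              5: (2, 1, 1), 6: (3, 3, 2), 7: (1, 1, 3), 8: (2, 3, 2)}

# Column j = 3 * attribute_index + category, so columns run from 1 to 9.
set_cells = {(row, 3 * t + c)
             for row, cats in categories.items()
             for t, c in enumerate(cats)}

ab = [0] * AB_SIZE
for i, j in set_cells:               # only the set bits are inserted
    ab[position(i, j)] = 1


def approx_query(rows, columns):
    """1 for a row as soon as any queried cell hits a set AB bit."""
    return [int(any(ab[position(i, j)] for j in columns)) for i in rows]


def exact_query(rows, columns):
    """Ground truth computed from the uncompressed set cells."""
    return [int(any((i, j) in set_cells for j in columns)) for i in rows]


rows, columns = [4, 5, 6, 7], [7, 8]  # third attribute in C1 or C2
approx = approx_query(rows, columns)
exact = exact_query(rows, columns)
```

The generator inside `any` stops at the first set bit, mirroring the "Stop" step. Only the queried rows are probed, and every row in the exact answer also appears in the approximate one.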
40Approximate Bitmaps (AB) Hash Functions
- Single Hash Function
- Called once and the result is divided into pieces.
- Each piece considered as the value of a different hash function.
- Secure Hash Algorithm (SHA), developed by National Institute of Standards and Technology (NIST)
- Multiple Hash Functions
- Independent hash functions
- For large number, similar performance

Hash Function | H0               | H1               | H2               | ... | H9
Bits          | 159..144         | 143..128         | 127..112         | ... | 15..0
SHA Output    | 0100100010001010 | 1000010100100001 | 0111100011100010 | ... | 0000010101110011
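The single-hash-function scheme in the table can be sketched with Python's `hashlib`. SHA-1 is assumed here because its 160-bit output matches the bit ranges shown; the function name and the sample key are hypothetical.

```python
import hashlib


def sha1_hash_values(key, k=10, bits_per_piece=16):
    """Split one 160-bit SHA-1 digest into k pieces; H0 takes the top bits."""
    digest = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
    mask = (1 << bits_per_piece) - 1
    pieces = []
    for i in range(k):
        shift = 160 - bits_per_piece * (i + 1)   # H0 -> bits 159..144
        pieces.append((digest >> shift) & mask)
    return pieces


values = sha1_hash_values("11")      # key for cell (1,1) after concatenation
```

Each 16-bit piece would then be reduced to the AB size, for example by taking it modulo $2^m$.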
41Approximate Bitmaps (AB) FP Rate
- FP Rate: probability that all k bits are set by another data object
- n is the size of the AB
- s is the number of set bits
- n = as, a = n/s
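The formula itself did not survive in the transcript. The standard Bloom filter approximation, written with the symbols defined above, is given here as an assumption about what the slide showed; it agrees with the precision expression $1 - (1 - e^{-k/a})^k$ used for the AB parameters.

```latex
\mathrm{FP}
  = \left(1 - \left(1 - \frac{1}{n}\right)^{ks}\right)^{k}
  \approx \left(1 - e^{-ks/n}\right)^{k}
  = \left(1 - e^{-k/a}\right)^{k},
  \qquad a = \frac{n}{s}
```

The first term is the probability that a given bit is still zero after s insertions with k hash functions each; a false positive needs all k probed bits to be set.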
42Experimental Setup
- Query by sampling (randomly selecting the columns queried)
- Varying the number of rows queried from 100 to 10K
43Experimental Results - Size
- Always use the max a that produces a smaller or
comparable AB than WAH
44Experimental Results - Exec Time
- Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset
- For queries over less than 10-15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH
45New Types of Queries
Nearest Neighbor Queries over Bitmap Indices
Work in progress. Joint work with Michael Gibas
46General Idea
- Pre-compute the solution space (approximately)
- For each bitmap
- Tuple is 1 if the data object is a NN for any point in the hyperspace defined by the bitmap
- Otherwise, 0
47General Idea
- To execute a query
- Given a query point Q (d dimensions)
- Find the d bitmaps where Q falls into
- AND together the d-bitmaps
- The intersection is the set of candidate NNs
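A sketch of the query step only, assuming the per-bin NN bitmaps have already been precomputed. The bin boundaries, bitmap contents, and function names are all hypothetical values chosen for illustration.

```python
NUM_OBJECTS = 5

# Hypothetical precomputed index over 5 objects and 2 dimensions.
boundaries = [[0.5], [0.5]]          # upper bounds: two bins per dimension
nn_bitmaps = [
    [0b00111, 0b11100],              # dimension 0: bins 0 and 1
    [0b01011, 0b10110],              # dimension 1: bins 0 and 1
]


def bin_of(value, uppers):
    """Index of the bin holding value, given sorted upper boundaries."""
    for idx, upper in enumerate(uppers):
        if value <= upper:
            return idx
    return len(uppers)


def candidate_nns(query):
    """AND together the d bitmaps of the bins the query point falls into."""
    result = (1 << NUM_OBJECTS) - 1
    for dim, q in enumerate(query):
        result &= nn_bitmaps[dim][bin_of(q, boundaries[dim])]
    return [obj for obj in range(NUM_OBJECTS) if (result >> obj) & 1]
```

For instance, `candidate_nns((0.2, 0.9))` intersects the first bitmap of dimension 0 with the second bitmap of dimension 1; the surviving objects would still need an exact distance check.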
48Preliminary Results
- Encouraging for both index size and execution time when compared to R-Trees.
- The fast bitwise operations outperform the bound computations in the R-Tree
49Conclusion
- We have addressed some of the issues that arise when developing a framework for a specific domain
- For scientific applications
- Handling missing data
- Large volume of data - improve storage requirement
- Efficient (approx) subsetting of the data
- NN-Queries
- Update efficient structures
50Questions and Comments
Email canahuat_at_cse.ohio-state.edu