A Quantization Based Framework for Scientific Data Management - PowerPoint PPT Presentation

About This Presentation

Title:

A Quantization Based Framework for Scientific Data Management

Description:

Data is quantized into b categories using b bits ... No need to decompress for query processing. Full scan of the bitmap is needed. 5. Applications ... – PowerPoint PPT presentation

Slides: 51

Provided by: guadalupe

Transcript and Presenter's Notes

Title: A Quantization Based Framework for Scientific Data Management

1
A Quantization Based Framework for Scientific
Data Management

Guadalupe Canahuate
The Ohio State University
Adviser: Hakan Ferhatosmanoglu

2
HEP Example
Online System
Data Files
Workstations
3
Bitmap Indices

Data is quantized into b categories using b bits
Each tuple is encoded based on which category its
attribute belongs to
Bitmap Equality Encoded (BEE)
(Projection Index, Binning)
suited for point queries
Bitmap Range Encoded (BRE)
suited for range queries
Fast Bit Operations over bit vectors (AND, OR,
NOT, XOR)
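
The two encodings above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: Python integers stand in for bit vectors, and the column data and the helper names (`equality_encode`, `range_encode`, `rows_of`) are assumptions made for the example.

```python
def equality_encode(values, num_categories):
    """One bit vector per category; bit i is set if row i falls in that category."""
    bitmaps = [0] * num_categories
    for row, category in enumerate(values):
        bitmaps[category] |= 1 << row
    return bitmaps


def range_encode(values, num_categories):
    """Bit vector j has bit i set if row i's category is <= j."""
    bitmaps = [0] * num_categories
    for row, category in enumerate(values):
        for j in range(category, num_categories):
            bitmaps[j] |= 1 << row
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical column: each entry is the category (0..3) of one row.
categories = [2, 0, 3, 1, 2, 0]
bee = equality_encode(categories, 4)
bre = range_encode(categories, 4)

point_query = rows_of(bee[2])                # category == 2
range_query_bee = rows_of(bee[1] | bee[2])   # 1 <= category <= 2, one OR per category
range_query_bre = rows_of(bre[2] & ~bre[0])  # same range using only two bitmaps
```

With equality encoding, a range query needs one OR per category in the range; with range encoding it needs at most two bitmaps, which is why BRE suits range queries.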

4
Bitmap Compression

Run-length encoders
Runs of 0s → (0, run-count)
Byte-aligned Bitmap Code (BBC), Oracle '94
Word-Aligned Hybrid Code (WAH), LBNL '02
No need to decompress for query processing
Full scan of the bitmap is needed
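
As a rough illustration of the run-length idea only, the sketch below encodes a bit vector as (bit, run-count) pairs. It is plain run-length encoding, not BBC or WAH, which additionally align runs to byte or word boundaries; the function names and the sample bits are assumptions.

```python
from itertools import groupby


def run_length_encode(bits):
    """Collapse each run of identical bits into a (bit, run-count) pair."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def run_length_decode(runs):
    """Expand (bit, run-count) pairs back into the original bit list."""
    return [bit for bit, count in runs for _ in range(count)]


# Hypothetical sparse bitmap: long runs of 0s with a few set bits.
bits = [0] * 12 + [1] + [0] * 7 + [1, 1]
runs = run_length_encode(bits)
```

Here 22 bits collapse into four pairs; the longer the runs, the better the compression, which is what motivates reordering the tuples.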

5
Applications

Data warehouses
Scientific data
High-energy physics: simulations are continuously
run, and notable events are stored with all the
details.
Climate modeling: sensor data.
Astrophysics: telescopes devoted to
observations.
Visualization applications

Supernova Observatory
6
Visualization Applications

Millions of different readings ordered by their
geographic location
Users ask range queries over some of the readings
for a given area
The answers are highlighted on the screen
Several degrees of resolution make approximate
answers acceptable

7
Our Goal

Build a quantization framework
Handle the intrinsic characteristics of
scientific data
Missing data
Large Data Volume
Efficiently support current queries
Subsetting
Support new types of queries over the quantized
data
Nearest Neighbor Queries

8
Framework Overview
9
Handling the characteristics of the data
Missing Data Handler Query Semantic Component
Guadalupe Canahuate, Michael Gibas, Hakan
Ferhatosmanoglu, "Indexing Incomplete Databases",
International Conference on Extending Database
Technology (EDBT), Munich, Germany, March 2006,
pp. 884-901
10
Motivation

Incomplete databases are common in all research
and industry domains
Missing values can be filled in by using
statistical analysis
However, in many cases, missingness is not
ignorable
We want to be able to distinguish between the
real values and the absence of such values.

11
Dealing with Missing Data

NULL values need to be translated into
indexable values
Distinguished value outside the attribute domain
Map missing to 0 or -1
A high percentage of missing data results in
highly skewed data
Traditional Hierarchical Multidimensional Indexes
break down
Query needs to be translated into exponential
number of subqueries
R-Trees: 2d, 10% missing, 23 times worse

12
Query Semantics

Missing is a match
In a patient database retrieve the records whose
first name is John and middle name Paul.
Missing is not a match
In a census database retrieve the interviewee
that answered C in question 4 and A in
question 8

13
Quantization

Avoid decomposing the query into exponential
number of subqueries
Map missing values to a distinguished value in
the index
Performance does not degrade with the skewness
introduced by the mapping function

14
Bitmaps EE: Handling Missing Data
Equality Encoded

No degradation in compression
Need to be able to identify the missing values
Construct a bitmap for missing values
NOT operation over non-missing bitmaps includes
missing records
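
A small sketch of the points above, under assumed data: a separate bitmap records the missing rows, so a NOT over a value bitmap (which also selects missing rows) can be corrected when the query semantics say missing is not a match. The column contents and variable names are hypothetical.

```python
NUM_ROWS = 6
ALL_ROWS = (1 << NUM_ROWS) - 1  # mask that keeps NOT results within the table

# Hypothetical attribute with 3 categories; None marks a missing value.
column = [0, None, 2, 1, None, 0]

value_bitmaps = [0, 0, 0]
missing_bitmap = 0
for row, value in enumerate(column):
    if value is None:
        missing_bitmap |= 1 << row
    else:
        value_bitmaps[value] |= 1 << row


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(NUM_ROWS) if (bitmap >> i) & 1]


# Query "attribute != 0": the NOT also selects the missing rows 1 and 4.
not_zero = ~value_bitmaps[0] & ALL_ROWS
missing_is_match = rows_of(not_zero)
missing_not_match = rows_of(not_zero & ~missing_bitmap)
```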

15
Bitmaps EE: Querying Missing Data
16
Bitmaps RE: Handling Missing Data
Range Encoded

Missing should be the smallest value
Largest value would require more operations to
produce the final result
For missing record, corresponding bit is 1 in all
bitmaps
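
The range-encoded case can be sketched the same way. The example assumes missing is mapped to the smallest rank, so a missing record has its bit set in every bitmap; the layout (index 0 for the missing bitmap, index c+1 for "value <= c") and the data are illustrative assumptions.

```python
NUM_ROWS = 6
NUM_CATEGORIES = 3

# Hypothetical attribute with categories 0..2; None marks a missing value.
column = [1, None, 2, 0, None, 1]

# bitmaps[0] is the missing bitmap; bitmaps[c + 1] means "value <= c or missing".
bitmaps = [0] * (NUM_CATEGORIES + 1)
for row, value in enumerate(column):
    rank = 0 if value is None else value + 1
    for j in range(rank, NUM_CATEGORIES + 1):
        bitmaps[j] |= 1 << row


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(NUM_ROWS) if (bitmap >> i) & 1]


# Query "value <= 1" under both semantics.
missing_is_match = rows_of(bitmaps[2])
missing_not_match = rows_of(bitmaps[2] & ~bitmaps[0])
```

With missing as the smallest value, excluding the missing rows costs a single extra AND-NOT with the first bitmap.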

17
Bitmaps RE: Querying Missing Data
18
Experimental Results

Over real and synthetic datasets
With missing data, query execution time remains
linear in the number of dimensions queried

19
Handling the characteristics of the data
Large Volume of Data Improved Compression
Follow up work from A. Pinar, T. Tao, and H.
Ferhatosmanoglu. Compressing bitmap indices by
data reorganization. ICDE 2005.
20
Tuple Reordering Problem

NP-Complete
Most TSP heuristics are ineffective
Used a 2-switch greedy heuristic where edge
weights are computed on the fly
Repeatedly seeks a pair of vertices that decrease
the solution when they switch positions
Restricted to only those pairs within a distance
Need more efficiency and scalability

21
Solution

Minimize the hamming distance of adjacent numbers
Gray code: a space-filling curve for Hamming
space
A scalable, in-place algorithm to gray-sort the
tuples
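
To see why Gray-code order helps, the sketch below sorts a tiny table by Gray-code rank and counts the runs per column. It uses an ordinary key-based sort rather than the in-place algorithm of the talk, and the table, `gray_rank`, and `count_runs` are assumptions made for the illustration.

```python
def gray_rank(bits):
    """Position of a bit tuple in the reflected Gray code (prefix-XOR decoding)."""
    rank = 0
    parity = 0
    for bit in bits:
        parity ^= bit
        rank = (rank << 1) | parity
    return rank


def count_runs(rows):
    """Total number of runs of identical bits, summed over all columns."""
    runs = 0
    for column in zip(*rows):
        runs += 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)
    return runs


# Hypothetical bitmap table: one tuple of bits per row.
rows = [(1, 0), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]

gray_sorted = sorted(rows, key=gray_rank)
lexicographic = sorted(rows)

runs_original = count_runs(rows)
runs_lexicographic = count_runs(lexicographic)
runs_gray = count_runs(gray_sorted)
```

On this table the run count drops from 9 (original) to 6 (lexicographic) to 5 (Gray order), since adjacent Gray-code words differ in a single bit.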

22
(No Transcript)
23
Reorder(A, start, end, b)
  i ← start
  j ← end
  while i < j do
    Decrement j until S(j, Cols[b]) = 0
    Increment i until S(i, Cols[b]) = 1
    if i < j then
      Swap the i-th and j-th tuples on the table
    end if
  end while
  if b < no_of_bits then
    Cols[b+1] ← nextCol()
    Reorder(A, start, j, b+1)
    Reorder(A, j+1, end, b+1)
    Reverse(j+1, end)
  end if
Pipeline: Input Bitmaps → Column Ordering → Row Ordering → Run Length Encoding → Compressed Bitmaps
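
One possible Python rendering of the Reorder procedure is sketched below. It is an interpretation, not the authors' code: columns are taken in their natural order instead of being chosen by nextCol(), rows are lists of 0/1 values, and bounds checks are added so the scans stop at the segment edges.

```python
def reorder(table, start, end, b=0):
    """Gray-sort table[start..end] (inclusive) in place, from bit column b onward."""
    if start >= end or b >= len(table[0]):
        return
    i, j = start, end
    while True:
        while i <= end and table[i][b] == 0:    # advance i to the first 1
            i += 1
        while j >= start and table[j][b] == 1:  # retreat j to the last 0
            j -= 1
        if i >= j:
            break
        table[i], table[j] = table[j], table[i]
    # Now table[start..j] has 0 in column b and table[j+1..end] has 1.
    reorder(table, start, j, b + 1)
    reorder(table, j + 1, end, b + 1)
    # Reflect the second half so the order follows the reflected Gray code.
    table[j + 1:end + 1] = table[j + 1:end + 1][::-1]


# Hypothetical bitmap table with two bit columns.
table = [[1, 0], [0, 0], [1, 1], [0, 1], [1, 0], [0, 0]]
reorder(table, 0, len(table) - 1)
```

The partition step mirrors quicksort on a single bit, so each column costs one pass over the segment and no extra copy of the table is needed apart from the reversed slice.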
24
Column Ordering Criteria

Using Bitmap Data
Number of Set Bits
Static: most 1s, ascending/descending
Dynamic: most 1s in the longest run
Compressibility
Static
Dynamic: weighted sum of compressibility over the
Gray code segments

25
Experimental Results
WAH

2-10x improvement in compression over already
compressed bitmaps
4-7x improvement in query execution time

Gray Code
Gray Code Col Order
26
Column Ordering Criteria

Using Query Access Patterns
Most frequently accessed columns first
Performance
QHF following a Zipf distribution
Query Execution Time
Original bitmaps: 9.5 seconds
Reordered bitmaps: 0.063 seconds
Speedup: 150 times

27
Effective support for existing queries
Direct Access over Bitmap Indices
Tan Apaydin, Guadalupe Canahuate, Hakan
Ferhatosmanoglu, Ali Saman Tosun, "Approximate
Encoding for Direct Access and Query Processing
over Compressed Bitmaps", International
Conference on Very Large Data Bases, Seoul,
Korea, September 2006, pp. 846-857
28
Background

Bitmaps need to be compressed
Row numbers no longer correspond to the bit
position in the bitmap
Queries over few particular rows
As expensive as queries asking for all the rows
Commonly, users are only interested in a small
subset of the dataset at a time.
For example
A query over the experiments of the last 7 days
Spatial queries over objects in a specific
geographical area

29
Our Goal

Enable direct access over any subset of the
bitmap
Achieve effective compression
Maintain bitwise operations for query execution
Trade-off accuracy vs. efficiency
No false negatives

30
The approach

Our solution is inspired by Bloom Filters
A $2^m$-bit array indexed using k independent hash
functions
A data object is inserted by setting the k
positions in the array corresponding to the hash
values of the object
False positives can happen, but false negatives
cannot
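
A minimal Bloom filter sketch follows, to make the insert and lookup steps concrete. The k hash functions are simulated by seeding SHA-1 with the function index, which is a stand-in chosen for the example rather than anything prescribed by the talk; the class name and sample keys are likewise assumptions.

```python
import hashlib


class BloomFilter:
    """A 2**m-bit array probed by k hash functions."""

    def __init__(self, m, k):
        self.size = 1 << m
        self.k = k
        self.bits = 0  # Python int used as the bit array

    def _positions(self, item):
        # Assumption: seeded SHA-1 stands in for k independent hash functions.
        for seed in range(self.k):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> p) & 1 for p in self._positions(item))


bloom = BloomFilter(m=10, k=3)
inserted = ["row1:col1", "row5:col4", "row7:col2"]
for key in inserted:
    bloom.add(key)
```

Every inserted key tests positive because all of its k positions were set on insertion; a key that was never inserted tests positive only if other keys happen to have set all k of its positions.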

31
Approximate Bitmaps (AB)

A bloom filter-like structure
Only the set bits are inserted into the AB
Three levels of encoding
Per table, per attribute, per bitmap column
Parameters
The hash string mapping function, F
The k hash functions, H1(x), ..., Hk(x)
The size of the AB, $n = \alpha s = 2^m$
Precision in terms of $\alpha$ and $k$: $1 - (1 - e^{-k/\alpha})^k$

32
AB Example

A bitmap table for a dataset with 8 rows and 3
attributes. Each attribute is divided into 3
categories.
Bitmap table size: 72 bits
Number of set bits: 24
F(i,j) = concatenate(i,j) = x
H1(x) = x mod 32
m = 5
AB size: $2^5$ = 32 bits

33
AB Example - Insertion

Initially all bits in the AB are zero
To insert set bit in (1,1)

34
AB Example - Insertion

To insert set bit in (1,1)
x = 11
H(11) = 11 mod 32 = 11
AB(11) = 1

35
AB Example - Insertion

To insert set bit in (5,4)
x = 54
H(54) = 54 mod 32 = 22
AB(22) = 1
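
The two insertions walked through above can be reproduced with a few lines. The sketch uses the example's own F (concatenate row and column) and H (x mod 32); only the two set bits shown on the slides are inserted, since the full bitmap table is not part of the transcript.

```python
AB_BITS = 32  # 2**5, as in the example


def F(i, j):
    """Hash string mapping: concatenate row i and column j into one number."""
    return int(f"{i}{j}")


def H(x):
    """The single hash function of the example."""
    return x % AB_BITS


ab = [0] * AB_BITS  # initially all bits are zero


def insert(i, j):
    ab[H(F(i, j))] = 1


insert(1, 1)  # x = 11, H(11) = 11
insert(5, 4)  # x = 54, H(54) = 22
```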

36
AB Example - Insertion

After all insertions

37
AB Example - Analysis

Estimated Precision
$\alpha$ = AB size / set bits
$\alpha = 32/24 = 1.33$
$k = 1$
$FP = 1 - e^{-k/\alpha}$ (for $k = 1$)
$P = 1 - FP$
$P = 1 - (1 - e^{-1/1.33})$
$P \approx 47\%$
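
The estimate can be checked numerically. The helper below is a hypothetical name; it evaluates the precision expression from the parameters slide, 1 - (1 - e^(-k/α))^k, for the example's values.

```python
import math


def ab_precision(ab_size, set_bits, k):
    """Estimated precision 1 - (1 - exp(-k / alpha)) ** k, with alpha = ab_size / set_bits."""
    alpha = ab_size / set_bits
    false_positive_rate = (1 - math.exp(-k / alpha)) ** k
    return 1 - false_positive_rate


precision = ab_precision(ab_size=32, set_bits=24, k=1)  # about 0.47
```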

The underlined positions are false positives
Only 8 out of the 48 zeros are set in the AB

38
AB Example - Retrieval

Consider this query, asking for 4 rows

Row 4
(4,7): H(47) = 15
AB(15) = 0
(4,8): H(48) = 16
AB(16) = 1
Row 5
(5,7): H(57) = 25
AB(25) = 1
Stop

This is a range query over 4 rows, where the third
attribute falls into C1 or C2

39
AB Example - Retrieval

Consider this query, asking for 4 rows

Row 6
(6,7): H(67) = 3
AB(3) = 1
Stop
Approx Query Answer
1,1,1,0
Exact Answer
0,1,1,0
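
The retrieval walk-through can be sketched as a per-row probe that stops at the first hit. Only the AB positions whose values appear on the slides are encoded here (bits 3, 16, and 25 set, bit 15 clear); treating every other position as 0 and taking the queried rows to be 4 through 7 are assumptions, since the full AB and the query figure are not in the transcript.

```python
AB_BITS = 32

# Known probe results from the slides; all other positions assumed to be 0.
ab = [0] * AB_BITS
for position in (3, 16, 25):
    ab[position] = 1


def probe(row, column):
    """Look up cell (row, column) via F = concatenate and H = x mod 32."""
    return ab[int(f"{row}{column}") % AB_BITS]


def query(rows, columns):
    """A row matches if any queried column is set; any() stops at the first hit."""
    return [int(any(probe(r, c) for c in columns)) for r in rows]


# Third attribute in C1 or C2 corresponds to bitmap columns 7 and 8.
approximate_answer = query(rows=[4, 5, 6, 7], columns=[7, 8])
```

The cost depends only on the number of cells probed, not on the size of the dataset, and a set bit can be a false positive, as for row 4 in the example.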

40
Approximate Bitmaps (AB) Hash Functions

Single Hash Function
Called once and the result is divided into
pieces.
Each piece considered as the value of a different
hash function.
Secure Hash Algorithm (SHA), developed by
National Institute of Standards and Technology
(NIST)
Multiple Hash Functions
Independent hash functions
For large number, similar performance

Hash Function   H0                 H1                 H2                 ...   H9
Bits            159..144           143..128           127..112           ...   15..0
SHA Output      0100100010001010   1000010100100001   0111100011100010   ...   0000010101110011
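
The single-hash-function scheme can be sketched with Python's hashlib. SHA-1 yields 160 bits, which split into ten 16-bit pieces exactly as in the layout above; the function name and the way the key is turned into bytes are assumptions for the example.

```python
import hashlib

PIECE_BITS = 16
NUM_PIECES = 10  # 160-bit SHA-1 output / 16 bits per piece


def split_hash(x):
    """Call SHA-1 once and return the pieces H0..H9 (H0 = bits 159..144)."""
    digest = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    mask = (1 << PIECE_BITS) - 1
    return [
        (digest >> (160 - PIECE_BITS * (i + 1))) & mask
        for i in range(NUM_PIECES)
    ]


pieces = split_hash(54)  # hypothetical key, e.g. F(5, 4)
```

Each piece can then be reduced to an AB position, for example by taking it modulo the AB size, so one hash call serves all k probes.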
41
Approximate Bitmaps (AB) FP Rate

FP Rate Probability that all k bits are set by
another data object
n is the size of the AB
s is the number of set bits
$n = \alpha s$, $\alpha = n/s$
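
The formula itself did not survive in the transcript. Under the usual Bloom filter analysis, which assumes independent and uniformly distributed hash values, it can be reconstructed as follows, with $n$ the AB size, $s$ the number of set bits, $k$ the number of hash functions, and $\alpha = n/s$:

```latex
\begin{align*}
\Pr[\text{a given AB bit is still 0 after } s \text{ insertions}]
  &= \left(1 - \frac{1}{n}\right)^{ks} \approx e^{-ks/n} = e^{-k/\alpha} \\
\mathrm{FP}
  &= \left(1 - \left(1 - \frac{1}{n}\right)^{ks}\right)^{k}
   \approx \left(1 - e^{-k/\alpha}\right)^{k} \\
P &= 1 - \mathrm{FP} = 1 - \left(1 - e^{-k/\alpha}\right)^{k}
\end{align*}
```

The last line agrees with the precision expression given on the parameters slide.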

42
Experimental Setup

Three datasets

Query by sampling (randomly selecting the columns
queried)
Varying the number of rows queried from 100 to
10K

43
Experimental Results - Size

Always use the max a that produces a smaller or
comparable AB than WAH

44
Experimental Results Exec Time

Execution time of the AB depends on the number of
rows queried, not on the number of rows in the
dataset

For queries over less than 10-15% of the rows,
AB execution is up to 3 orders of magnitude
faster than WAH

45
New Types of Queries
Nearest Neighbor Queries over Bitmap Indices
Work in progress. Joint work with Michael Gibas
46
General Idea

Pre-compute the solution space (approximately)
For each bitmap
Tuple is 1 if the data object is a NN for any
point in the hyperspace defined by the bitmap
Otherwise, 0

47
General Idea

To execute a query
Given a query point Q (d dimensions)
Find the d bitmaps where Q falls into
AND together the d-bitmaps
The intersection is the set of candidate NNs
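
The query step can be sketched as below. The precomputed candidate bitmaps, the bin boundaries, and all names are hypothetical, since the talk describes this as work in progress and does not show how the bitmaps are built; only the "find the d bitmaps and AND them" step is illustrated.

```python
from bisect import bisect_right
from functools import reduce

# Hypothetical setup: 2 dimensions, 3 bins each, 4 data objects.
# cut_points[d] holds the interior bin boundaries of dimension d.
cut_points = [[10, 20], [5, 15]]

# candidate_bitmaps[d][b]: bit i is set if object i may be a nearest neighbor
# of some point whose coordinate in dimension d falls in bin b.
candidate_bitmaps = [
    [0b0011, 0b0110, 0b1100],
    [0b0101, 0b0111, 0b1010],
]


def candidate_nns(query_point):
    """AND together the d bitmaps of the bins that the query point falls into."""
    selected = [
        candidate_bitmaps[d][bisect_right(cut_points[d], value)]
        for d, value in enumerate(query_point)
    ]
    result = reduce(lambda a, b: a & b, selected)
    return [i for i in range(result.bit_length()) if (result >> i) & 1]


candidates = candidate_nns((12, 3))  # bins 1 and 0
```

The candidates would still need an exact distance check, since the precomputation is approximate.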

48
Preliminary Results

Encouraging for both index size and execution
time when compared to R-Trees.
The fast bitwise operations outperform the bound
computations in the R-Tree

49
Conclusion

We have addressed some of the issues that arise
when developing a framework for a specific domain
For scientific applications
Handling missing data
Large volume of data - improve storage
requirement
Efficient (approx) subsetting of the data
NN-Queries
Update efficient structures

50
Questions and Comments