A Quantization Based Framework for Scientific Data Management

Title: A Quantization Based Framework for Scientific Data Management


1
A Quantization Based Framework for Scientific
Data Management
  • Guadalupe Canahuate
  • The Ohio State University
  • Adviser: Hakan Ferhatosmanoglu

2
HEP Example
Online System
Data Files
Workstations
3
Bitmap Indices
  • Data is quantized into b categories using b bits
  • Each tuple is encoded based on which category its
    attribute belongs to
  • Bitmap Equality Encoded (BEE)
  • (Projection Index, Binning)
  • suited for point queries
  • Bitmap Range Encoded (BRE)
  • suited for range queries
  • Fast Bit Operations over bit vectors (AND, OR,
    NOT, XOR)
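
The two encodings can be illustrated with a minimal Python sketch. It assumes a single attribute that has already been quantized into 4 categories and stores each bitmap as a Python integer; the function names and the sample data are hypothetical and not taken from the presentation.

```python
# Hypothetical data: one attribute, already quantized into categories 0..3.
values = [2, 0, 3, 1, 2, 0]
num_categories = 4

def equality_encode(values, num_categories):
    """bitmaps[c] has bit r set iff row r falls in category c."""
    bitmaps = [0] * num_categories
    for row, cat in enumerate(values):
        bitmaps[cat] |= 1 << row
    return bitmaps

def range_encode(values, num_categories):
    """bitmaps[c] has bit r set iff row r's category is <= c."""
    bitmaps = [0] * num_categories
    for row, cat in enumerate(values):
        for c in range(cat, num_categories):
            bitmaps[c] |= 1 << row
    return bitmaps

def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]

bee = equality_encode(values, num_categories)
bre = range_encode(values, num_categories)

point_query = rows_of(bee[2])           # value == 2: one bitmap
range_bee = rows_of(bee[1] | bee[2])    # 1 <= value <= 2: one OR per category
range_bre = rows_of(bre[2] & ~bre[0])   # 1 <= value <= 2: always two bitmaps
```

With equality encoding a range query ORs one bitmap per category in the range, while range encoding needs at most two bitmaps regardless of the range width, which is why each encoding suits the query type listed above.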

4
Bitmap Compression
  • Run-length encoders
  • Runs of 0s → (0, run-count)
  • Byte-aligned Bitmap Code (BBC), Oracle '94
  • Word-Aligned Hybrid Code (WAH), LBNL '02
  • No need to decompress for query processing
  • Full scan of the bitmap is needed
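
As a rough illustration of why run-length compressed bitmaps can be queried without decompression, the sketch below ANDs two run lists directly. It uses plain (bit, run-length) pairs, which is a simplification and not the BBC or WAH format; all names are hypothetical.

```python
from itertools import groupby

def rle_encode(bits):
    """Encode a list of 0/1 bits as (bit, run_length) pairs."""
    return [(b, len(list(g))) for b, g in groupby(bits)]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a list of bits."""
    return [b for b, n in runs for _ in range(n)]

def rle_and(a, b):
    """AND two run lists of equal total length without decompressing them."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)          # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:     # merge with the previous output run
            out[-1] = (bit, out[-1][1] + step)
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

x = [0, 0, 0, 0, 1, 1, 0, 0]
y = [0, 1, 1, 1, 1, 0, 0, 0]
result = rle_and(rle_encode(x), rle_encode(y))
```

The loop still walks both run lists from start to end, which matches the slide's point that a full scan of the bitmap is needed even though no decompression takes place.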

5
Applications
  • Data warehouses
  • Scientific data
  • High-energy physics: simulations are continuously
    run, and notable events are stored with all the
    details.
  • Climate modeling: sensor data.
  • Astrophysics: telescopes devoted to
    observations.
  • Visualization applications

Supernova Observatory
6
Visualization Applications
  • Millions of different readings ordered by their
    geographic location
  • Users ask range queries over some of the readings
    for a given area
  • The answers are highlighted on the screen
  • Several degrees of resolution make approximate
    answers acceptable

7
Our Goal
  • Build a quantization framework
  • Handle the intrinsic characteristics of
    scientific data
  • Missing data
  • Large Data Volume
  • Efficiently support current queries
  • Subsetting
  • Support new types of queries over the quantized
    data
  • Nearest Neighbor Queries

8
Framework Overview
9
Handling the characteristics of the data
Missing Data Handler, Query Semantic Component
Guadalupe Canahuate, Michael Gibas, Hakan
Ferhatosmanoglu, "Indexing Incomplete Databases",
International Conference on Extending Database
Technology (EDBT), Munich, Germany, March 2006,
pp. 884-901
10
Motivation
  • Incomplete databases are common in all research
    and industry domains
  • Missing values can be filled in by using
    statistical analysis
  • However, in many cases, missingness is not
    ignorable
  • We want to be able to distinguish between the
    real values and the absence of such values.

11
Dealing with Missing Data
  • NULL values need to be translated into
    indexable values
  • Distinguished value outside the attribute domain
  • Map missing to 0 or -1
  • A high percentage of missing data results in highly
    skewed data
  • Traditional Hierarchical Multidimensional Indexes
    break down
  • Query needs to be translated into exponential
    number of subqueries
  • R-Trees: 2d, 10% missing, 23 times worse

12
Query Semantics
  • Missing is a match
  • In a patient database, retrieve the records whose
    first name is John and middle name is Paul.
  • Missing is not a match
  • In a census database, retrieve the interviewees
    who answered C to question 4 and A to
    question 8

13
Quantization
  • Avoid decomposing the query into exponential
    number of subqueries
  • Map missing values to a distinguished value in
    the index
  • Performance does not degrade with the skewness
    introduced by the mapping function

14
Bitmaps EE: Handling Missing Data
Equality Encoded
  • No degradation in compression
  • Need to be able to identify the missing values
  • Construct a bitmap for missing values
  • NOT operation over non-missing bitmaps includes
    missing records
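
A minimal sketch of the extra missing-value bitmap for equality encoding, covering both query semantics. It assumes NULL is represented by Python's None and that bitmaps are stored as integers; the helper names are hypothetical.

```python
def build_bee_with_missing(values, num_categories):
    """Equality-encoded bitmaps plus one extra bitmap marking missing rows."""
    bitmaps = [0] * num_categories
    missing = 0
    for row, v in enumerate(values):
        if v is None:                      # None stands in for NULL (assumption)
            missing |= 1 << row
        else:
            bitmaps[v] |= 1 << row
    return bitmaps, missing

def equality_query(bitmaps, missing, category, missing_is_match):
    """Rows whose value equals `category` under the chosen semantics."""
    result = bitmaps[category]
    if missing_is_match:
        result |= missing
    return result

def not_equal_query(bitmaps, missing, category, num_rows, missing_is_match):
    """Rows whose value differs from `category` under the chosen semantics."""
    all_rows = (1 << num_rows) - 1
    result = all_rows & ~bitmaps[category]  # NOT also selects the missing rows
    if not missing_is_match:
        result &= ~missing                  # so mask them out explicitly
    return result

values = [1, None, 0, 1, None, 2]
bitmaps, missing = build_bee_with_missing(values, 3)
eq_strict = equality_query(bitmaps, missing, 1, missing_is_match=False)
eq_loose = equality_query(bitmaps, missing, 1, missing_is_match=True)
ne_loose = not_equal_query(bitmaps, missing, 1, len(values), missing_is_match=True)
ne_strict = not_equal_query(bitmaps, missing, 1, len(values), missing_is_match=False)
```

The last two calls show the point made on the slide: negating a non-missing bitmap silently includes the missing rows, so the missing bitmap is needed to remove them when missing is not a match.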

15
Bitmaps EE: Querying Missing Data
16
Bitmaps RE: Handling Missing Data
Range Encoded
  • Missing should be the smallest value
  • Largest value would require more operations to
    produce the final result
  • For a missing record, the corresponding bit is 1 in
    all bitmaps
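
One way to read this slide is sketched below, under the assumption that missing is coded as 0 and real categories as 1..C, so that bitmap j marks rows whose code is at most j. The function names are hypothetical, and the sketch keeps the last (all ones) bitmap for simplicity even though an index would usually not store it.

```python
def build_bre_with_missing(values, num_categories):
    """values hold categories 1..num_categories, or None for missing.
    bitmaps[j] has bit r set iff row r's code <= j, with missing coded as 0,
    so a missing row has its bit set in every bitmap."""
    bitmaps = [0] * (num_categories + 1)
    for row, v in enumerate(values):
        code = 0 if v is None else v
        for j in range(code, num_categories + 1):
            bitmaps[j] |= 1 << row
    return bitmaps

def range_query(bitmaps, lo, hi, missing_is_match):
    """Rows with lo <= value <= hi, for 1 <= lo <= hi."""
    # Removing bitmaps[lo - 1] also removes the missing rows at no extra cost.
    result = bitmaps[hi] & ~bitmaps[lo - 1]
    if missing_is_match:
        result |= bitmaps[0]               # bitmaps[0] is exactly the missing rows
    return result

values = [2, None, 1, 3, None, 2]
bre = build_bre_with_missing(values, 3)
strict = range_query(bre, 2, 3, missing_is_match=False)
loose = range_query(bre, 2, 3, missing_is_match=True)
```

Because the lower-bound bitmap already contains every missing row, the "missing is not a match" case needs no additional operation in this sketch, which is consistent with the slide's preference for mapping missing to the smallest value.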

17
Bitmaps RE: Querying Missing Data
18
Experimental Results
  • Over real and synthetic datasets
  • With missing data, query execution time remains
    linear in the number of dimensions queried

19
Handling the characteristics of the data
Large Volume of Data: Improved Compression
Follow-up work to A. Pinar, T. Tao, and H.
Ferhatosmanoglu, "Compressing bitmap indices by
data reorganization", ICDE 2005.
20
Tuple Reordering Problem
  • NP-Complete
  • Most TSP heuristics are ineffective
  • Used a 2-switch greedy heuristic where edge
    weights are computed on the fly
  • Repeatedly seeks a pair of vertices that decreases
    the solution cost when they switch positions
  • Restricted to only those pairs within a given distance
  • Needs more efficiency and scalability

21
Solution
  • Minimize the Hamming distance of adjacent numbers
  • Gray code: a space-filling curve for Hamming
    space
  • A scalable, in-place algorithm to gray-sort the
    tuples

22
(No Transcript)
23
Reorder(A, start, end, b)
  i ← start
  j ← end
  while i < j do
    Decrement j until S(j, Cols[b]) = 0
    Increment i until S(i, Cols[b]) = 1
    if i < j then
      Swap the i-th and j-th tuples in the table
    end if
  end while
  if b < no_of_bits then
    Cols[b+1] ← nextCol()
    Reorder(A, start, j, b+1)
    Reorder(A, j+1, end, b+1)
    Reverse(j+1, end)
  end if
Input Bitmaps → Column Ordering → Row Ordering →
Run-Length Encoding → Compressed Bitmaps
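
The recursive reordering can be sketched in Python as follows. This is a minimal illustration, assuming each tuple is a list of 0/1 bits and that columns are visited left to right; the column selection done by nextCol() in the slides is not modeled, and the name gray_sort is hypothetical.

```python
def gray_sort(rows, start=0, end=None, b=0):
    """Reorder rows[start:end] into reflected Gray-code order, using column b
    as the current bit and the following columns for the recursive calls."""
    if end is None:
        end = len(rows)
    if end - start <= 1 or b >= len(rows[start]):
        return
    # Partition on bit b: rows with a 0 come first, rows with a 1 come after.
    i, j = start, end - 1
    while True:
        while i <= j and rows[i][b] == 0:
            i += 1
        while i <= j and rows[j][b] == 1:
            j -= 1
        if i >= j:
            break
        rows[i], rows[j] = rows[j], rows[i]
    # Now rows[start:i] have bit b == 0 and rows[i:end] have bit b == 1.
    gray_sort(rows, start, i, b + 1)
    gray_sort(rows, i, end, b + 1)
    rows[i:end] = rows[i:end][::-1]   # reflect the second half

rows = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0],
        [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 0]]
gray_sort(rows)
```

After the call, consecutive tuples differ in exactly one bit, which lengthens the runs in each bitmap column and is what helps the run-length encoders downstream.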
24
Column Ordering Criteria
  • Using Bitmap Data
  • Number of Set Bits
  • Static: most 1s, ascending/descending
  • Dynamic: most 1s in the longest run
  • Compressibility
  • Static
  • Dynamic: weighted sum of compressibility over the
    Gray code segments

25
Experimental Results
WAH
  • 2-10x improvement in compression over already
    compressed bitmaps
  • 4-7x improvement in query execution time

Gray Code
Gray Code Col Order
26
Column Ordering Criteria
  • Using Query Access Patterns
  • Most frequent accessed columns first
  • Performance
  • QHF following a Zipf distribution
  • Query Execution Time
  • Original bitmaps: 9.5 seconds
  • Reordered bitmaps: 0.063 seconds
  • Speedup: 150 times

27
Effective support for existing queries
Direct Access over Bitmap Indices
Tan Apaydin, Guadalupe Canahuate, Hakan
Ferhatosmanoglu, Ali Saman Tosun, "Approximate
Encoding for Direct Access and Query Processing
over Compressed Bitmaps", International
Conference on Very Large Data Bases (VLDB), Seoul,
Korea, September 2006, pp. 846-857
28
Background
  • Bitmaps need to be compressed
  • Row numbers no longer correspond to the bit
    position in the bitmap
  • Queries over a few particular rows are as expensive
    as queries asking for all the rows
  • Commonly, users are only interested in a small
    subset of the dataset at a time.
  • For example
  • A query over the experiments of the last 7 days
  • Spatial queries over objects in a specific
    geographical area

29
Our Goal
  • Enable direct access over any subset of the
    bitmap
  • Achieve effective compression
  • Maintain bitwise operations for query execution
  • Trade-off accuracy vs. efficiency
  • No false negatives

30
The approach
  • Our solution is inspired by Bloom Filters
  • A $2^m$-bit array indexed using k independent hash
    functions
  • A data object is inserted by setting the k
    positions in the array corresponding to the hash
    values of the object
  • False positives can happen, but false negatives
    cannot
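
For readers unfamiliar with Bloom filters, here is a minimal Python sketch. It assumes salted SHA-256 digests as a stand-in for the k independent hash functions and stores one byte per bit for readability; the class name and parameters are hypothetical.

```python
import hashlib

class BloomFilter:
    """A 2^m-bit array probed by k hash functions."""

    def __init__(self, m, k):
        self.size = 1 << m
        self.k = k
        self.bits = bytearray(self.size)   # one byte per bit, for clarity only

    def _positions(self, item):
        # Salting one hash with the function index approximates k
        # independent hash functions (an assumption of this sketch).
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=10, k=3)
for key in ("event-17", "event-42"):
    bf.add(key)
```

Every inserted key sets all of its k positions, so a lookup for an inserted key can never fail; only keys that were not inserted can be reported by mistake.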

31
Approximate Bitmaps (AB)
  • A bloom filter-like structure
  • Only the set bits are inserted into the AB
  • Three levels of encoding
  • Per table, per attribute, per bitmap column
  • Parameters
  • The hash string mapping function, F
  • The k hash functions, H1(x), ..., Hk(x)
  • The size of the AB, $n = \alpha s = 2^m$, where s is the number of set bits
  • Precision in terms of α and k: $P = 1 - (1 - e^{-k/\alpha})^k$

32
AB Example
  • A bitmap table for a dataset with 8 rows and 3
    attributes. Each attribute is divided into 3
    categories.
  • Bitmap Table Size: 72 bits
  • Number of set bits: 24
  • F(i,j) = concatenate(i,j) = x
  • H1(x) = x mod 32
  • m = 5
  • AB Size = $2^5$ = 32 bits

33
AB Example - Insertion
  • Initially all bits in the AB are zero
  • To insert set bit in (1,1)

34
AB Example - Insertion
  • To insert set bit in (1,1)
  • x = 11
  • H(11) = 11 mod 32 = 11
  • AB(11) = 1

35
AB Example - Insertion
  • To insert set bit in (5,4)
  • x = 54
  • H(54) = 54 mod 32 = 22
  • AB(22) = 1
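
The two insertions can be replayed with a short Python sketch. It assumes F concatenates the decimal digits of the row and column numbers and uses the single hash function H1(x) = x mod 32 from the example; the helper names are hypothetical.

```python
M = 5
AB_SIZE = 1 << M          # 2^5 = 32 bits, as in the example

def F(i, j):
    """Hash string mapping: concatenate row i and column j, e.g. (5, 4) -> 54."""
    return int(f"{i}{j}")

def H1(x):
    """The single hash function used in the example."""
    return x % AB_SIZE

def insert(ab, i, j):
    """Record that cell (i, j) of the bitmap table is a set bit."""
    ab[H1(F(i, j))] = 1

ab = [0] * AB_SIZE        # initially all bits in the AB are zero
insert(ab, 1, 1)          # x = 11, H1(11) = 11
insert(ab, 5, 4)          # x = 54, H1(54) = 22
set_positions = [p for p, bit in enumerate(ab) if bit]
```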

36
AB Example - Insertion
  • After all insertions

37
AB Example - Analysis
  • Estimated Precision
  • α = AB size / set bits
  • α = 32/24 = 1.33
  • k = 1
  • FP = $(1 - e^{-k/\alpha})^k$
  • P = 1 - FP
  • P = $1 - (1 - e^{-1/1.33})$
  • P ≈ 47%
  • The underlined positions are false positives
  • Only 8 out of the 48 zeros are set in the AB
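
The estimate can be checked numerically with a few lines of Python. The function name is hypothetical, and the formula is the precision expression given earlier in the presentation.

```python
import math

def ab_precision(ab_size, set_bits, k):
    """Estimated precision P = 1 - (1 - e^(-k/alpha))^k, with alpha = ab_size / set_bits."""
    alpha = ab_size / set_bits
    false_positive_rate = (1 - math.exp(-k / alpha)) ** k
    return 1 - false_positive_rate

p = ab_precision(ab_size=32, set_bits=24, k=1)   # roughly 0.47
```

With k = 1 the expression reduces to e^(-1/α), so the estimate depends only on the ratio between the AB size and the number of set bits.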

38
AB Example - Retrieval
  • Consider this query, asking for 4 rows
  • Row 4
  • (4,7): H(47) = 15
  • AB(15) = 0
  • (4,8): H(48) = 16
  • AB(16) = 1
  • Row 5
  • (5,7): H(57) = 25
  • AB(25) = 1
  • Stop
  • This is a range query over 4 rows, where the third
    attribute falls into C1 or C2

39
AB Example - Retrieval
  • Consider this query, asking for 4 rows
  • Row 6
  • (6,7): H(67) = 3
  • AB(3) = 1
  • Stop
  • Approx Query Answer
  • 1,1,1,0
  • Exact Answer
  • 0,1,1,0
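
The retrieval walk-through can be sketched as follows. The AB state below is hypothetical and partial: it sets only the positions the slides report as set (3, 11, 16, 22, 25), and it assumes the query covers rows 4 to 7 and columns 7 and 8 (categories C1 and C2 of the third attribute). The function names are also assumptions.

```python
AB_SIZE = 32

def H(i, j):
    """Concatenate row and column, then hash with x mod 32, as in the example."""
    return int(f"{i}{j}") % AB_SIZE

def query_rows(ab, rows, columns):
    """For each row, OR the AB bits of the queried columns."""
    answer = []
    for i in rows:
        # any() stops at the first column whose AB bit is set ("Stop" in the slides).
        answer.append(1 if any(ab[H(i, j)] for j in columns) else 0)
    return answer

ab = [0] * AB_SIZE
for pos in (3, 11, 16, 22, 25):   # hypothetical partial state, see lead-in
    ab[pos] = 1

approx = query_rows(ab, rows=[4, 5, 6, 7], columns=[7, 8])
```

Only the hash positions of the queried cells are touched, so the cost grows with the number of rows requested rather than with the size of the dataset.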

40
Approximate Bitmaps (AB) Hash Functions
  • Single Hash Function
  • Called once and the result is divided into
    pieces.
  • Each piece considered as the value of a different
    hash function.
  • Secure Hash Algorithm (SHA), developed by
    National Institute of Standards and Technology
    (NIST)
  • Multiple Hash Functions
  • Independent hash functions
  • For large number, similar performance

Hash Function   H0                H1                H2                ...  H9
Bits            159..144          143..128          127..112          ...  15..0
SHA Output      0100100010001010  1000010100100001  0111100011100010  ...  0000010101110011
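
The single-hash-function approach can be sketched in Python with hashlib. The sketch assumes SHA-1, whose 160-bit output matches the bit ranges 159..0 in the table, and 16-bit pieces; the function name and the sample key are hypothetical.

```python
import hashlib

def split_sha1(key, k, piece_bits=16):
    """Call SHA-1 once and cut the digest into k pieces, one per hash function."""
    digest = hashlib.sha1(key.encode()).digest()       # 20 bytes = 160 bits
    total_bits = len(digest) * 8
    if k * piece_bits > total_bits:
        raise ValueError("not enough digest bits for k pieces")
    value = int.from_bytes(digest, "big")
    mask = (1 << piece_bits) - 1
    # Piece 0 is the most significant one (bits 159..144), as in the table.
    return [(value >> (total_bits - (i + 1) * piece_bits)) & mask
            for i in range(k)]

pieces = split_sha1("4,7", k=10)   # H0 .. H9
```

Each piece would then be reduced to an AB position, for example by keeping its low m bits; that last step is left out because the slides do not spell it out.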
41
Approximate Bitmaps (AB) FP Rate
  • FP rate: probability that all k bits are set by
    other data objects
  • n is the size of the AB
  • s is the number of set bits
  • n = αs, α = n/s
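
The formula itself did not survive in the transcript. Under the usual Bloom filter assumption that each hash function picks a position uniformly and independently, the standard derivation, written with the symbols defined above, would be:

```latex
\begin{align*}
\Pr[\text{a given bit is still 0 after inserting } s \text{ set bits}]
  &= \left(1 - \frac{1}{n}\right)^{ks} \approx e^{-ks/n} = e^{-k/\alpha} \\
\mathrm{FP} &\approx \left(1 - e^{-k/\alpha}\right)^{k} \\
P &= 1 - \mathrm{FP} = 1 - \left(1 - e^{-k/\alpha}\right)^{k}
\end{align*}
```

This agrees with the precision expression listed among the AB parameters and with the worked example, where α = 1.33 and k = 1 give P ≈ 47%.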

42
Experimental Setup
  • Three datasets
  • Query by sampling (randomly selecting the columns
    queried)
  • Varying the number of rows queried from 100 to
    10K

43
Experimental Results - Size
  • Always use the max a that produces a smaller or
    comparable AB than WAH

44
Experimental Results - Exec Time
  • Execution time of the AB depends on the number of
    rows queried, not on the number of rows in the
    dataset
  • For queries over less than 10-15% of the rows,
    AB execution is up to 3 orders of magnitude
    faster than WAH

45
New Types of Queries
Nearest Neighbor Queries over Bitmap Indices
Work in progress. Joint work with Michael Gibas.
46
General Idea
  • Pre-compute the solution space (approximately)
  • For each bitmap
  • A tuple's bit is 1 if the data object is a NN for any
    point in the hyperspace defined by the bitmap
  • Otherwise, 0

47
General Idea
  • To execute a query
  • Given a query point Q (d dimensions)
  • Find the d bitmaps where Q falls into
  • AND together the d-bitmaps
  • The intersection is the set of candidate NNs
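
The query steps can be sketched in Python as below. Everything in the sketch is hypothetical: the bin boundaries, the precomputed NN bitmaps (stored as integers, one per bin and dimension) and the function name are made up for illustration, since the presentation describes this as work in progress.

```python
import bisect
from functools import reduce

def candidate_nns(query, bin_edges, nn_bitmaps):
    """query: d coordinates.
    bin_edges[dim]: sorted upper bounds of the bins in that dimension.
    nn_bitmaps[dim][b]: bit r is set if object r is a NN of some point in bin b."""
    selected = []
    for dim, x in enumerate(query):
        b = bisect.bisect_left(bin_edges[dim], x)      # bin that contains x
        b = min(b, len(nn_bitmaps[dim]) - 1)           # clamp values past the last edge
        selected.append(nn_bitmaps[dim][b])
    # AND the d bitmaps; the surviving bits are the candidate NNs.
    return reduce(lambda a, c: a & c, selected)

# Hypothetical index: 2 dimensions, 2 bins each, 4 data objects.
bin_edges = [[0.5, 1.0], [0.5, 1.0]]
nn_bitmaps = [[0b0011, 0b1100],
              [0b0101, 0b1010]]

candidates = candidate_nns((0.2, 0.7), bin_edges, nn_bitmaps)
```

In this toy index the query point falls in the first bin of dimension 0 and the second bin of dimension 1, and the AND leaves a single candidate object.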

48
Preliminary Results
  • Encouraging for both index size and execution
    time when compared to R-Trees.
  • The fast bitwise operations outperform the bound
    computations in the R-Tree

49
Conclusion
  • We have addressed some of the issues that arise
    when developing a framework for a specific domain
  • For scientific applications
  • Handling missing data
  • Large volume of data - improve storage
    requirement
  • Efficient (approx) subsetting of the data
  • NN-Queries
  • Update efficient structures

50
Questions and Comments
  • Thank you!

Email: canahuat_at_cse.ohio-state.edu