Title: Indexing Incomplete Databases
1Indexing Incomplete Databases
- Guadalupe Canahuate
- Michael Gibas
- Hakan Ferhatosmanoglu
2Outline
- Motivation
- Current Approaches
- Problems
- Illustration
- Problem Definition
- Proposed Solutions
- Bitmaps
- VA-Files
- Experimental Results
- Conclusion
3Motivation
- Incomplete databases are common in all research domains
- In many cases, missingness is not ignorable
- We want to be able to distinguish between the real values and the absence of such values
- Query semantics: missing is a match, or missing is not a match
4Current Approaches (1)
- Map to a distinguished value (-1 for positive numbers)
- Problems
- Hierarchical indexes break down because of the skewness of the data
- Multiple one-dimensional indexes need to perform expensive set operations (union and intersection)
5Illustration - R-Trees
- With skewed data the bounding rectangles overlap more, making pruning of the tree inefficient
[Figures: 2-dim random data, no missing values; 2-dim random data, 40% missing]
6Current Approaches (2)
- Use a function to scatter the points with missing data over the data space
- Solves the skewness problem
- Problems
- Query needs to be decomposed into an exponential number of subqueries (exponential in the number of dimensions queried)
7Illustration - Bitstring-Augmented Tree
- The average of the non-missing values is used as a mapping function for the missing values
- A bitstring associated with each record indicates whether data is missing or not (points (.2,.2) and (.2,?) map to <(.2,.2),11> and <(.2,.2),10>)
- It is necessary to look at all possible combinations: 2^k subqueries and the bitstring
Q: 0.3 ≤ A1 ≤ 0.5 AND 0.6 ≤ A2 ≤ 0.8
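To make the 2^k blow-up concrete, the following is a minimal sketch of how the subqueries for Q could be enumerated. The function name `subqueries` and the dictionary layout are assumptions made for illustration, not taken from the slides. Each subquery pairs a bitstring with range constraints on the attributes that must be present; attributes marked 0 are required to be missing and carry no range constraint.

```python
from itertools import product

def subqueries(query):
    """Enumerate the 2^k (bitstring, ranges) subqueries of a k-attribute query.

    query maps attribute name -> (low, high). Bit 1 means the attribute must
    be present and inside its range; bit 0 means it must be missing.
    """
    attrs = sorted(query)
    result = []
    for bits in product("10", repeat=len(attrs)):
        ranges = {a: query[a] for a, b in zip(attrs, bits) if b == "1"}
        result.append(("".join(bits), ranges))
    return result

q = {"A1": (0.3, 0.5), "A2": (0.6, 0.8)}
subs = subqueries(q)
for bitstring, ranges in subs:
    print(bitstring, ranges)
```

With k = 2 this yields four subqueries (11, 10, 01, 00); each added query dimension doubles the count.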
8Problem Definition
- D = (A1, A2, ..., Ad)
- D is incomplete if tuples in it are allowed to have missing attribute values
- Domain(Ai) = 1..Ci, where Ci is the cardinality of Ai
- Query Q is a k-dimensional range interval
- Range interval: v1 ≤ Ai ≤ v2
- Missing is a match: tuple t is an answer for Q if every attribute of t in Q is either missing or falls in the corresponding range interval
- Missing is not a match: tuple t is an answer for Q if every attribute of t in Q is not missing and falls in the corresponding range interval
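The two semantics can be written as a single predicate. This is a minimal sketch; the name `is_answer`, the use of `None` for a missing value, and the sample data are assumptions for illustration.

```python
def is_answer(tuple_, query, missing_is_match):
    """Return True if tuple_ answers query under the chosen semantics.

    query maps attribute index -> (v1, v2); None marks a missing value.
    """
    for i, (v1, v2) in query.items():
        value = tuple_[i]
        if value is None:
            if not missing_is_match:
                return False  # a missing value disqualifies the tuple
        elif not (v1 <= value <= v2):
            return False      # a present value must fall in the interval
    return True

data = [(3, 5), (3, None), (None, None), (9, 5)]
query = {0: (2, 4), 1: (5, 6)}
match = [t for t in data if is_answer(t, query, True)]
no_match = [t for t in data if is_answer(t, query, False)]
print(match)
print(no_match)
```

Under "missing is a match" the answer set is always a superset of the answer set under "missing is not a match".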
9Proposed Solution
- Index each dimension independently
- Keep NULL values in the database
- Map missing values to a distinguished value in the index
- Use index structures that do not require set operations over the partial results
- Bitmaps
- VA-Files
- Specify interval evaluation rules for
- Missing Data is a Query Match
- Missing Data is not a Query Match
10Bitmaps - Overview
- Data is quantized into Categories
- Using b bits we have b bins
- Each tuple is encoded based on which category its attribute belongs to
- Equality Encoded (EE)
- suited for point queries
- Range Encoded (RE)
- suited for range queries
- Fast bit operations over bit vectors (AND, OR, NOT, XOR)
- Disadvantage: space requirements
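A small sketch of the two encodings for one attribute with values in 1..C. Python integers stand in for bit vectors (bit i belongs to tuple i); the function names are hypothetical, and range encoding is taken here to mean "bit set when value <= v", which is one common convention and an assumption about the slides.

```python
def equality_encode(values, cardinality):
    """bitmaps[v] has bit i set when tuple i has exactly value v."""
    bitmaps = {v: 0 for v in range(1, cardinality + 1)}
    for i, x in enumerate(values):
        bitmaps[x] |= 1 << i
    return bitmaps

def range_encode(values, cardinality):
    """bitmaps[v] has bit i set when tuple i has a value <= v."""
    bitmaps = {v: 0 for v in range(1, cardinality + 1)}
    for i, x in enumerate(values):
        for v in range(x, cardinality + 1):
            bitmaps[v] |= 1 << i
    return bitmaps

def rows(bitmap):
    """List the tuple positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

values = [2, 1, 3, 2, 3]
ee = equality_encode(values, 3)
re_ = range_encode(values, 3)
point = rows(ee[2])   # tuples with value == 2, read from one EE bitmap
upto = rows(re_[2])   # tuples with value <= 2, read from one RE bitmap
print(point, upto)
```

A point query touches one EE bitmap, while a one-sided range touches one RE bitmap, which is why each encoding suits its query type.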
11Bitmaps - Compression
- Run length compression
- Word-Aligned Hybrid (WAH)
- Two types of words: literal and fill words
- Faster query execution
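The literal/fill distinction can be illustrated with a simplified sketch. It is not a full WAH implementation: words are kept as symbolic tuples rather than packed machine words, and padding the last group with zeros is a simplification made here. With w-bit words, the bit vector is cut into groups of w - 1 bits; a group of identical bits becomes (or extends) a fill word, anything else is stored as a literal word.

```python
def wah_encode(bits, word_size=32):
    """Encode a '0'/'1' string into symbolic literal and fill words."""
    group_len = word_size - 1
    words = []
    for start in range(0, len(bits), group_len):
        # Simplification: pad a short final group with zeros.
        group = bits[start:start + group_len].ljust(group_len, "0")
        if group in ("0" * group_len, "1" * group_len):
            fill_bit = group[0]
            last = words[-1] if words else None
            if last and last[0] == "fill" and last[1] == fill_bit:
                words[-1] = ("fill", fill_bit, last[2] + 1)  # extend the run
            else:
                words.append(("fill", fill_bit, 1))
        else:
            words.append(("literal", group))
    return words

bits = "0" * 62 + "1" + "0" * 30 + "1" * 31
words = wah_encode(bits)
for word in words:
    print(word)
```

Here 124 bits collapse into three words: a fill of two all-zero groups, one literal, and a fill of one all-one group. Long runs, as in sparse bitmaps, are where the savings come from.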
12Bitmaps - Handling Missing Data
- Construct a bitmap for Missing Values
- For EE: treat it as any other value
- For RE: missing is the smallest value
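A sketch of both encodings with an extra bitmap for missing values. The code 0 for "missing" and the use of `None` in the input are assumptions for illustration; real values are 1..C. Python integers stand in for bit vectors, with bit i belonging to tuple i.

```python
MISSING = 0  # assumed code for a missing value; real values are 1..C

def equality_encode(values, cardinality):
    """bitmaps[0] marks missing tuples; bitmaps[v] marks tuples equal to v."""
    bitmaps = [0] * (cardinality + 1)
    for i, x in enumerate(values):
        code = MISSING if x is None else x
        bitmaps[code] |= 1 << i
    return bitmaps

def range_encode(values, cardinality):
    """bitmaps[j] marks tuples whose code is <= j, so missing is smallest."""
    bitmaps = [0] * (cardinality + 1)
    for i, x in enumerate(values):
        code = MISSING if x is None else x
        for j in range(code, cardinality + 1):
            bitmaps[j] |= 1 << i
    return bitmaps

values = [2, None, 3, 1, None]
ee = equality_encode(values, 3)
re_ = range_encode(values, 3)
print([format(b, "05b") for b in ee])
print([format(b, "05b") for b in re_])
```

In the range-encoded version the missing tuples are set in every bitmap, because the missing code is below every real value.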
13Bitmaps EE - Querying Missing Data
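The evaluation rules for this slide are not in the text, so the following is one plausible formulation rather than the authors' exact rule. It assumes bitmap 0 marks missing tuples and bitmap v marks tuples equal to v: OR the bitmaps in the interval, then OR in the missing bitmap only when missing counts as a match.

```python
def ee_range_query(bitmaps, v1, v2, missing_is_match):
    """Evaluate v1 <= A <= v2 over equality-encoded bitmaps of one attribute."""
    result = 0
    for v in range(v1, v2 + 1):
        result |= bitmaps[v]
    if missing_is_match:
        result |= bitmaps[0]  # bitmaps[0] is the assumed missing-value bitmap
    return result

def rows(bitmap):
    """List the tuple positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Equality-encoded bitmaps for the column [2, None, 3, 1, None], C = 3.
ee = [0b10010, 0b01000, 0b00001, 0b00100]
as_match = rows(ee_range_query(ee, 2, 3, True))
not_match = rows(ee_range_query(ee, 2, 3, False))
print(as_match, not_match)
```

For a query over several attributes, the per-attribute results would be combined with a bitwise AND under either semantics.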
14Bitmaps RE - Querying Missing Data
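As with the equality-encoded case, the rules for this slide are not in the text; this sketch is one plausible formulation. It assumes bitmap j marks tuples whose code is <= j and that code 0 means missing. Because bitmap v1 - 1 is a subset of bitmap v2, an XOR of the two leaves exactly the tuples inside the interval and drops the missing ones.

```python
def re_range_query(bitmaps, v1, v2, missing_is_match):
    """Evaluate v1 <= A <= v2 (with v1 >= 1) over range-encoded bitmaps."""
    # Tuples with code <= v2 minus tuples with code <= v1 - 1.
    result = bitmaps[v2] ^ bitmaps[v1 - 1]
    if missing_is_match:
        result |= bitmaps[0]  # bitmaps[0] holds only the missing tuples
    return result

def rows(bitmap):
    """List the tuple positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Range-encoded bitmaps for the column [2, None, 3, 1, None], C = 3.
re_ = [0b10010, 0b11010, 0b11011, 0b11111]
as_match = rows(re_range_query(re_, 2, 3, True))
not_match = rows(re_range_query(re_, 2, 3, False))
print(as_match, not_match)
```

Any interval needs at most two bitmaps plus the missing bitmap, regardless of how wide the interval is.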
15VA-Files - Overview
- Vector Approximation (VA) File
- Each attribute domain is quantized into bins or buckets
- Using b bits we have 2^b bins
- Query Execution
- Quantize the query
- First pass: find records that fall into the quantized query intervals
- Second pass: look at the real data to prune false positives
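The two passes can be sketched for a single attribute as follows. Equal-width bins and the function names are assumptions for illustration, and every candidate is checked against the real data in the second pass, which is a simplification.

```python
def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to one of 2**bits equal-width bins."""
    bins = 2 ** bits
    index = int((value - lo) / (hi - lo) * bins)
    return min(index, bins - 1)  # keep value == hi in the last bin

def va_range_query(data, approx, v1, v2, lo, hi, bits):
    """Return (candidates, answers) for v1 <= A <= v2."""
    q1, q2 = quantize(v1, lo, hi, bits), quantize(v2, lo, hi, bits)
    # First pass: scan only the small approximations.
    candidates = [i for i, a in enumerate(approx) if q1 <= a <= q2]
    # Second pass: read the real values to drop false positives.
    answers = [i for i in candidates if v1 <= data[i] <= v2]
    return candidates, answers

data = [0.05, 0.30, 0.45, 0.52, 0.90]
approx = [quantize(x, 0.0, 1.0, 2) for x in data]
candidates, answers = va_range_query(data, approx, 0.4, 0.6, 0.0, 1.0, 2)
print(approx, candidates, answers)
```

Record 1 (0.30) shares a bin with the lower query bound, so it survives the first pass and is pruned in the second.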
16VA-Files - Handling Missing Data
- Reserve one bucket (Bucket 0) for missing data
- Using b bits we have 2^b - 1 bins
- Now 2 bits encode 3 buckets in addition to a missing data bucket
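A sketch of quantization with Bucket 0 reserved. Equal-width buckets, `None` for missing values, and the function name are assumptions for illustration.

```python
def quantize_with_missing(value, lo, hi, bits):
    """Bucket 0 means missing; real values use buckets 1..2**bits - 1."""
    if value is None:
        return 0
    bins = 2 ** bits - 1  # one code is given up for missing data
    index = int((value - lo) / (hi - lo) * bins)
    return 1 + min(index, bins - 1)

data = [0.1, None, 0.5, 0.95, None]
approx = [quantize_with_missing(x, 0.0, 1.0, 2) for x in data]
print(approx)
```

With 2 bits, the real values share buckets 1 to 3, so the approximation is slightly coarser than in a VA-file without missing data.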
17VA-Files - Querying Missing Data
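The evaluation rules for this slide are not in the text; the sketch below is one plausible reading for a single attribute, assuming Bucket 0 marks missing data and equal-width buckets. Records in Bucket 0 are accepted or rejected from the approximation alone, so only records with real values need the second-pass check.

```python
def quantize_with_missing(value, lo, hi, bits):
    """Bucket 0 means missing; real values use buckets 1..2**bits - 1."""
    if value is None:
        return 0
    bins = 2 ** bits - 1
    index = int((value - lo) / (hi - lo) * bins)
    return 1 + min(index, bins - 1)

def va_query(data, approx, v1, v2, lo, hi, bits, missing_is_match):
    """Evaluate v1 <= A <= v2 under the chosen missing-data semantics."""
    q1 = quantize_with_missing(v1, lo, hi, bits)
    q2 = quantize_with_missing(v2, lo, hi, bits)
    answers = []
    for i, a in enumerate(approx):
        if a == 0:
            if missing_is_match:
                answers.append(i)  # decided without reading the real data
        elif q1 <= a <= q2 and v1 <= data[i] <= v2:
            answers.append(i)      # candidate confirmed against real data
    return answers

data = [0.1, None, 0.5, 0.95, None]
approx = [quantize_with_missing(x, 0.0, 1.0, 2) for x in data]
as_match = va_query(data, approx, 0.4, 0.7, 0.0, 1.0, 2, True)
not_match = va_query(data, approx, 0.4, 0.7, 0.0, 1.0, 2, False)
print(as_match, not_match)
```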
18Experimental Results (1 of 3)
- Datasets
- Compare results based on
- Index Size and Query Execution Time
19Experimental Results (2 of 3)
- Index Size
- VA-File has the smallest size in every case
- BEE: compression ratio increases as the % of missing data increases
- BRE does not compress much
20Experimental Results (3 of 3)
- Query Execution Time
- In general, Bitmap Query Execution is faster than VA-File
- As cardinality increases BEE performs the worst
- As the % of missing data increases BEE performs faster because it compresses better
- Query Execution is linear in the number of dimensions queried in each case
21Conclusion
- The techniques presented are easy to apply and allow the effective indexing of missing data.
- By indexing each dimension independently, the index is not affected by the skewness of the data and the query does not need to be transformed into an exponential number of subqueries.
- Bitmaps and VA-Files exhibit a tradeoff between execution time and indexing space. The bit operations used to evaluate queries for bitmaps are fast, but the space required to represent an exact bitmap is much higher than that of a corresponding exact VA-file.
22Thank you!
- Questions or comments?
- Email canahuat_at_cse.ohio-state.edu