Indexing Incomplete Databases - PowerPoint PPT Presentation

About This Presentation

Title:

Indexing Incomplete Databases

Description:

Illustration. Problem Definition. Proposed Solutions. Bitmaps. VA-Files. Experimental Results ... Illustration: R-Trees ... Illustration: Bitstring-Augmented Tree ...

Provided by: Guada2

Transcript and Presenter's Notes

1
Indexing Incomplete Databases

Guadalupe Canahuate
Michael Gibas
Hakan Ferhatosmanoglu

2
Outline

Motivation
Current Approaches
Problems
Illustration
Problem Definition
Proposed Solutions
Bitmaps
VA-Files
Experimental Results
Conclusion

3
Motivation

Incomplete databases are common in all research
domains
In many cases, missingness is not ignorable
We want to be able to distinguish between the
real values and the absence of such values.
Two query semantics: "missing is a match" and
"missing is not a match"

4
Current Approaches (1)

Map missing values to a distinguished value
(e.g., -1 when all real values are positive)
Problems
Hierarchical Indexes break down because of the
skewness of the data
Multiple one dimensional indexes need to perform
expensive set operations (union and intersection)

5
Illustration: R-Trees

With skewed data, the bounding rectangles overlap
more, which makes pruning of the tree inefficient

[Figures: 2-dimensional random data, no missing values;
2-dimensional random data, 40% missing]

6
Current Approaches (2)

Use a function to scatter the points with missing
data over the data space
Solves the skewness problem
Problems
The query needs to be decomposed into an exponential
number of subqueries (exponential in the number
of dimensions queried)

7
Illustration: Bitstring-Augmented Tree

The average of the non-missing values is used as
a mapping function for the missing values.
A bitstring associated with each record indicates
whether data is missing or not (points (.2,.2)
and (.2,?) map to <(.2,.2),11> and <(.2,.2),10>)

It is necessary to look at all possible
combinations (2^k subqueries) and at the bitstring

Q: 0.3 ≤ A1 ≤ 0.5 AND 0.6 ≤ A2 ≤ 0.8
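
The 2^k growth can be illustrated with a short sketch. It assumes "missing is a match" semantics and that each missing attribute is stored at a single mapped value (such as the average) with its bit set to 0. The function name, the dictionary layout, and the mapped value 0.2 for both attributes are hypothetical choices made for illustration, not taken from the slides.

```python
from itertools import product

def enumerate_subqueries(query, mapped_value):
    """List one subquery per present/missing combination of the queried attributes.

    query:        dict attribute -> (low, high) range interval
    mapped_value: dict attribute -> value that missing entries are mapped to (assumed)
    Returns a list of (ranges, bitstring); bit 1 means "present", 0 means "missing".
    """
    attrs = sorted(query)
    subqueries = []
    for bits in product((1, 0), repeat=len(attrs)):
        ranges = {}
        for attr, bit in zip(attrs, bits):
            if bit:
                # Attribute is present: keep the original range interval.
                ranges[attr] = query[attr]
            else:
                # Attribute is missing: look only at the mapped point.
                m = mapped_value[attr]
                ranges[attr] = (m, m)
        subqueries.append((ranges, "".join(str(b) for b in bits)))
    return subqueries

q = {"A1": (0.3, 0.5), "A2": (0.6, 0.8)}
subs = enumerate_subqueries(q, {"A1": 0.2, "A2": 0.2})
for ranges, bitstring in subs:
    print(bitstring, ranges)
```

With k = 2 queried attributes the loop produces four subqueries; each additional queried attribute doubles that count.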
8
Problem Definition

D = (A1, A2, ..., Ad).
D is incomplete if tuples in it are allowed to
have missing attribute values.
Domain(Ai) = {1, ..., Ci}, where Ci is the cardinality of Ai
Query Q is a conjunction of k range intervals
Range interval: v1 ≤ Ai ≤ v2
Missing is a match: tuple t is an answer for Q if
every attribute of t in Q is either missing or
falls in the corresponding range interval
Missing is not a match: tuple t is an answer for
Q if every attribute of t in Q is not missing and
falls in the corresponding range interval
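
The two semantics can be written as a single predicate. This is a minimal sketch that assumes missing values are represented as None and a query as a dictionary of range intervals; the function name and data layout are illustrative only.

```python
def matches(tuple_values, query, missing_is_match):
    """Decide whether one tuple answers a conjunctive range query.

    tuple_values: dict attribute -> int, or None when the value is missing
    query:        dict attribute -> (v1, v2), meaning v1 <= attribute <= v2
    """
    for attr, (v1, v2) in query.items():
        value = tuple_values[attr]
        if value is None:
            # A missing value satisfies the interval only under "missing is a match".
            if not missing_is_match:
                return False
        elif not (v1 <= value <= v2):
            return False
    return True

query = {"A1": (3, 5), "A2": (6, 8)}
rows = [
    {"A1": 4, "A2": 7},     # both present and in range
    {"A1": 4, "A2": None},  # A2 missing
    {"A1": 9, "A2": None},  # A1 out of range, A2 missing
]
as_match = [matches(r, query, True) for r in rows]
not_match = [matches(r, query, False) for r in rows]
print(as_match, not_match)
```

Only the second tuple changes its answer between the two semantics: it qualifies when missing is a match and is rejected otherwise.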

9
Proposed Solution

Index each dimension independently
Keep NULL values in the database
Map missing values to a distinguished value in
the index
Use index structures that do not require Set
Operations over the partial results
Bitmaps
VA-Files
Specify interval evaluation rules for
Missing Data is a Query Match
Missing Data is not a Query Match

10
Bitmaps - Overview

Data is quantized into Categories
Using b bits we have b bins
Each tuple is encoded based on which category its
attribute belongs to
Equality Encoded (EE)
suited for point queries
Range Encoded (RE)
suited for range queries
Fast Bit Operations over bit vectors (AND, OR,
NOT, XOR)
Disadvantage: space requirements
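
The difference between the two encodings can be sketched on a small column with no missing data. The sketch assumes attribute values are integers 1..C, stores each bitmap as a Python integer whose bit i belongs to tuple i, and defines the range-encoded bitmap for v as "value <= v". All function names are illustrative.

```python
def build_equality_encoded(values, cardinality):
    """bitmaps[v] has bit i set when tuple i has exactly value v."""
    bitmaps = {v: 0 for v in range(1, cardinality + 1)}
    for i, x in enumerate(values):
        bitmaps[x] |= 1 << i
    return bitmaps

def build_range_encoded(values, cardinality):
    """bitmaps[v] has bit i set when tuple i has a value <= v."""
    bitmaps = {v: 0 for v in range(1, cardinality + 1)}
    for i, x in enumerate(values):
        for v in range(x, cardinality + 1):
            bitmaps[v] |= 1 << i
    return bitmaps

def set_positions(bitmap):
    """Tuple positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

values = [3, 1, 4, 2, 4]
ee = build_equality_encoded(values, 4)
re_ = build_range_encoded(values, 4)

point = ee[4]                     # A = 4: one EE bitmap
range_ee = ee[2] | ee[3] | ee[4]  # 2 <= A <= 4 with EE: one bitmap per value
range_re = re_[4] ^ re_[1]        # 2 <= A <= 4 with RE: two bitmaps
print(set_positions(point), set_positions(range_ee), set_positions(range_re))
```

A point query touches one equality-encoded bitmap, while a range query over EE ORs one bitmap per value in the range; with range encoding the same range needs only two bitmaps regardless of its width.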

11
Bitmaps - Compression

Run length compression
Word-Aligned Hybrid (WAH)
Two types of words: literal words and fill words
Faster query execution
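
The literal/fill distinction can be sketched as follows. This is a simplified, WAH-like illustration rather than the actual WAH word layout: it splits the bit vector into 31-bit groups, emits a fill entry for each run of identical all-0 or all-1 groups, and a literal entry for every mixed group. The tuple representation and function names are assumptions made for readability.

```python
from itertools import groupby

GROUP = 31  # payload bits per word in this sketch

def wah_like_encode(bits):
    """Encode a list of 0/1 as ('fill', bit, n_groups) and ('literal', group) entries."""
    # Pad with zeros so the vector splits evenly into 31-bit groups.
    padded = bits + [0] * (-len(bits) % GROUP)
    groups = [tuple(padded[i:i + GROUP]) for i in range(0, len(padded), GROUP)]
    words = []
    for group, run in groupby(groups):
        n = len(list(run))
        if len(set(group)) == 1:
            # All bits equal: one fill entry covers the whole run of groups.
            words.append(("fill", group[0], n))
        else:
            # Mixed bits: each group is stored verbatim as a literal.
            words.extend([("literal", group)] * n)
    return words

def wah_like_decode(words, length):
    """Expand the entries back into a list of 0/1 of the original length."""
    bits = []
    for word in words:
        if word[0] == "fill":
            bits.extend([word[1]] * (GROUP * word[2]))
        else:
            bits.extend(word[1])
    return bits[:length]

bits = [0] * 93 + [1, 0] + [1] * 29 + [1] * 62
words = wah_like_encode(bits)
print([w[:1] + w[2:] if w[0] == "fill" else w[:1] for w in words])
```

Here 186 bits collapse into three entries: a fill of three zero groups, one literal, and a fill of two one groups. Long uniform runs are what make sparse bitmaps compress well.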

12
Bitmaps: Handling Missing Data

Construct a bitmap for Missing Values
For EE: treat it as any other value
For RE: missing is the smallest value
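
One way to realize these two rules is sketched below. It assumes values are integers 1..C, uses 0 as the key for the missing-value bitmap, represents missing entries as None, and stores bitmaps as Python integers. The interval evaluation shown is one plausible reading of the rules, not necessarily the exact procedure on the slides, and all names are illustrative.

```python
def build_ee_with_missing(values, cardinality):
    """Equality encoding; key 0 holds the bitmap of missing entries."""
    bitmaps = {v: 0 for v in range(0, cardinality + 1)}
    for i, x in enumerate(values):
        bitmaps[0 if x is None else x] |= 1 << i
    return bitmaps

def build_re_with_missing(values, cardinality):
    """Range encoding with missing as the smallest value 0: bitmaps[v] marks value <= v."""
    bitmaps = {v: 0 for v in range(0, cardinality + 1)}
    for i, x in enumerate(values):
        start = 0 if x is None else x
        for v in range(start, cardinality + 1):
            bitmaps[v] |= 1 << i
    return bitmaps

def ee_range(ee, v1, v2, missing_is_match):
    result = 0
    for v in range(v1, v2 + 1):
        result |= ee[v]
    if missing_is_match:
        result |= ee[0]
    return result

def re_range(re_, v1, v2, missing_is_match):
    # Both operands contain the missing tuples, so XOR cancels them out.
    result = re_[v2] ^ re_[v1 - 1]
    if missing_is_match:
        result |= re_[0]
    return result

def set_positions(bitmap):
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

values = [3, None, 1, 4, None, 2]
ee = build_ee_with_missing(values, 4)
re_ = build_re_with_missing(values, 4)
for flag in (False, True):
    print(flag, set_positions(ee_range(ee, 2, 3, flag)),
          set_positions(re_range(re_, 2, 3, flag)))
```

Because missing is the smallest value, every range-encoded bitmap includes the missing tuples, and the bitmap for 0 doubles as the missing-value bitmap.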

13
Bitmaps (EE): Querying Missing Data
14
Bitmaps (RE): Querying Missing Data
15
VA-Files - Overview

Vector Approximation (VA) File
Each attribute domain is quantized into bins or
buckets
Using b bits we have 2^b bins
Query Execution
Quantize the query
First pass: find records that fall into the
quantized query intervals
Second pass: look at the real data to prune false
positives
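
The two-pass evaluation can be sketched for a single attribute. The sketch assumes equal-width buckets over a known [low, high] domain, which is a simplification chosen for brevity; the function names are illustrative.

```python
def quantize(value, low, high, bits):
    """Map a value in [low, high] to one of 2**bits equal-width buckets."""
    n_buckets = 2 ** bits
    bucket = int((value - low) / (high - low) * n_buckets)
    return min(bucket, n_buckets - 1)  # keep value == high in the last bucket

def va_range_query(data, approximations, v1, v2, low, high, bits):
    """Return (candidates, answers) for the range interval v1 <= value <= v2."""
    q1 = quantize(v1, low, high, bits)
    q2 = quantize(v2, low, high, bits)
    # First pass: scan only the compact approximations.
    candidates = [i for i, a in enumerate(approximations) if q1 <= a <= q2]
    # Second pass: read the real values of the candidates to drop false positives.
    answers = [i for i in candidates if v1 <= data[i] <= v2]
    return candidates, answers

data = [0.05, 0.30, 0.45, 0.55, 0.90]
approximations = [quantize(x, 0.0, 1.0, 2) for x in data]
candidates, answers = va_range_query(data, approximations, 0.4, 0.6, 0.0, 1.0, 2)
print(approximations, candidates, answers)
```

The record with value 0.30 shares a bucket with the lower query bound, so it survives the first pass and is only eliminated when the real data is inspected.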

16
VA-Files: Handling Missing Data

Reserve one bucket (Bucket 0) for missing data
Using b bits we have 2^b - 1 bins
Now 2 bits encode 3 buckets in addition to a
missing data bucket
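
A reserved missing bucket could be used as in the sketch below. It assumes integer values 1..C spread evenly over buckets 1..2^b - 1, missing entries represented as None, and a single queried attribute; the bucket assignment formula and the function names are illustrative assumptions.

```python
def quantize_with_missing(value, cardinality, bits):
    """Bucket 0 is reserved for missing; values 1..cardinality use buckets 1..2**bits - 1."""
    if value is None:
        return 0
    n_buckets = 2 ** bits - 1
    return 1 + (value - 1) * n_buckets // cardinality

def va_query(data, approx, v1, v2, cardinality, bits, missing_is_match):
    q1 = quantize_with_missing(v1, cardinality, bits)
    q2 = quantize_with_missing(v2, cardinality, bits)
    # First pass: bucket 0 qualifies only when missing counts as a match.
    candidates = [i for i, a in enumerate(approx)
                  if q1 <= a <= q2 or (missing_is_match and a == 0)]
    # Second pass: missing candidates are kept, present ones are checked exactly.
    return [i for i in candidates if data[i] is None or v1 <= data[i] <= v2]

data = [2, None, 5, 3, None, 6]
approx = [quantize_with_missing(x, 6, 2) for x in data]
print(approx)
print(va_query(data, approx, 3, 5, 6, 2, missing_is_match=False))
print(va_query(data, approx, 3, 5, 6, 2, missing_is_match=True))
```

With 2 bits, the six values share three data buckets and the two missing entries fall into bucket 0, so the missing records can be included or excluded in the first pass without touching the real data.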

17
VA-Files: Querying Missing Data
18
Experimental Results (1 of 3)

Datasets
Compare results based on
index size and query execution time

19
Experimental Results (2 of 3)

Index Size
VA-File has the smallest size in every case
BEE: compression ratio increases as the % of
missing data increases
BRE: does not compress much

20
Experimental Results (3 of 3)

Query Execution Time
In general, Bitmap Query Execution is faster than
VA-File
As cardinality increases BEE performs the worst
As the % of missing data increases, BEE performs
faster because it compresses better
Query Execution is linear in the number of
dimensions queried in each case

21
Conclusion

The techniques presented are easy to apply and
allow the effective indexing of missing data.
By indexing each dimension independently, the
index is not affected by the skewness of the data
and the query does not need to be transformed
into an exponential number of subqueries.
Bitmaps and VA-Files exhibit a tradeoff between
execution time and index space. The bit
operations used to evaluate queries for bitmaps
are fast, but the space required to represent an
exact bitmap is much higher than that of a
corresponding exact VA-File.

22
Thank you!