Indexing Incomplete Databases - PowerPoint PPT Presentation

1
Indexing Incomplete Databases
  • Guadalupe Canahuate
  • Michael Gibas
  • Hakan Ferhatosmanoglu

2
Outline
  • Motivation
  • Current Approaches
  • Problems
  • Illustration
  • Problem Definition
  • Proposed Solutions
  • Bitmaps
  • VA-Files
  • Experimental Results
  • Conclusion

3
Motivation
  • Incomplete databases are common in all research
    domains
  • In many cases, missingness is not ignorable
  • We want to be able to distinguish between the
    real values and the absence of such values.
  • Query semantics: "missing is a match" or "missing is
    not a match"

4
Current Approaches (1)
  • Map missing values to a distinguished value (e.g., -1
    when all real values are positive)
  • Problems:
    • Hierarchical indexes break down because of the
      skewness of the data
    • Multiple one-dimensional indexes need to perform
      expensive set operations (union and intersection)

5
Illustration: R-Trees
  • With skewed data, the bounding rectangles overlap
    more, making pruning of the tree inefficient

[Figure: R-tree bounding rectangles for 2-dimensional random data, with no missing values and with 40% missing values]
6
Current Approaches (2)
  • Use a function to scatter the points with missing
    data over the data space
    • Solves the skewness problem
  • Problems:
    • The query needs to be decomposed into an exponential
      number of subqueries (exponential in the number
      of dimensions queried)

7
Illustration: Bitstring-Augmented Tree
  • The average of the non-missing values is used as
    a mapping function for the missing values.
  • A bitstring associated with each record indicates
    whether data is missing or not (points (.2,.2)
    and (.2,?) map to <(.2,.2),11> and <(.2,.2),10>)
  • It is necessary to look at all possible
    combinations (2^k subqueries) and the bitstring

Q: 0.3 ≤ A1 ≤ 0.5 AND 0.6 ≤ A2 ≤ 0.8
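
The 2^k growth mentioned on this slide can be made concrete by listing the present/missing patterns a query over k attributes has to consider. This is only a counting sketch: the function name `subquery_bitstrings` is hypothetical, and how the bitstring-augmented tree builds each subquery is not shown in the transcript.

```python
from itertools import product


def subquery_bitstrings(k):
    """All present/missing patterns over k queried attributes (1 = present, 0 = missing)."""
    return ["".join(bits) for bits in product("10", repeat=k)]


# The two-attribute query Q on this slide yields four patterns.
patterns = subquery_bitstrings(2)
print(patterns)

# The count doubles with every additional queried attribute.
print(len(subquery_bitstrings(5)))
```

Each pattern corresponds to one subquery, so a 10-attribute query would already need 1024 of them.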
8
Problem Definition
  • D = (A1, A2, ..., Ad).
  • D is incomplete if tuples in it are allowed to
    have missing attribute values.
  • Domain(Ai) = {1, ..., Ci}, where Ci is the cardinality of Ai
  • Query Q is a k-dimensional range interval
  • Range interval: v1 ≤ Ai ≤ v2
  • Missing is a match: tuple t is an answer for Q if
    every attribute of t in Q is either missing or
    falls in the corresponding range interval
  • Missing is not a match: tuple t is an answer for
    Q if every attribute of t in Q is not missing and
    falls in the corresponding range interval

9
Proposed Solution
  • Index each dimension independently
  • Keep NULL values in the database
  • Map missing values to a distinguished value in
    the index
  • Use index structures that do not require set
    operations over the partial results
    • Bitmaps
    • VA-Files
  • Specify interval evaluation rules for:
    • Missing data is a query match
    • Missing data is not a query match
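
One way to read "no set operations over the partial results": if every dimension produces a bit vector with one bit per tuple, the per-dimension answers can be combined with a bitwise AND rather than by intersecting lists of record ids. The sketch below assumes Python integers as bit vectors; `combine` is a hypothetical helper, not something named in the slides.

```python
def combine(per_dimension_results):
    """AND together one result bit vector per queried dimension."""
    result = -1  # in Python, -1 behaves as a vector with every bit set
    for bits in per_dimension_results:
        result &= bits
    return result


# Bit i is set when tuple i satisfies the interval on that dimension.
a1_result = 0b1011  # tuples 0, 1 and 3 satisfy the A1 interval
a2_result = 0b0110  # tuples 1 and 2 satisfy the A2 interval

answer = combine([a1_result, a2_result])
print(bin(answer))
```

Only tuple 1 satisfies both intervals, and the work done is one AND per queried dimension.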

10
Bitmaps - Overview
  • Data is quantized into Categories
  • Using b bits we have b bins (one bit per bin)
  • Each tuple is encoded based on which category its
    attribute belongs to
  • Equality Encoded (EE)
    • Suited for point queries
  • Range Encoded (RE)
    • Suited for range queries
  • Fast bit operations over bit vectors (AND, OR,
    NOT, XOR)
  • Disadvantage: space requirements
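
The difference between the two encodings can be sketched on a single attribute with complete data. The code assumes values are integers in 1..cardinality and uses Python integers as bit vectors; the function names are illustrative only.

```python
def equality_encode(values, cardinality):
    """bitmaps[v] has bit i set when values[i] == v."""
    bitmaps = {v: 0 for v in range(1, cardinality + 1)}
    for i, x in enumerate(values):
        bitmaps[x] |= 1 << i
    return bitmaps


def range_encode(values, cardinality):
    """bitmaps[v] has bit i set when values[i] <= v."""
    bitmaps = {v: 0 for v in range(1, cardinality + 1)}
    for i, x in enumerate(values):
        for v in range(x, cardinality + 1):
            bitmaps[v] |= 1 << i
    return bitmaps


def rows_of(bitmap, n):
    """Positions of the set bits among the first n tuples."""
    return [i for i in range(n) if (bitmap >> i) & 1]


values = [2, 1, 3, 2]
ee = equality_encode(values, 3)
re_ = range_encode(values, 3)

# Point query A = 2: a single EE bitmap is enough.
point = rows_of(ee[2], len(values))

# Range query 2 <= A <= 3: EE ORs one bitmap per value in the range,
# while RE touches two bitmaps however wide the range is.
range_ee = rows_of(ee[2] | ee[3], len(values))
range_re = rows_of(re_[3] & ~re_[1], len(values))
print(point, range_ee, range_re)
```

Both encodings return the same tuples for the range query; they differ in how many bitmaps have to be read.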

11
Bitmaps - Compression
  • Run length compression
  • Word-Aligned Hybrid (WAH)
    • Two types of words: literal and fill words
    • Faster query execution
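
The literal/fill distinction can be illustrated with a simplified grouping scheme. This is not the real WAH layout: actual WAH packs each word into 32 bits and evaluates operations on the compressed form, whereas the sketch below keeps words as Python tuples and only shows how 31-bit groups become either fill words or literal words. The names `compress` and `decompress` are hypothetical.

```python
WORD_BITS = 31  # payload bits carried by one 32-bit WAH word


def compress(bits):
    """Turn a list of 0/1 values into ('fill', bit, n_groups) and ('literal', group) words."""
    words = []
    for start in range(0, len(bits), WORD_BITS):
        group = bits[start:start + WORD_BITS]
        uniform = len(group) == WORD_BITS and len(set(group)) == 1
        if uniform and words and words[-1][0] == "fill" and words[-1][1] == group[0]:
            # Extend the run recorded by the previous fill word.
            words[-1] = ("fill", group[0], words[-1][2] + 1)
        elif uniform:
            words.append(("fill", group[0], 1))
        else:
            words.append(("literal", tuple(group)))
    return words


def decompress(words):
    """Expand the words back into the original list of bits."""
    bits = []
    for word in words:
        if word[0] == "fill":
            bits.extend([word[1]] * (WORD_BITS * word[2]))
        else:
            bits.extend(word[1])
    return bits


bits = [0] * 93 + [1, 0, 1] + [1] * 62
words = compress(bits)
print(len(bits), "bits ->", len(words), "words")
```

Long runs of identical bits, such as those in a sparse bitmap, collapse into a single fill word, which is why sparse bitmaps compress well.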

12
Bitmaps - Handling Missing Data
  • Construct a bitmap for missing values
  • For EE: treat it as any other value
  • For RE: missing is the smallest value

13
Bitmaps (EE) - Querying Missing Data
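
The content of this slide did not survive in the transcript. The sketch below is one reading that is consistent with "treat it as any other value": the missing bitmap is stored next to the value bitmaps and is ORed into the result only when missing is a match. The code `0` for the missing bitmap and the function names are assumptions.

```python
MISSING = 0  # assumed key for the missing-value bitmap


def equality_encode(values, cardinality):
    """One bitmap per value in 1..cardinality, plus bitmaps[MISSING] for None."""
    bitmaps = {v: 0 for v in range(0, cardinality + 1)}
    for i, x in enumerate(values):
        bitmaps[MISSING if x is None else x] |= 1 << i
    return bitmaps


def ee_range(bitmaps, v1, v2, missing_is_match):
    """Bit vector of tuples with v1 <= value <= v2, optionally including missing ones."""
    result = 0
    for v in range(v1, v2 + 1):
        result |= bitmaps[v]
    if missing_is_match:
        result |= bitmaps[MISSING]
    return result


values = [2, None, 3, 1]
bitmaps = equality_encode(values, 3)

strict = ee_range(bitmaps, 2, 3, missing_is_match=False)
relaxed = ee_range(bitmaps, 2, 3, missing_is_match=True)
print(bin(strict), bin(relaxed))
```

Tuple 1, whose value is missing, appears only in the relaxed result.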
14
Bitmaps (RE) - Querying Missing Data
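
The evaluation rules from this slide are also absent from the transcript. A sketch under the stated design, where missing is coded as the smallest value: bitmap 0 then marks exactly the missing tuples, and every higher bitmap includes them. The function names and the 1..cardinality value domain are assumptions.

```python
def range_encode(values, cardinality):
    """bitmaps[v] has bit i set when tuple i's code is <= v; None is coded as 0."""
    bitmaps = {v: 0 for v in range(0, cardinality + 1)}
    for i, x in enumerate(values):
        code = 0 if x is None else x
        for v in range(code, cardinality + 1):
            bitmaps[v] |= 1 << i
    return bitmaps


def re_range(bitmaps, v1, v2, missing_is_match):
    """Bit vector of tuples with v1 <= value <= v2, for 1 <= v1 <= v2."""
    # bitmaps[v1 - 1] always contains the missing tuples, so the
    # subtraction below removes them together with the smaller values.
    result = bitmaps[v2] & ~bitmaps[v1 - 1]
    if missing_is_match:
        result |= bitmaps[0]
    return result


values = [2, None, 3, 1]
bitmaps = range_encode(values, 3)

strict = re_range(bitmaps, 2, 3, missing_is_match=False)
relaxed = re_range(bitmaps, 2, 3, missing_is_match=True)
print(bin(strict), bin(relaxed))
```

In this reading the strict semantics costs two bitmap reads and the relaxed semantics adds one more OR with the missing bitmap.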
15
VA-Files - Overview
  • Vector Approximation (VA) File
  • Each attribute domain is quantized into bins or
    buckets
  • Using b bits we have 2^b bins
  • Query execution:
    • Quantize the query
    • First pass: find records that fall into the
      quantized query intervals
    • Second pass: look at the real data to prune false
      positives
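
The two-pass query execution can be sketched for a single attribute with complete data. Equal-width bins and the names `quantize` and `va_range_query` are assumptions; a real VA-file stores one approximation per attribute of every record.

```python
def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to one of 2**bits equal-width bins."""
    n_bins = 2 ** bits
    bin_width = (hi - lo) / n_bins
    return min(int((value - lo) / bin_width), n_bins - 1)


def va_range_query(data, approx, lo, hi, bits, v1, v2):
    """Return (first-pass candidates, final answers) for v1 <= value <= v2."""
    q1, q2 = quantize(v1, lo, hi, bits), quantize(v2, lo, hi, bits)
    # First pass: scan only the compact approximations.
    candidates = [i for i, a in enumerate(approx) if q1 <= a <= q2]
    # Second pass: read the real values to drop false positives.
    answers = [i for i in candidates if v1 <= data[i] <= v2]
    return candidates, answers


data = [0.10, 0.30, 0.45, 0.55, 0.90]
approx = [quantize(x, 0.0, 1.0, 2) for x in data]
candidates, answers = va_range_query(data, approx, 0.0, 1.0, 2, 0.4, 0.6)
print(approx, candidates, answers)
```

Record 1 (value 0.30) shares a bin with the lower query bound, so it survives the first pass and is pruned in the second.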

16
VA-Files - Handling Missing Data
  • Reserve one bucket (Bucket 0) for missing data
  • Using b bits we have 2^b - 1 bins
  • Now 2 bits encode 3 buckets in addition to a
    missing data bucket

17
VA-Files - Querying Missing Data
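
The query rules from this slide are not in the transcript. The sketch below combines the reserved bucket 0 from the previous slide with the two semantics: a record in bucket 0 is accepted or rejected without reading the real data, and all other records go through the usual two-pass check. Equal-width bins and all names are assumptions.

```python
MISSING_BUCKET = 0


def quantize(value, lo, hi, bits):
    """Bucket 0 is reserved for None; real values use buckets 1..2**bits - 1."""
    if value is None:
        return MISSING_BUCKET
    n_bins = 2 ** bits - 1
    bin_width = (hi - lo) / n_bins
    return 1 + min(int((value - lo) / bin_width), n_bins - 1)


def va_range_query(data, approx, lo, hi, bits, v1, v2, missing_is_match):
    """Indices of records answering v1 <= value <= v2 under the chosen semantics."""
    q1, q2 = quantize(v1, lo, hi, bits), quantize(v2, lo, hi, bits)
    answers = []
    for i, a in enumerate(approx):
        if a == MISSING_BUCKET:
            if missing_is_match:
                answers.append(i)
        elif q1 <= a <= q2 and v1 <= data[i] <= v2:  # second-pass check on real data
            answers.append(i)
    return answers


data = [0.10, None, 0.45, 0.55, 0.90]
approx = [quantize(x, 0.0, 1.0, 2) for x in data]

strict = va_range_query(data, approx, 0.0, 1.0, 2, 0.4, 0.6, False)
relaxed = va_range_query(data, approx, 0.0, 1.0, 2, 0.4, 0.6, True)
print(approx, strict, relaxed)
```

With 2 bits the real values share three buckets, matching the "3 buckets in addition to a missing data bucket" stated on the previous slide.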
18
Experimental Results (1 of 3)
  • Datasets
  • Compare results based on:
    • Index size
    • Query execution time

19
Experimental Results (2 of 3)
  • Index size
    • VA-File has the smallest size in every case
    • BEE (equality-encoded bitmaps): compression ratio
      increases as the percentage of missing data increases
    • BRE (range-encoded bitmaps) does not compress much

20
Experimental Results (3 of 3)
  • Query execution time
    • In general, bitmap query execution is faster than
      VA-File query execution
    • As cardinality increases, BEE performs the worst
    • As the percentage of missing data increases, BEE
      performs faster because it compresses better
    • Query execution is linear in the number of
      dimensions queried in each case

21
Conclusion
  • The techniques presented are easy to apply and
    allow the effective indexing of missing data.
  • By indexing each dimension independently, the
    index is not affected by the skewness of the data
    and the query does not need to be transformed
    into an exponential number of subqueries.
  • Bitmaps and VA-Files exhibit a tradeoff between
    execution time and indexing space. The bit
    operations used to evaluate queries for bitmaps
    are fast, but the space required to represent an
    exact bitmap is much higher than that of a
    corresponding exact VA-File.

22
Thank you!
  • Questions or comments?
  • Email: canahuat_at_cse.ohio-state.edu