Title: Dense-Region Based Compact Data Cube
1. Dense-Region Based Compact Data Cube
2. Outline
- Background
- Introduction to Compact Data Cube
- Pros and cons of the Compact Data Cube method
- Dense-Region Based Compact Data Cube
3. Background
- What is a data cube?
- A set of pre-computed aggregates over the underlying
data warehouse.
- System constraints on materializing data cube(s)
- Disk space, maintenance cost, etc.
- Common approach: materialize parts of a data
cube.
- Alternative: use an approximation technique.
- Reason: OLAP applications accept approximate
answers in many scenarios.
4. Introduction to Compact Data Cube
- The Compact Data Cube was proposed by Vitter and Wang
in "Approximate Computation of Multidimensional
Aggregates of Sparse Data Using Wavelets" (SIGMOD
'99).
- Main ideas
- Offline phase: perform the Haar wavelet transform on
the underlying data (i.e. the base cuboid) and
store the k most significant coefficients.
- Online phase: process any given query based on
the k most significant coefficients.
5. Introduction to Compact Data Cube
- Basics of Haar wavelet transform
- Building Compact Data Cube
- Thresholding and Ranking
- Answering On-Line Queries
6. Introduction to Compact Data Cube
- Basics of Haar wavelet transform
- e.g. S = [2, 2, 0, 2, 3, 5, 4, 4]
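The example signal can be decomposed by repeated pairwise averaging and differencing. A minimal sketch, assuming the unnormalized convention in which each detail coefficient is half the difference of a pair (first value minus second); the name `haar_decompose` is illustrative and does not come from the slides.

```python
def haar_decompose(signal):
    """Unnormalized 1-D Haar wavelet decomposition.

    Assumes len(signal) is a power of two. Returns
    [overall average, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = [float(v) for v in signal]
    details = []  # detail coefficients, coarsest level first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Prepend so that coarser levels end up in front of finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


S = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_decompose(S))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The first entry is the overall average of S; the remaining entries are detail coefficients ordered from the lowest resolution to the highest.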
7. Introduction to Compact Data Cube
- Basics of Haar wavelet transform
- For compression reasons, the detail coefficients
are normalized.
- The coefficients at the lower resolutions are
weighted more heavily.
- Approximates the original signal by keeping only
the most significant coefficients.
- Requires only O(N) CPU time and O(N/B) I/Os to
compute for a signal of N values.
- Multidimensional wavelet transform: a series of
one-dimensional wavelet transforms.
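Normalization and thresholding can be sketched as follows. This assumes the common convention of dividing a detail coefficient at resolution level j (0 = coarsest) by sqrt(2^j) before ranking; the slides do not state the factor, and the function names are illustrative. The input is the unnormalized coefficient list of the example signal S = [2, 2, 0, 2, 3, 5, 4, 4].

```python
import math


def level_of(i):
    """Resolution level of coefficient index i (0 = coarsest)."""
    return 0 if i == 0 else i.bit_length() - 1


def keep_top_k(coeffs, k):
    """Zero out all but the k coefficients that are largest in
    normalized absolute value, |c| / sqrt(2 ** level)."""
    def normalized(i):
        return abs(coeffs[i]) / math.sqrt(2 ** level_of(i))

    keep = set(sorted(range(len(coeffs)), key=normalized, reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def haar_reconstruct(coeffs):
    """Invert the unnormalized Haar decomposition."""
    values = list(coeffs[:1])
    width = 1
    while width < len(coeffs):
        details = coeffs[width:2 * width]
        expanded = []
        for avg, det in zip(values, details):
            expanded += [avg + det, avg - det]
        values = expanded
        width *= 2
    return values


coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(haar_reconstruct(keep_top_k(coeffs, 4)))
# -> [1.5, 1.5, 0.5, 2.5, 3.0, 5.0, 4.0, 4.0]
```

With only 4 of the 8 coefficients kept, the reconstruction stays close to the original signal; keeping all 8 reproduces it exactly.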
8. Introduction to Compact Data Cube
- Building the Compact Data Cube
- Problem 1: the size of the multidimensional array
representing the underlying data is too large
(assume the data are very sparse).
- Solution: divide the wavelet transform process
into multiple passes.
9. Introduction to Compact Data Cube
- Building the Compact Data Cube
- Problem 2: the density of the intermediate
results would increase from pass to pass.
- Solution: truncate the intermediate
multidimensional array by cutting off entries
with small magnitude.
- I/O complexity
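The two problems and their solutions can be illustrated with a small in-memory sketch. It assumes one pass per dimension, a sparse cube stored as a dict from coordinate tuples to values, and a hypothetical cut-off parameter `eps`; the slides do not specify how the passes are organized, and the I/O behavior is not modeled here.

```python
from collections import defaultdict


def haar_1d(values):
    """Unnormalized 1-D Haar decomposition (length must be a power of two)."""
    averages, details = list(values), []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def transform_pass(cells, axis, size):
    """One pass: transform every non-empty line of the cube along `axis`."""
    lines = defaultdict(dict)
    for coord, value in cells.items():
        rest = coord[:axis] + coord[axis + 1:]
        lines[rest][coord[axis]] = value
    out = {}
    for rest, line in lines.items():
        dense = [line.get(i, 0.0) for i in range(size)]
        for i, c in enumerate(haar_1d(dense)):
            if c != 0.0:  # store only non-zero results
                out[rest[:axis] + (i,) + rest[axis:]] = c
    return out


def truncate(cells, eps):
    """Cut off entries whose magnitude is below eps."""
    return {k: v for k, v in cells.items() if abs(v) >= eps}


def build(cells, shape, eps):
    """Transform along each dimension in turn, truncating after each pass."""
    for axis, size in enumerate(shape):
        cells = truncate(transform_pass(cells, axis, size), eps)
    return cells


cube = {(0, 0): 4.0, (0, 1): 4.0, (3, 2): 8.0}  # 3 non-empty cells in a 4 x 4 cube
print(len(transform_pass(cube, 0, 4)))  # non-zero entries after the first pass
print(build(cube, (4, 4), 1.5))         # coefficients surviving truncation
```

The first pass turns 3 non-empty cells into more non-zero entries, which is the density growth described in Problem 2; a larger `eps` keeps the intermediate result small at the cost of accuracy.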
10. Introduction to Compact Data Cube
- Thresholding and Ranking
- Choice 1: keep the C largest (in absolute value)
wavelet coefficients.
- Choice 2: keep the C wavelet coefficients with
the largest weights among the C' largest
coefficients (C < C').
- The weight of a coefficient equals the number
of its dimensions with value zero.
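The two choices can be sketched as follows, assuming coefficients are stored as a dict from multidimensional index tuples to values. The function names are illustrative, and breaking ties in weight by magnitude is an assumption of this sketch rather than something the slides specify.

```python
def weight(index):
    """Weight of a coefficient: number of dimensions whose index is zero."""
    return sum(1 for i in index if i == 0)


def choice_1(coeffs, c):
    """Keep the c coefficients with the largest absolute value."""
    top = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:c]
    return dict(top)


def choice_2(coeffs, c, c_prime):
    """Among the c_prime largest coefficients, keep the c with the largest weights."""
    if c > c_prime:
        raise ValueError("c must not exceed c_prime")
    candidates = choice_1(coeffs, c_prime)
    # Assumption: ties in weight are broken by magnitude.
    top = sorted(candidates.items(),
                 key=lambda kv: (weight(kv[0]), abs(kv[1])),
                 reverse=True)[:c]
    return dict(top)


coeffs = {(0, 0): 1.0, (1, 0): 3.0, (2, 3): 5.0, (0, 2): 2.0, (3, 1): 4.0}
print(choice_1(coeffs, 2))     # -> {(2, 3): 5.0, (3, 1): 4.0}
print(choice_2(coeffs, 2, 4))  # -> {(1, 0): 3.0, (0, 2): 2.0}
```

Choice 2 trades some magnitude for coefficients that have a zero index in more dimensions.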
11. Introduction to Compact Data Cube
- Answering On-Line Queries
- Space: O((d+1)k), CPU time
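For intuition, a one-dimensional range-sum can be answered directly from the stored coefficients without reconstructing the signal. A minimal sketch, assuming unnormalized Haar coefficients stored in a dict keyed by coefficient index (missing indices count as zero); the multidimensional case in the slides is not covered, and the name `range_sum` is illustrative.

```python
def range_sum(coeffs, n, lo, hi):
    """Approximate sum of signal[lo..hi] (inclusive) from stored coefficients.

    coeffs maps coefficient index -> unnormalized Haar coefficient.
    n is the signal length (a power of two).
    """
    def overlap(a, b):
        # Number of positions in [a, b) that fall inside [lo, hi].
        return max(0, min(b, hi + 1) - max(a, lo))

    total = 0.0
    for i, c in coeffs.items():
        if i == 0:
            # The overall average contributes once per queried position.
            total += c * overlap(0, n)
            continue
        level = i.bit_length() - 1
        length = n >> level                  # support of this coefficient
        start = (i - (1 << level)) * length
        mid = start + length // 2
        # +c on the left half of the support, -c on the right half.
        total += c * (overlap(start, mid) - overlap(mid, start + length))
    return total


all_coeffs = {0: 2.75, 1: -1.25, 2: 0.5, 5: -1.0, 6: -1.0}
top_3 = {0: 2.75, 1: -1.25, 5: -1.0}
print(range_sum(all_coeffs, 8, 2, 5))  # -> 10.0 (exact: 0 + 2 + 3 + 5)
print(range_sum(top_3, 8, 2, 5))       # -> 11.0 (approximate)
```

Each stored coefficient is visited once, so the work grows with the number of coefficients kept rather than with the size of the cube.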
12. Pros and cons of the compact data cube method
- Pros
- Requires little disk space (a small number of
disk blocks).
- Responds to on-line queries quickly.
- Answers OLAP queries more accurately than other
approximation techniques such as histograms and
random sampling.
- Can progressively refine the approximate answer
with no added overhead.
13. Pros and cons of the compact data cube method
- Cons
- Approximates a vast number of useless empty cells
in the base cuboid together with the useful
non-empty cells.
- Needs to cut off entries with small magnitude at
the end of each pass in order to maintain a
constant number of I/O operations from pass to
pass.
14. Dense-Region Based Compact Data Cube
- Aim
- To enhance the ability of the compact data cube
method to handle datasets that have the
dense-regions-in-sparse-cube property.
- Main idea
- To exclude empty cells in the base cuboid from
approximation.
- Two-phase approach
- Compute dense regions in the base cuboid.
- Approximate each dense region independently.
15. Dense-Region Based Compact Data Cube
- Question 1: how can we find the dense regions
efficiently?
- Efficient DEnse region Mining (EDEM) algorithm,
proposed by Cheung et al. in "DROLAP -- A
Dense-Region-Based Approach to On-line Analytical
Processing" (DEXA '99)
16. Dense-Region Based Compact Data Cube
- Basic ideas of EDEM
- Build a k-d tree to store the valid cells.
- Grow dense region covers along boundaries.
- Search dense regions among the covers.
- Complexity of EDEM: linear in the number of
dimensions and sub-quadratic in the number of
data points.
17. Dense-Region Based Compact Data Cube
- Question 2: how should we allocate disk space in
approximating the dense regions?
- Choice 1: allocate disk space equally to each
dense region.
- Choice 2: allocate disk space according to the
sizes of the dense regions.
- Choice 3: order the wavelet coefficients of all
the dense regions and keep the most significant
ones (in absolute value).
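The three allocation choices can be compared with a small sketch. It assumes disk space is measured as a budget of stored coefficients and that each dense region is identified by a name; the function names and the sample data are illustrative only. Integer division may leave part of the budget unassigned in Choices 1 and 2.

```python
def allocate_equal(region_names, budget):
    """Choice 1: the same number of coefficients for every dense region."""
    share = budget // len(region_names)
    return {name: share for name in region_names}


def allocate_by_size(sizes, budget):
    """Choice 2: coefficients in proportion to each region's cell count."""
    total = sum(sizes.values())
    return {name: budget * size // total for name, size in sizes.items()}


def allocate_global(region_coeffs, budget):
    """Choice 3: rank the coefficients of all regions together by absolute
    value and count how many of the top `budget` fall in each region."""
    ranked = sorted(
        ((abs(c), name) for name, cs in region_coeffs.items() for c in cs),
        reverse=True,
    )
    counts = {name: 0 for name in region_coeffs}
    for _, name in ranked[:budget]:
        counts[name] += 1
    return counts


sizes = {"A": 48, "B": 16}  # number of cells per dense region
coeffs = {"A": [9.0, 0.5, 0.2, 0.1], "B": [7.0, 6.0, 5.0, 0.3]}
print(allocate_equal(list(sizes), 4))  # -> {'A': 2, 'B': 2}
print(allocate_by_size(sizes, 4))      # -> {'A': 3, 'B': 1}
print(allocate_global(coeffs, 4))      # -> {'A': 1, 'B': 3}
```

In this made-up example the smaller region B holds most of the large coefficients, so Choice 3 gives it the larger share while Choice 2 favors region A.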
18. Dense-Region Based Compact Data Cube
- Question 3: how should we treat the data points
outside the dense regions?
- Keep all of them, or keep only the significant ones.
- Question 4: how do we answer on-line queries
using the dense-region based approach?
- Check whether a dense region is covered by the
given query.
- Check whether the stored coefficients contribute to
the range sum and compute the amount of
contribution if needed.
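The query steps above can be sketched as a box-intersection test followed by per-region contributions. This assumes each dense region is an axis-aligned box of inclusive (lo, hi) ranges paired with a hypothetical callable `region_sum` that returns the approximate sum over a sub-box (for example, computed from that region's stored coefficients), and that the kept points outside dense regions are stored exactly in a dict. None of these names come from the slides.

```python
def intersect(box_a, box_b):
    """Intersection of two boxes, each a tuple of inclusive (lo, hi)
    ranges with one range per dimension; None if they do not overlap."""
    out = []
    for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b):
        lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
        if lo > hi:
            return None
        out.append((lo, hi))
    return tuple(out)


def answer_range_sum(query, regions, outliers):
    """Approximate range sum over the box `query`.

    regions:  list of (box, region_sum) pairs; region_sum(sub_box) is a
              hypothetical callable giving the region's approximate sum.
    outliers: dict from coordinates outside all dense regions to values.
    """
    total = 0.0
    for box, region_sum in regions:
        part = intersect(query, box)
        if part is not None:  # the region overlaps the query
            total += region_sum(part)
    for coord, value in outliers.items():
        if all(lo <= x <= hi for x, (lo, hi) in zip(coord, query)):
            total += value
    return total


def uniform(value):
    """Stand-in for a region approximation: every cell holds `value`."""
    def region_sum(box):
        cells = 1
        for lo, hi in box:
            cells *= hi - lo + 1
        return value * cells
    return region_sum


regions = [(((0, 1), (0, 1)), uniform(2.0)), (((6, 7), (6, 7)), uniform(5.0))]
outliers = {(4, 2): 3.0, (7, 0): 9.0}
print(answer_range_sum(((1, 5), (0, 3)), regions, outliers))  # -> 7.0
```

Only the first region overlaps the query (2 cells, contributing 4.0), and only the outlier at (4, 2) lies inside it.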
19. Dense-Region Based Compact Data Cube
- One favorable side effect
- We may parallelize the construction of the compact
data cube.
- More questions
- How can we handle updates to the underlying data?
- How can we approximate an iceberg cube? Can we apply
the idea of the compact data cube to an iceberg cube?
- Can the compact data cube be used to answer other
types of OLAP queries besides range-sum?