Title: Cube Computation
1 Cube Computation
- Prof. Navneet Goyal
- Computer Science & Information Systems Department
- BITS, Pilani
2 Cuboids Corresponding to the Cube
- 0-D (apex) cuboid: all
- 1-D cuboids: (product), (date), (country)
- 2-D cuboids: (product, date), (product, country), (date, country)
- 3-D (base) cuboid: (product, date, country)
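The lattice above can be generated mechanically: every subset of the dimensions is one cuboid. The following is a minimal sketch, assuming the three dimensions shown on the slide; the function name `enumerate_cuboids` is illustrative only.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every group-by subset of the dimensions, keyed by dimensionality."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        # Each k-element subset of the dimensions is one k-D cuboid.
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = enumerate_cuboids(["product", "date", "country"])
for k, cuboids in lattice.items():
    # The empty subset is the apex cuboid, conventionally written "all".
    names = [", ".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids: {names}")
```

With n dimensions and no hierarchies this yields 2^n cuboids, 8 in this example.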
3 Efficient Data Cube Computation
- A data cube can be viewed as a lattice of cuboids
- The bottom-most cuboid is the base cuboid
- The top-most cuboid (apex) contains only one cell
- How many cuboids are there in an n-dimensional cube with L levels?
- Materialization of the data cube
- Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
- Selection of which cuboids to materialize
- Based on size, sharing, access frequency, etc.
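The question about the number of cuboids has a simple closed form. As a sketch, assume dimension i has L_i hierarchy levels, not counting the virtual top level "all". Each dimension then offers L_i + 1 choices, and the total is the product of those choices. The function name `count_cuboids` is illustrative.

```python
from math import prod

def count_cuboids(levels_per_dimension):
    """Total cuboids when dimension i has L_i levels (excluding the top level 'all')."""
    # Each dimension contributes its L_i levels plus the virtual level "all".
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions without hierarchies: the 8 cuboids of the product/date/country lattice.
print(count_cuboids([1, 1, 1]))

# Three dimensions with 4 levels each (a hypothetical example).
print(count_cuboids([4, 4, 4]))
```

The count grows quickly with the number of dimensions and levels, which is why partial materialization is usually the practical choice.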
4 Cube Computation: ROLAP-Based Method
- Efficient cube computation methods
- ROLAP-based cubing algorithms (Agarwal et al. '96)
- Array-based cubing algorithm (Zhao et al. '97)
- Bottom-up computation method (Beyer & Ramakrishnan '99)
- ROLAP-based cubing algorithms
- Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
- Grouping is performed on some subaggregates as a partial grouping step
- Aggregates may be computed from previously computed aggregates, rather than from the base fact table
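The last point can be illustrated with a small sketch. The fact table, its values, and the helper `group_by` are hypothetical. The example computes the (product) cuboid from the already computed (product, date) cuboid and checks that it matches the result obtained from the base fact table.

```python
from collections import defaultdict

def group_by(rows, key_positions):
    """Sum the measure (last field of each row), grouped by the selected key fields."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        totals[key] += row[-1]
    return dict(totals)

# Hypothetical base fact table: (product, date, country, sales)
fact = [
    ("tv", "2024-01", "IN", 5),
    ("tv", "2024-01", "US", 3),
    ("tv", "2024-02", "IN", 2),
    ("pc", "2024-01", "IN", 7),
]

# (product, date) cuboid, computed from the base fact table.
product_date = group_by(fact, [0, 1])

# (product) cuboid, computed from the smaller (product, date) cuboid.
product_date_rows = [key + (total,) for key, total in product_date.items()]
product_from_parent = group_by(product_date_rows, [0])

# The same cuboid computed directly from the base fact table, for comparison.
product_from_base = group_by(fact, [0])
print(product_from_parent == product_from_base)
```

Reusing a parent works here because SUM is a distributive aggregate; a measure such as MEDIAN could not be rolled up this way.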
5 Cube Computation: ROLAP-Based Method (2)
- Hash/sort-based methods (Agarwal et al., VLDB '96)
- Smallest-parent: computing a cuboid from the smallest previously computed cuboid
- Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os
- Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
- Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
- Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
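The smallest-parent optimization can be sketched as a lookup over the cuboids computed so far. The row counts and the function name `smallest_parent` are hypothetical. The sketch also accepts any already computed cuboid whose dimensions include the target's, not only direct parents.

```python
def smallest_parent(target, computed_sizes):
    """Pick the smallest already-computed cuboid that contains all target dimensions."""
    target_dims = set(target)
    candidates = [
        (size, dims)
        for dims, size in computed_sizes.items()
        if target_dims < set(dims)  # proper superset: a true ancestor in the lattice
    ]
    if not candidates:
        return None
    # min() compares sizes first, so the cheapest cuboid to scan wins.
    return min(candidates)[1]

# Hypothetical row counts of cuboids that have already been computed.
computed_sizes = {
    ("product", "date", "country"): 1_000_000,
    ("product", "date"): 50_000,
    ("product", "country"): 8_000,
}

parent = smallest_parent(("product",), computed_sizes)
print(parent)
```

Here the (product) cuboid would be derived from (product, country), the smallest of the three candidates, rather than from the base cuboid.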
6 Multi-way Array Aggregation for Cube Computation
- Partition arrays into chunks (a chunk is a small subcube that fits in memory)
- Compressed sparse array addressing: (chunk_id, offset)
- Compute aggregates in a multiway fashion by visiting cube cells in the order that minimizes the number of times each cell is visited, and reduces memory access and storage cost
What is the best traversing order to do multi-way aggregation?
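The (chunk_id, offset) addressing mentioned above can be sketched as follows. This is one possible layout, not necessarily the one used by Zhao et al. It assumes zero-based identifiers, the first dimension varying fastest, and chunk sizes that divide the cardinalities evenly. The names `cell_address` and `store` are illustrative.

```python
def cell_address(coords, cardinalities, chunk_sizes):
    """Map cell coordinates to a zero-based (chunk_id, offset) pair.

    Both numbers are linearized with the first dimension varying fastest.
    """
    chunk_id, offset = 0, 0
    chunk_stride, cell_stride = 1, 1
    for x, card, size in zip(coords, cardinalities, chunk_sizes):
        chunk_id += (x // size) * chunk_stride
        offset += (x % size) * cell_stride
        chunk_stride *= card // size  # number of chunks along this dimension
        cell_stride *= size
    return chunk_id, offset

def store(sparse_chunks, coords, value, cardinalities, chunk_sizes):
    """Keep only non-empty cells, as {chunk_id: {offset: value}}."""
    chunk_id, offset = cell_address(coords, cardinalities, chunk_sizes)
    sparse_chunks.setdefault(chunk_id, {})[offset] = value

# Hypothetical 4 x 4 array split into 2 x 2 chunks, with two non-empty cells.
sparse_chunks = {}
store(sparse_chunks, (3, 1), 12, (4, 4), (2, 2))
store(sparse_chunks, (0, 2), 7, (4, 4), (2, 2))
print(sparse_chunks)
```

Empty cells take no space in this representation, which is the point of compressing sparse chunks.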
7 Multi-way Array Aggregation for Cube Computation
- Example: 3-D data array containing 3 dimensions A, B, C
- Array is partitioned into small, memory-based chunks
- 64 chunks
- Dimension A is organized into 4 equi-sized partitions a0-a3
- Same for B and C
- Chunks 1, 2, ..., 64 correspond to the subcubes a0b0c0, a1b0c0, ..., a3b3c3
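The chunk numbering can be written as a small conversion. This sketch assumes that A varies fastest, then B, then C, which matches the sequence a0b0c0, a1b0c0, ..., a3b3c3; the function name `chunk_label` is illustrative.

```python
def chunk_label(chunk_number, parts_per_dim=4):
    """Map a 1-based chunk number to its subcube label, with A varying fastest."""
    index = chunk_number - 1
    a = index % parts_per_dim
    b = (index // parts_per_dim) % parts_per_dim
    c = index // (parts_per_dim * parts_per_dim)
    return f"a{a}b{b}c{c}"

for number in (1, 2, 5, 17, 64):
    print(number, chunk_label(number))
```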
8 Multi-way Array Aggregation for Cube Computation
- Example: 3-D data array containing 3 dimensions A, B, C
- Cardinality of the dimensions: A = 40, B = 400, C = 4000
- Size of each partition in A, B, C is therefore 10, 100, 1000
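The partition sizes follow from dividing each cardinality by the 4 partitions per dimension. A short check of that arithmetic, which also derives the number of cells in one 3-D chunk (a figure not stated on the slide):

```python
cardinality = {"A": 40, "B": 400, "C": 4000}
partitions_per_dim = 4

# Each dimension is split into equi-sized partitions.
partition_size = {dim: n // partitions_per_dim for dim, n in cardinality.items()}

# One chunk spans one partition along every dimension.
cells_per_chunk = 1
for size in partition_size.values():
    cells_per_chunk *= size

print(partition_size)
print(cells_per_chunk)
```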
9 Multi-way Array Aggregation for Cube Computation
- Example: 3-D data array containing 3 dimensions A, B, C
- Many possible orderings with which chunks can be read into memory
- Suppose we want to compute the b0c0 chunk of the BC cuboid
- By scanning chunks 1-4 of ABC, the b0c0 chunk is computed
- Cells for b0c0 are aggregated over a0 to a3
10 Multi-way Array Aggregation for Cube Computation
- Chunk memory can now be assigned to the next chunk b1c0, which completes its aggregation after scanning the next 4 chunks of ABC: 5-8
- Continuing in this manner, the entire BC cuboid can be computed
- ONLY 1 chunk of BC needs to be in memory for the computation of all chunks of BC
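The BC computation described above can be sketched with a scaled-down array. The partition sizes, the cell values, and the helpers `read_chunk` and `compute_bc` are all hypothetical. The structure follows the slides: 4 partitions per dimension, 64 chunks, and a single BC chunk buffer that is reused after every 4 chunks of ABC.

```python
PARTS = 4                           # partitions per dimension, as in the example
SIZE = {"A": 1, "B": 2, "C": 2}     # scaled-down partition sizes (assumption)

def read_chunk(i, j, k):
    """Yield (a, b, c, value) for every cell of chunk a_i b_j c_k (hypothetical data)."""
    for a in range(i * SIZE["A"], (i + 1) * SIZE["A"]):
        for b in range(j * SIZE["B"], (j + 1) * SIZE["B"]):
            for c in range(k * SIZE["C"], (k + 1) * SIZE["C"]):
                yield a, b, c, a + b + c

def compute_bc():
    bc = {}             # finished BC chunks (these would be written out to disk)
    chunks_read = 0
    for k in range(PARTS):
        for j in range(PARTS):
            buffer = {}                     # the single BC chunk held in memory
            for i in range(PARTS):          # the 4 ABC chunks along A
                chunks_read += 1
                for a, b, c, value in read_chunk(i, j, k):
                    buffer[(b, c)] = buffer.get((b, c), 0) + value
            bc.update(buffer)               # chunk b_j c_k is complete; reuse the buffer
    return bc, chunks_read

bc, chunks_read = compute_bc()
print(chunks_read, len(bc))
```

All 64 chunks are read once, yet the working buffer never holds more than one BC chunk.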
11 Multi-way Array Aggregation for Cube Computation
- In computing the entire BC cuboid, we will have examined all 64 chunks
- Is there a way to avoid having to rescan all of these chunks for the computation of other cuboids?
- YES
- MULTIWAY COMPUTATION OR SIMULTANEOUS AGGREGATION
12 Multi-way Array Aggregation for Cube Computation
- For example, when chunk 1 (a0b0c0) is being scanned (say for the computation of the 2D chunk b0c0 of BC), all of the other 2D chunks relating to a0b0c0 can be simultaneously computed
- That is, when a0b0c0 is being scanned, each of the three chunks, b0c0, a0c0, a0b0, on the three 2D aggregation planes, BC, AC, AB, should be computed then as well
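Simultaneous aggregation can be sketched as a single pass in which every cell updates all three 2D planes. The array below is a scaled-down stand-in with hypothetical partition sizes and a measure of 1 in every cell. For simplicity the sketch keeps the whole planes in dictionaries, so it shows the single scan but not the memory savings.

```python
from itertools import product

PARTS = 4           # partitions per dimension, as in the example
SIZE = (1, 2, 2)    # scaled-down partition sizes for A, B, C (assumption)

def read_chunk(i, j, k):
    """Yield (a, b, c, value) for each cell of chunk a_i b_j c_k (hypothetical data)."""
    ranges = [range(p * s, (p + 1) * s) for p, s in zip((i, j, k), SIZE)]
    for a, b, c in product(*ranges):
        yield a, b, c, 1    # every cell holds the measure 1

def multiway_aggregate():
    ab, ac, bc = {}, {}, {}
    chunks_read = 0
    for k in range(PARTS):
        for j in range(PARTS):
            for i in range(PARTS):          # chunk order 1, 2, ..., 64
                chunks_read += 1
                for a, b, c, value in read_chunk(i, j, k):
                    # One visit to the cell contributes to all three 2D planes.
                    ab[(a, b)] = ab.get((a, b), 0) + value
                    ac[(a, c)] = ac.get((a, c), 0) + value
                    bc[(b, c)] = bc.get((b, c), 0) + value
    return ab, ac, bc, chunks_read

ab, ac, bc, chunks_read = multiway_aggregate()
print(chunks_read, len(ab), len(ac), len(bc))
```

Each of the 64 chunks is read exactly once, compared with three full scans if AB, AC, and BC were computed separately.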
13 Multi-way Array Aggregation for Cube Computation
- Multiway computation simultaneously aggregates to each of the 2D planes while a 3D chunk is in memory
- Largest 2D plane is BC (400 × 4000 = 1,600,000), then AC (40 × 4000 = 160,000), and finally AB (40 × 400 = 16,000)
- Scan the chunks in the order shown below
14 Multi-way Array Aggregation for Cube Computation
[Figure: the b0c0 chunk, with axis B labeled]
15 Multi-way Array Aggregation for Cube Computation
[Figure: 3-D array with axes A (a0-a3), B (b0-b3), and C (c0-c3), partitioned into 64 chunks numbered 1-64]
16 Multi-way Array Aggregation for Cube Computation
- Suppose chunks are scanned in the numbered order 1-64
- One chunk of the largest 2D plane BC is fully computed for each row scanned
- b0c0 is fully aggregated after scanning chunks 1-4
- Similarly, b1c0 is fully aggregated after scanning chunks 5-8, and so on
- Complete computation of 1 chunk of the 2nd largest plane, AC, requires scanning 13 chunks, given the ordering 1-64
17 Multi-way Array Aggregation for Cube Computation
- Complete computation of 1 chunk of the 2nd largest plane, AC, requires scanning 13 chunks, given the ordering 1-64
- a0c0 is fully aggregated after scanning chunks 1, 5, 9, and 13
- For the smallest plane AB, the chunk a0b0 requires scanning 49 chunks (it aggregates chunks 1, 17, 33, and 49)
- NOTE that AB requires the longest scan of chunks
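The scan lengths 4, 13, and 49 follow from the distance between consecutive contributing chunks in the ordering 1-64. A minimal sketch, assuming A varies fastest and C slowest; the function name `chunks_needed` is illustrative.

```python
def chunks_needed(plane, parts=4):
    """Chunks scanned (order 1..parts**3, A fastest) before the first chunk
    of the given 2D plane is fully aggregated."""
    aggregated_dim = ({"A", "B", "C"} - set(plane)).pop()
    # Distance in the scan order between chunks that differ only in that dimension.
    stride = {"A": 1, "B": parts, "C": parts * parts}[aggregated_dim]
    # Contributing chunks are 1, 1 + stride, ..., 1 + (parts - 1) * stride.
    return 1 + (parts - 1) * stride

for plane in ("BC", "AC", "AB"):
    print(plane, chunks_needed(plane))
```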
18 Multi-Way Array Aggregation for Cube Computation (Cont.)
- Method: the planes should be sorted and computed according to their size in ascending order
- Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
- Limitation of the method: it performs well only for a small number of dimensions
- If there are a large number of dimensions, bottom-up computation and iceberg cube computation methods can be explored
19 Multi-Way Array Aggregation for Cube Computation (Cont.)
BEST (156,000 memory units)
WORST (1,641,000 memory units)
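The two figures can be reproduced with a short calculation. The cost model below is a reconstruction, not taken verbatim from the slides. For a scan order where x varies fastest and z slowest, it assumes memory must hold one chunk of the plane aggregated over x, one row of chunks of the plane aggregated over y, and the whole plane aggregated over z. The cardinalities and partition sizes are those of the A, B, C example.

```python
from itertools import permutations

CARD = {"A": 40, "B": 400, "C": 4000}       # dimension cardinalities
CHUNK = {"A": 10, "B": 100, "C": 1000}      # partition sizes

def memory_units(order):
    """Minimum memory for the three 2D planes when chunks are scanned with
    order[0] varying fastest and order[2] varying slowest."""
    x, y, z = order
    one_chunk = CHUNK[y] * CHUNK[z]     # plane yz, aggregated over x
    one_row = CARD[x] * CHUNK[z]        # plane xz, aggregated over y
    whole_plane = CARD[x] * CARD[y]     # plane xy, aggregated over z
    return one_chunk + one_row + whole_plane

costs = {order: memory_units(order) for order in permutations("ABC")}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print(best, costs[best])
print(worst, costs[worst])
```

Under this model the best order scans A fastest and C slowest, which keeps only the smallest plane AB entirely in memory. The worst order does the reverse and must hold the whole BC plane.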
20 References
- S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases, 506-521, Bombay, India, Sept. 1996.
- K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), 359-370, Philadelphia, PA, June 1999.
- V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, 205-216, Montreal, Canada, June 1996.
- Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, 159-170, Tucson, Arizona, May 1997.
21 Q & A
22 Thank You