Title: Data Cube Computation and Data Generalization
1Data Cube Computation and Data Generalization
Chapter 4
2- The Course
DS
OLAP
DW
DP
DS
DM
Association
DS
Classification
Clustering
DS Data source DW Data warehouse DM Data
Mining SD Staging Database
3Outline
- Types of cells
- Types of Cubes
- Efficient Computation of Data Cubes (4.1)
- Exploration and Discovery in Multidimensional
Databases (4.2)
4- Types of cells
- Types of cells
- Base cell a cell which belongs to a base cuboid
- Aggregate cell a cell which belongs to a
non-base cuboid - Ancestor-descendent cells-Example
- 1-D cell c1 (Dammam, , , 5,000,000) is an
ancestor of a 2-D cell c2 (Dammam, Corrola1, ,
400,000) and a 3-D cell c3 (Dammam, Corrola1,
20071104, 100,000). c3 is a descendent of c1 and
c2
5A Data Cube
Product
Branch cuboid
Base cuboid
Cor1
Cor2
Cam1
Cam2
Lex1
Lex2
All
Dammam
Branch
Jeddah
Riyadh
All
Product cuboid
Base cell
Apex Cuboid
Aggregate cell
6--- A Sample Data Cube
Total annual sales of TV in U.S.A.
7A Data Cube
8- Types of cubes
- Full cube All cells and cuboids materialized.
- Iceberg cube Only cells satisfying certain
condition are created. - Closed cube No ancestor cell is created if its
measure is equal to that of its descendent cell. - Shell cube Only cuboids with limited number of
dimensions are created.
9Outline
- Types of cells and cubes
- Efficient Computation of Data Cubes (4.1)
- Exploration and Discovery in Multidimensional
Databases (4.2)
10- Efficient Computation of Data Cubes
- Preliminary cube computation tricks
- Computing full/iceberg cubes 2 methodologies
- Top-Down Multi-Way array aggregation
- Bottom-Up Bottom-up computation BUC
- High-dimensional OLAP A Minimal Cubing Approach
- Computing alternative kinds of cubes
11-- Preliminary Cube Computation Tricks
- Sorting, hashing, and grouping operations are
applied to the dimension attributes in order to
reorder and cluster related tuples - Aggregates may be computed from previously
computed aggregates, rather than from the base
fact table - Smallest-child computing a cuboid from the
smallest, previously computed cuboid - Cache-results caching results of a cuboid from
which other cuboids are computed to reduce disk
I/Os - Amortize-scans computing as many as possible
cuboids at the same time to amortize disk reads - Share-sorts sharing sorting costs cross
multiple cuboids when sort-based method is used - Share-partitions sharing the partitioning cost
across multiple cuboids when hash-based
algorithms are used
12-- Multi-Way Array Aggregation
- Used for MOLAP and full cube computation
- Array-based bottom-up algorithm
- Using multi-dimensional chunks
- Simultaneous aggregation on multiple dimensions
- Intermediate aggregate values are re-used for
computing ancestor cuboids - Cannot do Apriori pruning No iceberg optimization
13 -- Multi-way Array Aggregation
- Partition arrays into chunks (a small subcube
which fits in memory). - Compressed sparse array addressing (chunk_id,
offset) - Compute aggregates in multiway by visiting cube
cells in the order which minimizes the of times
to visit each cell, and reduces memory access and
storage cost.
What is the best traversing order to do multi-way
aggregation?
14 -- Multi-way Array Aggregation
B
15 -- Multi-way Array Aggregation
C
64
63
62
61
c3
c2
48
47
46
45
c1
29
30
31
32
c 0
B
60
13
14
15
16
b3
44
28
B
56
9
b2
40
24
52
5
b1
36
20
1
2
3
4
b0
a1
a0
a2
a3
A
16 -- Multi-way Array Aggregation
- Assume the sizes of dimension, A, B, and C are
40, 400, 4000 respectively. - Therefore AB is the smallest and AC is the
largest 2-D planes - If chunks are scanned as 1, 2, 3, then 156,000
memory units are needed - If chunks are scanned as 1, 17, 33, 49, then
1,641,000 memory units are needed
17 -- Multi-way Array Aggregation
All
A
B
C
BC
AC
AB
ABC
Needs 156,000 Memory units
Needs 1,641,000 Memory units
18 -- Multi-way Array Aggregation
- Method the planes should be sorted and computed
according to their size in ascending order - Idea keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for
the largest plane - Limitation of the method computing well only for
a small number of dimensions - If there are a large number of dimensions,
top-down computation and iceberg cube
computation methods can be explored
19-- Bottom-Up Computation (BUC)
- Bottom-up cube computation
- (Note top-down in our view!)
- Divides dimensions into partitions and
facilitates iceberg pruning - If a partition does not satisfy min_sup, its
descendants can be pruned - If minsup 1 Þ compute full CUBE!
- No simultaneous aggregation
20--- BUC Partitioning
- Usually, entire data set
cant fit in main memory - Sort distinct values, partition into blocks that
fit - Continue processing
- Optimizations
- Partitioning
- External Sorting, Hashing, Counting Sort
- Ordering dimensions to encourage pruning
- Cardinality, Skew, Correlation
- Collapsing duplicates
- Cant do holistic aggregates anymore!
21--- BUC Example (Having count() gt 5)
JED
1
1
3
DAM
2
1
COR
8
CAM
LEX
Q1
Q2
Q3
Q4
DAMMAM
JEDDAH
COR
COR
CAM
CAM
LEX
LEX
Q1
Q2
Q3
Q4
Q1
Q2
Q3
Q4
22 --- BUC Example (Having count() gt 5)
All
All
1
Q
2
7
5
P
Q
B
Q1
Q2
Q3
Q4
6
3
4
P,B
Q,P
Q,B
Q,P
COR
CAM
LEX
B,P,Q
Q1
Q2
Q3
Q4
23-- Motivation of High-D OLAP
- Challenge to current cubing methods
- The curse of dimensionality problem
- Iceberg cubes only delay the inevitable
explosion - Full materialization still significant overhead
in accessing results on disk - High-D OLAP is needed in applications
- Science and engineering analysis
- Bio-data analysis thousands of genes
- Statistical surveys hundreds of variables
24--- Fast High-D OLAP with Minimal Cubing
- Observation OLAP occurs only on a small subset
of dimensions at a time - Semi-Online Computational Model
- Partition the set of dimensions into shell
fragments - Compute data cubes for each shell fragment while
retaining inverted indices or value-list indices - Given the pre-computed fragment cubes,
dynamically compute cube cells of the
high-dimensional data cube online
25--- Properties of Proposed Method
- Partitions the data vertically
- Reduces high-dimensional cube into a set of lower
dimensional cubes - Online re-construction of original
high-dimensional space - Lossless reduction
- Offers tradeoffs between the amount of
pre-processing and the speed of online computation
26--- Example Computation
- Let the cube aggregation function be count
- Divide the 5 dimensions into 2 shell fragments
- (A, B, C) and (D, E)
27--- 1-D Inverted Indices
- Build traditional invert index or RID list
28--- Shell Fragment Cubes
- Generalize the 1-D inverted indices to
multi-dimensional ones in the data cube sense
29 ---- Shell Fragment Cubes
- Compute all cuboids for data cubes ABC and DE
while retaining the inverted indices - For example, shell fragment cube ABC contains 7
cuboids - A, B, C
- AB, AC, BC
- ABC
- This completes the offline computation stage
30. --- Shell Fragment Cubes
- Shell fragments do not have to be disjoint
- Fragment groupings can be arbitrary to allow for
maximum online performance - Known common combinations (e.g.,ltcity, stategt)
should be grouped together. - Shell fragment sizes can be adjusted for optimal
balance between offline and online computation
31--- ID_Measure Table
- If measures other than count are present, store
in ID_measure table separate from the shell
fragments
32--- The Frag-Shells Algorithm
- Partition set of dimension (A1,,An) into a set
of k fragments (P1,,Pk). - Scan base table once and do the following
- insert lttid, measuregt into ID_measure table.
- for each attribute value ai of each
dimension Ai - build inverted index entry ltai,
tidlistgt - For each fragment partition Pi
- build local fragment cube Si by
intersecting tid-lists in bottom- up
fashion.
33 --- Frag-Shells
D Cuboid
Dimensions
EF Cuboid
DE Cuboid
ABC Cube
DEF Cube
34--- Online Query Computation
- A query has the general form
- Each ai has 3 possible values
- Instantiated value
- Aggregate function
- Inquire ? function
- For example,
returns a 2-D data cube.
35 --- Online Query Computation
- Given the fragment cubes, process a query as
follows - Divide the query into fragment, same as the shell
- Fetch the corresponding TID list for each
fragment from the fragment cube - Intersect the TID lists from each fragment to
construct instantiated base table - Compute the data cube using the base table with
any cubing algorithm
36 --- Online Query Computation
Online Cube
Instantiated Base Table
37Outline
- Types of cells and cubes
- Efficient Computation of Data Cubes (4.1)
- Exploration and Discovery in Multidimensional
Databases (4.2)
38-- Discovery-Driven Exploration of Data Cubes
- Hypothesis-driven
- exploration by user, huge search space
- Discovery-driven
- Effective navigation of large OLAP data cubes
- pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of
aggregation - Exception significantly different from the value
anticipated, based on a statistical model - Visual cues such as background color are used to
reflect the degree of exception of each cell
39-- Kinds of Exceptions and their Computation
- Parameters
- SelfExp surprise of cell relative to other cells
at same level of aggregation - InExp surprise beneath the cell
- PathExp surprise beneath cell for each
drill-down path - Computation of exception indicator (modeling
fitting and computing SelfExp, InExp, and PathExp
values) can be overlapped with cube construction - Exception themselves can be stored, indexed and
retrieved like precomputed aggregates
40--- Examples Discovery-Driven Data Cubes
41END