Data Cube Computation and Data Generalization - PowerPoint PPT Presentation

1 / 41

About This Presentation

Title:

Data Cube Computation and Data Generalization

Description:

Efficient Computation of Data Cubes (4.1) ... Cam1. Cam2. Lex1. Lex2. Dammam. Jeddah. Riyadh. 44. 48. 47. All. 11. 23. 17. 26. 29. 33. 139. All. Branch ... – PowerPoint PPT presentation

Number of Views:2252

Avg rating:3.0/5.0

Slides: 42

Provided by: Jiawe7

Category:

more less

Transcript and Presenter's Notes

Title: Data Cube Computation and Data Generalization

1
Data Cube Computation and Data Generalization
Chapter 4
2
- The Course
DS
OLAP
DW
DP
DS
DM
Association
DS
Classification
Clustering
DS Data source DW Data warehouse DM Data
Mining SD Staging Database
3
Outline

Types of cells
Types of Cubes
Efficient Computation of Data Cubes (4.1)
Exploration and Discovery in Multidimensional
Databases (4.2)

4
- Types of cells

Types of cells
Base cell a cell which belongs to a base cuboid
Aggregate cell a cell which belongs to a
non-base cuboid
Ancestor-descendent cells-Example
1-D cell c1 (Dammam, , , 5,000,000) is an
ancestor of a 2-D cell c2 (Dammam, Corrola1, ,
400,000) and a 3-D cell c3 (Dammam, Corrola1,
20071104, 100,000). c3 is a descendent of c1 and
c2

5
A Data Cube
Product
Branch cuboid
Base cuboid
Cor1
Cor2
Cam1
Cam2
Lex1
Lex2
All
Dammam
Branch
Jeddah
Riyadh
All
Product cuboid
Base cell
Apex Cuboid
Aggregate cell
6
--- A Sample Data Cube
Total annual sales of TV in U.S.A.
7
A Data Cube
8
- Types of cubes

Full cube All cells and cuboids materialized.
Iceberg cube Only cells satisfying certain
condition are created.
Closed cube No ancestor cell is created if its
measure is equal to that of its descendent cell.
Shell cube Only cuboids with limited number of
dimensions are created.

9
Outline

Types of cells and cubes
Efficient Computation of Data Cubes (4.1)
Exploration and Discovery in Multidimensional
Databases (4.2)

10
- Efficient Computation of Data Cubes

Preliminary cube computation tricks
Computing full/iceberg cubes 2 methodologies
Top-Down Multi-Way array aggregation
Bottom-Up Bottom-up computation BUC
High-dimensional OLAP A Minimal Cubing Approach
Computing alternative kinds of cubes

11
-- Preliminary Cube Computation Tricks

Sorting, hashing, and grouping operations are
applied to the dimension attributes in order to
reorder and cluster related tuples
Aggregates may be computed from previously
computed aggregates, rather than from the base
fact table
Smallest-child computing a cuboid from the
smallest, previously computed cuboid
Cache-results caching results of a cuboid from
which other cuboids are computed to reduce disk
I/Os
Amortize-scans computing as many as possible
cuboids at the same time to amortize disk reads
Share-sorts sharing sorting costs cross
multiple cuboids when sort-based method is used
Share-partitions sharing the partitioning cost
across multiple cuboids when hash-based
algorithms are used

12
-- Multi-Way Array Aggregation

Used for MOLAP and full cube computation
Array-based bottom-up algorithm
Using multi-dimensional chunks
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for
computing ancestor cuboids
Cannot do Apriori pruning No iceberg optimization

13
-- Multi-way Array Aggregation

Partition arrays into chunks (a small subcube
which fits in memory).
Compressed sparse array addressing (chunk_id,
offset)
Compute aggregates in multiway by visiting cube
cells in the order which minimizes the of times
to visit each cell, and reduces memory access and
storage cost.

What is the best traversing order to do multi-way
aggregation?
14
-- Multi-way Array Aggregation
B
15
-- Multi-way Array Aggregation
C
64
63
62
61
c3
c2
48
47
46
45
c1
29
30
31
32
c 0
B
60
13
14
15
16
b3
44
28
B
56
9
b2
40
24
52
5
b1
36
20
1
2
3
4
b0
a1
a0
a2
a3
A
16
-- Multi-way Array Aggregation

Assume the sizes of dimension, A, B, and C are
40, 400, 4000 respectively.
Therefore AB is the smallest and AC is the
largest 2-D planes
If chunks are scanned as 1, 2, 3, then 156,000
memory units are needed
If chunks are scanned as 1, 17, 33, 49, then
1,641,000 memory units are needed

17
-- Multi-way Array Aggregation
All
A
B
C
BC
AC
AB
ABC
Needs 156,000 Memory units
Needs 1,641,000 Memory units
18
-- Multi-way Array Aggregation

Method the planes should be sorted and computed
according to their size in ascending order
Idea keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for
the largest plane
Limitation of the method computing well only for
a small number of dimensions
If there are a large number of dimensions,
top-down computation and iceberg cube
computation methods can be explored

19
-- Bottom-Up Computation (BUC)

Bottom-up cube computation
(Note top-down in our view!)
Divides dimensions into partitions and
facilitates iceberg pruning
If a partition does not satisfy min_sup, its
descendants can be pruned
If minsup 1 Þ compute full CUBE!
No simultaneous aggregation

20
--- BUC Partitioning

Usually, entire data set
cant fit in main memory
Sort distinct values, partition into blocks that
fit
Continue processing
Optimizations
Partitioning
External Sorting, Hashing, Counting Sort
Ordering dimensions to encourage pruning
Cardinality, Skew, Correlation
Collapsing duplicates
Cant do holistic aggregates anymore!

21
--- BUC Example (Having count() gt 5)
JED
1
1
3
DAM
2
1
COR
8
CAM
LEX
Q1
Q2
Q3
Q4
DAMMAM
JEDDAH
COR
COR
CAM
CAM
LEX
LEX
Q1
Q2
Q3
Q4
Q1
Q2
Q3
Q4
22
--- BUC Example (Having count() gt 5)
All
All
1
Q
2
7
5
P
Q
B
Q1
Q2
Q3
Q4
6
3
4
P,B
Q,P
Q,B
Q,P
COR
CAM
LEX
B,P,Q
Q1
Q2
Q3
Q4
23
-- Motivation of High-D OLAP

Challenge to current cubing methods
The curse of dimensionality problem
Iceberg cubes only delay the inevitable
explosion
Full materialization still significant overhead
in accessing results on disk
High-D OLAP is needed in applications
Science and engineering analysis
Bio-data analysis thousands of genes
Statistical surveys hundreds of variables

24
--- Fast High-D OLAP with Minimal Cubing

Observation OLAP occurs only on a small subset
of dimensions at a time
Semi-Online Computational Model
Partition the set of dimensions into shell
fragments
Compute data cubes for each shell fragment while
retaining inverted indices or value-list indices
Given the pre-computed fragment cubes,
dynamically compute cube cells of the
high-dimensional data cube online

25
--- Properties of Proposed Method

Partitions the data vertically
Reduces high-dimensional cube into a set of lower
dimensional cubes
Online re-construction of original
high-dimensional space
Lossless reduction
Offers tradeoffs between the amount of
pre-processing and the speed of online computation

26
--- Example Computation

Let the cube aggregation function be count
Divide the 5 dimensions into 2 shell fragments
(A, B, C) and (D, E)

27
--- 1-D Inverted Indices

Build traditional invert index or RID list

28
--- Shell Fragment Cubes

Generalize the 1-D inverted indices to
multi-dimensional ones in the data cube sense

29
---- Shell Fragment Cubes

Compute all cuboids for data cubes ABC and DE
while retaining the inverted indices
For example, shell fragment cube ABC contains 7
cuboids
A, B, C
AB, AC, BC
ABC
This completes the offline computation stage

30
. --- Shell Fragment Cubes

Shell fragments do not have to be disjoint
Fragment groupings can be arbitrary to allow for
maximum online performance
Known common combinations (e.g.,ltcity, stategt)
should be grouped together.
Shell fragment sizes can be adjusted for optimal
balance between offline and online computation

31
--- ID_Measure Table

If measures other than count are present, store
in ID_measure table separate from the shell
fragments

32
--- The Frag-Shells Algorithm

Partition set of dimension (A1,,An) into a set
of k fragments (P1,,Pk).
Scan base table once and do the following
insert lttid, measuregt into ID_measure table.
for each attribute value ai of each
dimension Ai
build inverted index entry ltai,
tidlistgt
For each fragment partition Pi
build local fragment cube Si by
intersecting tid-lists in bottom- up
fashion.

33
--- Frag-Shells
D Cuboid
Dimensions
EF Cuboid
DE Cuboid
ABC Cube
DEF Cube
34
--- Online Query Computation

A query has the general form
Each ai has 3 possible values
Instantiated value
Aggregate function
Inquire ? function
For example,
returns a 2-D data cube.

35
--- Online Query Computation

Given the fragment cubes, process a query as
follows
Divide the query into fragment, same as the shell
Fetch the corresponding TID list for each
fragment from the fragment cube
Intersect the TID lists from each fragment to
construct instantiated base table
Compute the data cube using the base table with
any cubing algorithm

36
--- Online Query Computation
Online Cube
Instantiated Base Table
37
Outline

Types of cells and cubes
Efficient Computation of Data Cubes (4.1)
Exploration and Discovery in Multidimensional
Databases (4.2)

38
-- Discovery-Driven Exploration of Data Cubes

Hypothesis-driven
exploration by user, huge search space
Discovery-driven
Effective navigation of large OLAP data cubes
pre-compute measures indicating exceptions, guide
user in the data analysis, at all levels of
aggregation
Exception significantly different from the value
anticipated, based on a statistical model
Visual cues such as background color are used to
reflect the degree of exception of each cell

39
-- Kinds of Exceptions and their Computation

Parameters
SelfExp surprise of cell relative to other cells
at same level of aggregation
InExp surprise beneath the cell
PathExp surprise beneath cell for each
drill-down path
Computation of exception indicator (modeling
fitting and computing SelfExp, InExp, and PathExp
values) can be overlapped with cube construction
Exception themselves can be stored, indexed and
retrieved like precomputed aggregates