Data Cube Computation and Data Generalization - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Data Cube Computation and Data Generalization

Description:

Efficient Computation of Data Cubes (4.1) ... Cam1. Cam2. Lex1. Lex2. Dammam. Jeddah. Riyadh. 44. 48. 47. All. 11. 23. 17. 26. 29. 33. 139. All. Branch ... – PowerPoint PPT presentation

Number of Views:2251
Avg rating:3.0/5.0
Slides: 42
Provided by: Jiawe7
Category:

less

Transcript and Presenter's Notes

Title: Data Cube Computation and Data Generalization


1
Data Cube Computation and Data Generalization
Chapter 4
2
- The Course
DS
OLAP
DW
DP
DS
DM
Association
DS
Classification
Clustering
DS Data source DW Data warehouse DM Data
Mining SD Staging Database
3
Outline
  • Types of cells
  • Types of Cubes
  • Efficient Computation of Data Cubes (4.1)
  • Exploration and Discovery in Multidimensional
    Databases (4.2)

4
- Types of cells
  • Types of cells
  • Base cell a cell which belongs to a base cuboid
  • Aggregate cell a cell which belongs to a
    non-base cuboid
  • Ancestor-descendent cells-Example
  • 1-D cell c1 (Dammam, , , 5,000,000) is an
    ancestor of a 2-D cell c2 (Dammam, Corrola1, ,
    400,000) and a 3-D cell c3 (Dammam, Corrola1,
    20071104, 100,000). c3 is a descendent of c1 and
    c2

5
A Data Cube
Product
Branch cuboid
Base cuboid
Cor1
Cor2
Cam1
Cam2
Lex1
Lex2
All
Dammam
Branch
Jeddah
Riyadh
All
Product cuboid
Base cell
Apex Cuboid
Aggregate cell
6
--- A Sample Data Cube
Total annual sales of TV in U.S.A.
7
A Data Cube
8
- Types of cubes
  • Full cube All cells and cuboids materialized.
  • Iceberg cube Only cells satisfying certain
    condition are created.
  • Closed cube No ancestor cell is created if its
    measure is equal to that of its descendent cell.
  • Shell cube Only cuboids with limited number of
    dimensions are created.

9
Outline
  • Types of cells and cubes
  • Efficient Computation of Data Cubes (4.1)
  • Exploration and Discovery in Multidimensional
    Databases (4.2)

10
- Efficient Computation of Data Cubes
  • Preliminary cube computation tricks
  • Computing full/iceberg cubes 2 methodologies
  • Top-Down Multi-Way array aggregation
  • Bottom-Up Bottom-up computation BUC
  • High-dimensional OLAP A Minimal Cubing Approach
  • Computing alternative kinds of cubes

11
-- Preliminary Cube Computation Tricks
  • Sorting, hashing, and grouping operations are
    applied to the dimension attributes in order to
    reorder and cluster related tuples
  • Aggregates may be computed from previously
    computed aggregates, rather than from the base
    fact table
  • Smallest-child computing a cuboid from the
    smallest, previously computed cuboid
  • Cache-results caching results of a cuboid from
    which other cuboids are computed to reduce disk
    I/Os
  • Amortize-scans computing as many as possible
    cuboids at the same time to amortize disk reads
  • Share-sorts sharing sorting costs cross
    multiple cuboids when sort-based method is used
  • Share-partitions sharing the partitioning cost
    across multiple cuboids when hash-based
    algorithms are used

12
-- Multi-Way Array Aggregation
  • Used for MOLAP and full cube computation
  • Array-based bottom-up algorithm
  • Using multi-dimensional chunks
  • Simultaneous aggregation on multiple dimensions
  • Intermediate aggregate values are re-used for
    computing ancestor cuboids
  • Cannot do Apriori pruning No iceberg optimization

13
-- Multi-way Array Aggregation
  • Partition arrays into chunks (a small subcube
    which fits in memory).
  • Compressed sparse array addressing (chunk_id,
    offset)
  • Compute aggregates in multiway by visiting cube
    cells in the order which minimizes the of times
    to visit each cell, and reduces memory access and
    storage cost.

What is the best traversing order to do multi-way
aggregation?
14
-- Multi-way Array Aggregation
B
15
-- Multi-way Array Aggregation
C
64
63
62
61
c3
c2
48
47
46
45
c1
29
30
31
32
c 0
B
60
13
14
15
16
b3
44
28
B
56
9
b2
40
24
52
5
b1
36
20
1
2
3
4
b0
a1
a0
a2
a3
A
16
-- Multi-way Array Aggregation
  • Assume the sizes of dimension, A, B, and C are
    40, 400, 4000 respectively.
  • Therefore AB is the smallest and AC is the
    largest 2-D planes
  • If chunks are scanned as 1, 2, 3, then 156,000
    memory units are needed
  • If chunks are scanned as 1, 17, 33, 49, then
    1,641,000 memory units are needed

17
-- Multi-way Array Aggregation
All
A
B
C
BC
AC
AB
ABC
Needs 156,000 Memory units
Needs 1,641,000 Memory units
18
-- Multi-way Array Aggregation
  • Method the planes should be sorted and computed
    according to their size in ascending order
  • Idea keep the smallest plane in the main memory,
    fetch and compute only one chunk at a time for
    the largest plane
  • Limitation of the method computing well only for
    a small number of dimensions
  • If there are a large number of dimensions,
    top-down computation and iceberg cube
    computation methods can be explored

19
-- Bottom-Up Computation (BUC)
  • Bottom-up cube computation
  • (Note top-down in our view!)
  • Divides dimensions into partitions and
    facilitates iceberg pruning
  • If a partition does not satisfy min_sup, its
    descendants can be pruned
  • If minsup 1 Þ compute full CUBE!
  • No simultaneous aggregation

20
--- BUC Partitioning
  • Usually, entire data set
    cant fit in main memory
  • Sort distinct values, partition into blocks that
    fit
  • Continue processing
  • Optimizations
  • Partitioning
  • External Sorting, Hashing, Counting Sort
  • Ordering dimensions to encourage pruning
  • Cardinality, Skew, Correlation
  • Collapsing duplicates
  • Cant do holistic aggregates anymore!

21
--- BUC Example (Having count() gt 5)
JED
1
1
3
DAM
2
1
COR
8
CAM
LEX
Q1
Q2
Q3
Q4
DAMMAM
JEDDAH
COR
COR
CAM
CAM
LEX
LEX
Q1
Q2
Q3
Q4
Q1
Q2
Q3
Q4
22
--- BUC Example (Having count() gt 5)
All
All
1
Q
2
7
5
P
Q
B
Q1
Q2
Q3
Q4
6
3
4
P,B
Q,P
Q,B
Q,P
COR
CAM
LEX
B,P,Q
Q1
Q2
Q3
Q4
23
-- Motivation of High-D OLAP
  • Challenge to current cubing methods
  • The curse of dimensionality problem
  • Iceberg cubes only delay the inevitable
    explosion
  • Full materialization still significant overhead
    in accessing results on disk
  • High-D OLAP is needed in applications
  • Science and engineering analysis
  • Bio-data analysis thousands of genes
  • Statistical surveys hundreds of variables

24
--- Fast High-D OLAP with Minimal Cubing
  • Observation OLAP occurs only on a small subset
    of dimensions at a time
  • Semi-Online Computational Model
  • Partition the set of dimensions into shell
    fragments
  • Compute data cubes for each shell fragment while
    retaining inverted indices or value-list indices
  • Given the pre-computed fragment cubes,
    dynamically compute cube cells of the
    high-dimensional data cube online

25
--- Properties of Proposed Method
  • Partitions the data vertically
  • Reduces high-dimensional cube into a set of lower
    dimensional cubes
  • Online re-construction of original
    high-dimensional space
  • Lossless reduction
  • Offers tradeoffs between the amount of
    pre-processing and the speed of online computation

26
--- Example Computation
  • Let the cube aggregation function be count
  • Divide the 5 dimensions into 2 shell fragments
  • (A, B, C) and (D, E)

27
--- 1-D Inverted Indices
  • Build traditional invert index or RID list

28
--- Shell Fragment Cubes
  • Generalize the 1-D inverted indices to
    multi-dimensional ones in the data cube sense

29
---- Shell Fragment Cubes
  • Compute all cuboids for data cubes ABC and DE
    while retaining the inverted indices
  • For example, shell fragment cube ABC contains 7
    cuboids
  • A, B, C
  • AB, AC, BC
  • ABC
  • This completes the offline computation stage

30
. --- Shell Fragment Cubes
  • Shell fragments do not have to be disjoint
  • Fragment groupings can be arbitrary to allow for
    maximum online performance
  • Known common combinations (e.g.,ltcity, stategt)
    should be grouped together.
  • Shell fragment sizes can be adjusted for optimal
    balance between offline and online computation

31
--- ID_Measure Table
  • If measures other than count are present, store
    in ID_measure table separate from the shell
    fragments

32
--- The Frag-Shells Algorithm
  • Partition set of dimension (A1,,An) into a set
    of k fragments (P1,,Pk).
  • Scan base table once and do the following
  • insert lttid, measuregt into ID_measure table.
  • for each attribute value ai of each
    dimension Ai
  • build inverted index entry ltai,
    tidlistgt
  • For each fragment partition Pi
  • build local fragment cube Si by
    intersecting tid-lists in bottom- up
    fashion.

33
--- Frag-Shells
D Cuboid
Dimensions
EF Cuboid
DE Cuboid
ABC Cube
DEF Cube
34
--- Online Query Computation
  • A query has the general form
  • Each ai has 3 possible values
  • Instantiated value
  • Aggregate function
  • Inquire ? function
  • For example,
    returns a 2-D data cube.

35
--- Online Query Computation
  • Given the fragment cubes, process a query as
    follows
  • Divide the query into fragment, same as the shell
  • Fetch the corresponding TID list for each
    fragment from the fragment cube
  • Intersect the TID lists from each fragment to
    construct instantiated base table
  • Compute the data cube using the base table with
    any cubing algorithm

36
--- Online Query Computation
Online Cube
Instantiated Base Table
37
Outline
  • Types of cells and cubes
  • Efficient Computation of Data Cubes (4.1)
  • Exploration and Discovery in Multidimensional
    Databases (4.2)

38
-- Discovery-Driven Exploration of Data Cubes
  • Hypothesis-driven
  • exploration by user, huge search space
  • Discovery-driven
  • Effective navigation of large OLAP data cubes
  • pre-compute measures indicating exceptions, guide
    user in the data analysis, at all levels of
    aggregation
  • Exception significantly different from the value
    anticipated, based on a statistical model
  • Visual cues such as background color are used to
    reflect the degree of exception of each cell

39
-- Kinds of Exceptions and their Computation
  • Parameters
  • SelfExp surprise of cell relative to other cells
    at same level of aggregation
  • InExp surprise beneath the cell
  • PathExp surprise beneath cell for each
    drill-down path
  • Computation of exception indicator (modeling
    fitting and computing SelfExp, InExp, and PathExp
    values) can be overlapped with cube construction
  • Exception themselves can be stored, indexed and
    retrieved like precomputed aggregates

40
--- Examples Discovery-Driven Data Cubes
41
END
Write a Comment
User Comments (0)
About PowerShow.com