Title: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates
1. An Array-Based Algorithm for Simultaneous Multidimensional Aggregates
- Y. Zhao, P. Deshpande, J. Naughton
2. Motivation
- Previous papers showed the usefulness of the CUBE operator. There are several algorithms for computing the CUBE in relational OLAP (ROLAP) systems.
- This paper proposes an algorithm for computing the CUBE in multidimensional OLAP (MOLAP) systems.
3. CUBE in ROLAP
- In ROLAP systems, three main ideas are used to compute the CUBE efficiently:
  - Group related tuples together (using sorting or hashing).
  - Use grouping performed on sub-aggregates to speed computation.
  - Compute an aggregate from another aggregate rather than from the base table.
4. CUBE in MOLAP
- ROLAP algorithms cannot be transferred directly to MOLAP, because of the nature of the data.
  - In ROLAP, data is stored in tuples that can be sorted and reordered by value.
  - In MOLAP, data cannot be rearranged, because the position of the data determines the attribute values.
5. Multidimensional Array Storage
- Data is stored in large, sparse arrays, which leads to two problems:
  - The array may be too big for memory.
  - Many of the cells may be empty, leaving the array too sparse.
6. Chunking Arrays
- Why chunk?
  - A simple row-major layout (partitioning by dimension) favors certain dimensions over others.
- What is chunking?
  - A method for dividing an n-dimensional array into smaller n-dimensional chunks and storing each chunk as one object on disk.
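As a sketch of the idea (not the paper's code), mapping a cell coordinate to its chunk number and its offset within that chunk might look like this, assuming chunk sizes evenly divide the dimension sizes and row-major layout both across chunks and within a chunk:

```python
def chunk_of(coords, dim_sizes, chunk_sizes):
    """Map an n-D cell coordinate to (chunk_number, offset_in_chunk).

    Chunks are numbered in row-major order over the grid of chunks,
    and cells are row-major within each chunk. Names and layout are
    illustrative assumptions, not taken from the paper.
    """
    chunk_idx = [c // s for c, s in zip(coords, chunk_sizes)]      # which chunk
    local_idx = [c % s for c, s in zip(coords, chunk_sizes)]       # where inside it
    chunks_per_dim = [d // s for d, s in zip(dim_sizes, chunk_sizes)]

    def linearize(idx, sizes):
        # Row-major linearization of a multi-index.
        pos = 0
        for i, n in zip(idx, sizes):
            pos = pos * n + i
        return pos

    return linearize(chunk_idx, chunks_per_dim), linearize(local_idx, chunk_sizes)
```

For example, with 16x16 dimensions and 4x4 chunks, cell (5, 6) falls in the chunk at grid position (1, 1) with local offset (1, 2).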
7. Chunks
- (Figure: a two-dimensional array partitioned into chunks of size CA x CB along Dimension A and Dimension B.)
8. Array Compression
- Chunk-offset compression: for each valid entry, we store the pair (offsetInChunk, data), where offsetInChunk is the offset from the start of the chunk.
- Dense chunks (defined as more than 40% filled with data) are stored as plain arrays; compression is applied to the sparse chunks.
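A minimal sketch of chunk-offset compression, with `None` marking empty cells and the 40% density threshold from the slide (function name and return format are mine):

```python
def compress_chunk(chunk_cells, density_threshold=0.4):
    """Chunk-offset compression sketch: keep (offset, value) pairs for
    the valid (non-empty) cells of a sparse chunk. Chunks denser than
    the threshold are kept as plain arrays."""
    valid = [(off, v) for off, v in enumerate(chunk_cells) if v is not None]
    density = len(valid) / len(chunk_cells)
    if density > density_threshold:
        return ("dense", chunk_cells)   # store the full array as-is
    return ("compressed", valid)        # store sparse (offset, data) pairs
```

A chunk that is 20% full is stored as offset/value pairs; one that is 75% full is kept whole.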
9. Naïve Array Cubing Algorithm
- As in ROLAP, each aggregate is computed from its parent in the lattice.
- Each chunk is aggregated completely and then written to disk before moving on to the next chunk.
10. Illustrative Example
- Example for BC:
  - Start with the BC face at the first position of A and sweep through dimension A to aggregate.
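The sweep can be sketched as follows for a dense ABC array held as nested Python lists, with sum as the (assumed) aggregate function:

```python
def aggregate_out_first_dim(abc):
    """Compute the BC group-by from an ABC array by sweeping through
    dimension A and adding each BC face into an accumulator.
    A toy in-memory sketch, not the paper's chunked implementation."""
    nb, nc = len(abc[0]), len(abc[0][0])
    bc = [[0] * nc for _ in range(nb)]
    for face in abc:                 # one BC face per value of A
        for b in range(nb):
            for c in range(nc):
                bc[b][c] += face[b][c]
    return bc
```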
11. Problems with the Naïve Approach
- Each sub-aggregate is calculated independently.
  - E.g., this algorithm will compute AB from ABC, then rescan ABC to calculate AC, then rescan ABC again to calculate BC.
- We need a method to simultaneously compute all children of a parent in a single pass over the parent.
12. Single-Pass Multi-Way Array Cubing Algorithm
- The order of scanning is vitally important in determining how much memory is needed to compute the aggregates.
- A dimension order O = (Dj1, Dj2, ..., Djn) defines the order in which dimensions are scanned.
- Di = size of dimension i
- Ci = size of the chunk for dimension i
- In general, Ci << Di.
13. Example of Multi-Way Array
14. Concrete Example
- Ci = 4, Di = 16
- For the BC group-by, we need to keep 1 chunk (4x4) in memory.
- For AC, we need 4 chunks (16x4).
- For AB, we need to keep track of a whole slice of the AB plane (16x16).
15. How Much Memory?
- A formula for the minimum amount of memory can be generalized.
- Define p = size of the largest common prefix between the current group-by and its parent.
- Minimum memory = (D1 x ... x Dp) x (C(p+1) x ... x C(n-1))
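The formula translates directly into Python (`min_memory` is an illustrative name; `dims` and `chunks` list Di and Ci for the group-by's own dimensions in the order O, and `math.prod` of an empty slice is 1, which covers an empty prefix):

```python
from math import prod

def min_memory(dims, chunks, p):
    """Minimum memory for one group-by: the p dimensions inside the
    largest common prefix contribute their full size Di; the remaining
    group-by dimensions contribute only a chunk size Ci."""
    return prod(dims[:p]) * prod(chunks[p:])
```

For the ABD group-by of the next slide (Di = 100, 200, 400 and Ci = 10, prefix AB so p = 2), this gives 100 x 200 x 10 = 200,000.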
16. Example Calculation
- O = (A, B, C, D), Ci = 10 for all i,
- Di = 100, 200, 300, 400
- For the ABD group-by, the max common prefix is AB. Therefore the minimum amount of memory necessary is
- DA x DB x CD = 100 x 200 x 10
17. More Memory Notes
- In simple terms, every element q in the common prefix contributes Dq, while every other element r not in the prefix contributes Cr.
- Since Ci << Di, to minimize memory usage we should minimize the max common prefix and reorder the dimension order so that the largest dimensions appear in the fewest prefixes.
18. Minimum Memory Spanning Trees
- O = (A, B, C)
- Why is the cost of B equal to 4?
19. Minimum Memory Spanning Trees (cont.)
- Using the formula for calculating the minimum amount of memory, we can build a minimum memory spanning tree (MMST), such that the total memory requirement is minimal for a given dimension order.
- For different dimension orders, the MMSTs may be very different, with very different memory requirements.
20. Effects of Dimension Order
21. More Effects of Dimension Order
- The early elements in O (particularly the first one) appear in the most prefixes and therefore contribute their dimension sizes to the memory requirements.
- The last element in O can never appear in any prefix. Therefore, the total memory requirement for computing the CUBE is independent of the size of the last dimension.
22. Optimal Dimension Order
- Based on the previous two ideas, the optimal ordering is to sort the dimensions in increasing order of size.
- The total memory requirement will then be minimized and will be independent of the size of the largest dimension.
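Both effects can be checked with a small sketch that costs only the (n-1)-dimensional children of the full group-by (a simplification of the full MMST cost; names are mine). Dropping the dimension at position k leaves a common prefix of length k, so the prefix contributes Di and the rest contributes Ci:

```python
from math import prod

def total_child_memory(dims, chunks):
    """Total memory to hold all (n-1)-dimensional children of the full
    group-by under the given dimension order. Dropping dimension k
    leaves the prefix dims[:k] (contributing Di) and the suffix
    dims[k+1:] (contributing Ci)."""
    n = len(dims)
    return sum(prod(dims[:k]) * prod(chunks[k + 1:]) for k in range(n))
```

Sorting in increasing size gives a far smaller total than decreasing size, and the total does not change when the largest (last) dimension grows.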
23. Graphs and Results
24. ROLAP vs. MOLAP
25. MOLAP Wins
26. MOLAP for ROLAP Systems
- The last chart demonstrates one of the unexpected results from this paper.
- We can use the MOLAP algorithm with ROLAP systems by:
  - Scanning the table and loading it into an array.
  - Computing the CUBE on the array.
  - Converting the results back into tables.
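The three steps can be sketched as one hypothetical function, with a dict standing in for the sparse array and sum as the assumed aggregate (only a single group-by is shown):

```python
def cube_via_array(rows):
    """MOLAP-for-ROLAP sketch: load relational rows into an array,
    aggregate there, and convert the result back to rows. Each input
    row is (*dimension_values, measure); only the group-by on the
    first dimension is computed here."""
    # 1. Scan the table and load into an array (position = dim values).
    array = {}
    for *coords, measure in rows:
        key = tuple(coords)
        array[key] = array.get(key, 0) + measure
    # 2. Compute a group-by on the array (aggregate out all but dim 0).
    agg = {}
    for coords, v in array.items():
        agg[coords[0]] = agg.get(coords[0], 0) + v
    # 3. Convert results back into rows.
    return sorted(agg.items())
```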
27. MOLAP for ROLAP (cont.)
- The results show that even with the additional cost of converting between data structures, the MOLAP algorithm runs faster than computing the CUBE directly on the ROLAP tables, and it scales much better.
- In this scheme, the multidimensional array is used as a query-evaluation data structure rather than as a persistent storage structure.
28. Summary
- The multidimensional array of MOLAP should be chunked and compressed.
- The Single-Pass Multi-Way Array method is a simultaneous algorithm that allows the CUBE to be calculated with a single pass over the data.
- By minimizing the overlap in prefixes and sorting dimensions in order of increasing size, we can build an MMST that gives a plan for computing the CUBE.
29. More Summary
- On MOLAP systems, the CUBE is calculated much faster than on ROLAP systems.
- The most surprising (and useful) result is that the MOLAP algorithm is so much faster that it can be used on ROLAP systems as an intermediate step in computing the CUBE.
30. Caching Multidimensional Queries Using Chunks
- P. Deshpande, K. Ramasamy, A. Shukla, J. Naughton
31. Caching
- Caching is very important in OLAP systems, since the queries are complex and must be answered quickly.
- Previous work in caching dealt with table-level caching and query-level caching.
- This paper proposes another level of granularity: chunks.
32. Chunk-Based Caching
- Benefits:
  - Frequently accessed chunks will stay in the cache.
  - A new query need not be contained within a cached query to benefit from the cache.
33. More on Chunking
- More benefits:
  - Closure property of chunks: we can aggregate chunks at one level to obtain chunks at different levels of aggregation.
  - Less redundant storage leads to a better cache hit ratio.
34. Chunking the Multidimensional Space
- To divide the space into chunks, the distinct values along each dimension have to be divided into ranges.
- Hierarchical locality in OLAP queries suggests that ordering by hierarchy level is the best option.
35. Ordering on Dimensions
36. Chunk Ranges
- Uniform chunk ranges do not work well with hierarchical data.
37. Hierarchical Chunk Ranges
38. Caching Query Results
- When a new query is issued, the chunks needed to answer the query are determined.
- The list of chunks is broken into two parts:
  - Relevant chunks found in the cache.
  - Missing chunks that have to be computed from the backend.
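A minimal sketch of this split (names are illustrative; the cache is any set-like collection of chunk ids):

```python
def plan_query(needed_chunks, cache):
    """Split the chunks a query needs into cache hits, answered
    locally, and misses that must be fetched from the backend."""
    hits = [c for c in needed_chunks if c in cache]
    misses = [c for c in needed_chunks if c not in cache]
    return hits, misses
```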
39. Chunked File Organization
- The cost of a chunk miss can be reduced by organizing data in chunks at the backend.
- One possible method is to use multidimensional arrays, but these require changing the data structures a great deal and may result in the loss of relational access to the data.
40. Chunk Indexing
- A chunk index is created so that, given a chunk number, it is possible to access all tuples in that chunk.
- The chunked file has two interfaces: a relational interface for SQL statements, and a chunk-based interface for direct access to chunks.
41. Query Processing
- How to determine whether a cached chunk can be used to answer a query:
  - Level of aggregation: cached chunks at the same level are used.
  - Condition clause: selections on non-group-by predicates must match exactly.
42. Implementation of Chunked Files
- Add a new chunked file type to the backend database.
- Add a level of abstraction:
  - Add a new attribute for the chunk number.
  - Sort based on chunk number.
  - Create a chunk index with a B-tree on the chunk number.
43. Replacement Schemes
- LRU is not viable, because chunks at different levels have different costs.
- The benefit of a chunk is measured by the fraction of the base table it represents.
- Use the benefits of chunks as weights when deciding which chunk to replace in the cache.
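A sketch of benefit-weighted eviction under these assumptions (the cache layout and names are mine, not from the paper): evict the lowest-benefit chunks first until enough space is freed.

```python
def evict(cache, needed_space):
    """Pick victims from a cache mapping chunk id -> (size, benefit),
    where benefit is the fraction of the base table the chunk
    represents. Lowest-benefit chunks go first."""
    victims = []
    freed = 0
    for cid, (size, benefit) in sorted(cache.items(), key=lambda kv: kv[1][1]):
        if freed >= needed_space:
            break
        victims.append(cid)
        freed += size
    return victims
```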
44. Cost Saving Ratio
- Defined as the percentage of the total cost of the queries saved due to hits in the cache.
- Better than a plain hit ratio, since chunks at different levels have different benefits.
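Under the definition above, the ratio is straightforward to compute; the per-query cost accounting here is an assumption for illustration:

```python
def cost_saving_ratio(queries):
    """Cost saving ratio: fraction of the total query cost that was
    saved by cache hits. `queries` is a list of
    (total_cost, cost_saved_by_cache) pairs."""
    total = sum(t for t, _ in queries)
    saved = sum(s for _, s in queries)
    return saved / total
```

Unlike a hit ratio, a hit on an expensive low-level chunk counts for more than a hit on a cheap highly aggregated one.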
45. Summary
- Chunk-based caching allows fine granularity and lets queries be partially answered from the cache.
- A chunked file organization can reduce the cost of a chunk miss with minimal implementation cost.
- Performance depends heavily on choosing the right chunk ranges and a good replacement policy.