Title: ITCS 6163
1. ITCS 6163
2. Two Problems
- Which cuboids should be materialized?
  - Ullman et al., SIGMOD '96 paper
- How to efficiently compute the cube?
  - Agrawal et al., VLDB '96 paper
3. Implementing Data Cubes Efficiently
DSS queries must be answered fast: 1 min is OK, seconds are great, > 1 min is NOT ACCEPTABLE!
One solution: materialize frequently asked queries (or supersets of them). Picking the right set is difficult (O(2^n) candidate sets).
4. What to Materialize?
- Nothing: pure ROLAP.
- Everything: too costly.
- Only part: the key idea is that many cells are computable from other cells.
(Lattice fragment: ABC, with children AB, AC, BC)
5. Dependent and Independent Cells
Example: TPC-D benchmark (supplier, part, customer, sales). Q = (find all sales for each part) is dependent on Q' = (find all sales for each part and each supplier): (p, all, all) ≼ (p, s, all) (dependency), and (p, all, all) = Σ_{i,j} (p, s_i, c_j).
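The dependency above can be sketched in a few lines of Python: the coarser cell (p, all, all) is just a sum over the finer (p, s_i, c_j) cells. The cell values here are made up for illustration.

```python
# Hypothetical toy data: cells of the finest cuboid (part, supplier, customer) -> sales.
psc = {
    ("p1", "s1", "c1"): 10,
    ("p1", "s1", "c2"): 5,
    ("p1", "s2", "c1"): 7,
    ("p2", "s1", "c1"): 3,
}

def roll_up_to_part(psc_cells):
    """Answer (p, all, all) by summing over all suppliers and customers."""
    p_all_all = {}
    for (p, _s, _c), sales in psc_cells.items():
        p_all_all[p] = p_all_all.get(p, 0) + sales
    return p_all_all

print(roll_up_to_part(psc))  # {'p1': 22, 'p2': 3}
```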
6. Example
Cuboid sizes in the PSC lattice:
1. PSC: 6 million cells
2. PC: 6 million cells
3. PS: 0.8 million cells
4. SC: 6 million cells
5. P: 0.2 million cells
6. S: 0.05 million cells
7. C: 0.1 million cells
8. ALL: 1 cell
(Figure: the cube lattice — PSC at the top; PC, PS, SC below it; P, S, C below those; ALL at the bottom.)
7. Example (2)
We want to answer (p, all, all) (sales grouped by part).
a) If we have cuboid 5 (P) materialized, we just answer from it.
b) If we have cuboid 2 (PC), we compute the answer from its 6 million cells. Do they fit in RAM?
Cost of a): 0.2 M. Cost of b): 6 M.
8. Decisions, Decisions...
How many views must we materialize to get good performance? Given space S (on disk), which views do we materialize? In the previous example we'd need space for 19 million cells. Can we do better? Avoid going to the raw (fact table) data:
- PSC (6 M)
- PC (6 M) can be answered using PSC (6 M): no advantage
- SC (6 M) can be answered using PSC (6 M): no advantage
9. Example Again
1. PSC → 6M: materialize (6M)
2. PC → 6M: answer from PSC
3. PS → 0.8M: materialize (0.8M)
4. SC → 6M: answer from PSC
5. P → 0.2M: materialize (0.2M)
6. S → 0.01M: materialize (0.01M)
7. C → 0.1M: materialize (0.1M)
Total: 7.11M vs. 19M (about the same performance)
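The storage arithmetic on this slide can be checked directly, using the sizes listed here (in millions of cells):

```python
# Sizes in millions of cells, as on this slide; PC and SC are answered from PSC
# instead of being materialized.
sizes = {"PSC": 6, "PC": 6, "PS": 0.8, "SC": 6, "P": 0.2, "S": 0.01, "C": 0.1}
materialized = ["PSC", "PS", "P", "S", "C"]

total = sum(sizes[v] for v in materialized)
print(round(total, 2))  # 7.11 (million cells), vs. ~19 M for materializing everything
```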
10. Formal Treatment
Q1 ≼ Q2 (dependency); Q(P) ≼ Q(PC) ≼ Q(PSC) (lattice). Add hierarchies:
- Customers: C (customers) → N (nation-wide cust., e.g., USA, Japan) → ALL (all cust.)
- Suppliers: S (suppliers) → SN (nation-wide) → DF (domestic-foreign) → ALL
- Parts: P (parts) → Sz (size) → ALL, and P → Ty (type) → ALL
11. Formal Treatment (2)
Lattice of customer-hierarchy × part-hierarchy combinations, with sizes:
CP (6M), C Ty (5.99M), C Sz (5M), NP (5M), N Ty (3,750), N Sz (1,250), C (0.1M), P (0.2M), Ty (150), Sz (50), N (25), ALL (1)
12. Formal Treatment (3)
Cost model: Cost(Q) = number of cells in Qa, the materialized view used to answer Q (Q ≼ Qa). With indexes we can make this better! How do we know the size of the views? Sampling.
13. Optimizing Data Cube Lattices
First problem (no space restriction): a VERY HARD problem (NP-complete). Heuristic: always include the core cuboid. At every step you have a set Sv of materialized views. Compute the benefit of a view v relative to Sv as follows:
For each w ≼ v define Bw: let u be the view of least cost in Sv such that w ≼ u; if Cost(v) < Cost(u), then Bw = Cost(v) − Cost(u) (negative), else Bw = 0.
Define B(v, Sv) = − Σ_{w ≼ v} Bw.
14. Greedy Algorithm
Sv = {core view}
for i = 1 to k begin
  select v not in Sv such that B(v, Sv) is maximum
  Sv = Sv ∪ {v}
end
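The greedy heuristic above can be sketched in Python. This is a minimal illustration, not an optimized implementation: it uses the PSC lattice sizes from the earlier example, and `descendants` encodes which queries each view can answer (itself and everything below it in the lattice).

```python
# Sizes (in cells) from the PSC example slide.
sizes = {"PSC": 6_000_000, "PC": 6_000_000, "PS": 800_000, "SC": 6_000_000,
         "P": 200_000, "S": 50_000, "C": 100_000, "ALL": 1}

# For each view, the set of queries it can answer (its lattice descendants).
descendants = {
    "PSC": {"PSC", "PC", "PS", "SC", "P", "S", "C", "ALL"},
    "PC": {"PC", "P", "C", "ALL"},
    "PS": {"PS", "P", "S", "ALL"},
    "SC": {"SC", "S", "C", "ALL"},
    "P": {"P", "ALL"}, "S": {"S", "ALL"}, "C": {"C", "ALL"}, "ALL": {"ALL"},
}

def cost(q, materialized):
    """Cost of answering q = size of the cheapest materialized ancestor."""
    return min(sizes[v] for v in materialized if q in descendants[v])

def benefit(v, materialized):
    """Total cost saved on v's descendants if v were materialized."""
    return sum(max(0, cost(w, materialized) - sizes[v]) for w in descendants[v])

def greedy(k):
    chosen = {"PSC"}                      # always include the core cuboid
    for _ in range(k):
        best = max((v for v in sizes if v not in chosen),
                   key=lambda v: benefit(v, chosen))
        chosen.add(best)
    return chosen

print(sorted(greedy(2)))  # ['C', 'PS', 'PSC']
```

With these sizes, the first pick is PS (it cheaply answers P, S, ALL, and itself) and the second is C, which is the only small view covering the C group-by.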
15. A Simple Example
16. MOLAP Example
- Hyperion's Essbase
  - www.hyperion.com to download a white paper and product demo.
- Builds a special secondary-memory data structure to store the cells of the core cuboid.
- Assumes that data is sparse and clustered along some dimension combinations.
- Chooses dense dimension combinations; the rest are sparse combinations.
17. Structures
- Two levels:
- Blocks in the first level correspond to the dense dimension combinations. The basic block has size proportional to the product of the cardinalities of these dimensions. Each entry in the block points to a second-level block.
- Blocks in the second level correspond to the sparse dimensions. They are arrays of pointers, as many as the product of the cardinalities of the sparse dimensions. Each pointer has one of three values: null (non-existent data), impossible (non-allowed combination), or a pointer to an actual data block.
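A rough sketch of this two-level layout (this is an illustration of the idea, not Essbase's actual on-disk format): an upper-level array with one entry per dense combination, each entry lazily pointing to a sparse pointer array whose slots are null, impossible, or a data block.

```python
NULL, IMPOSSIBLE = "null", "impossible"

class CubeStore:
    """Toy two-level store: dense upper level, sparse lower level."""

    def __init__(self, dense_card, sparse_card):
        self.sparse_card = sparse_card        # product of sparse cardinalities
        # Upper level: one slot per dense dimension combination.
        self.upper = [None] * dense_card

    def insert(self, dense_idx, sparse_idx, value):
        if self.upper[dense_idx] is None:     # create second-level block lazily
            self.upper[dense_idx] = [NULL] * self.sparse_card
        block = self.upper[dense_idx]
        if block[sparse_idx] == IMPOSSIBLE:
            raise ValueError("non-allowed combination")
        if block[sparse_idx] == NULL:         # empty pointer: create a data block
            block[sparse_idx] = []
        block[sparse_idx].append(value)       # insert into the existing data block

# e.g., 2 departments x 4 quarters dense, 4 sparse combinations (made-up numbers)
store = CubeStore(dense_card=8, sparse_card=4)
store.insert(dense_idx=0, sparse_idx=2, value=42.0)
print(store.upper[0][2])  # [42.0]
```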
18. Data Example
Departments will generally have data for each time period (so the two form the dense dimension combination). Geographical information, Product, and Distribution channels, on the other hand, are typically sparse (e.g., most cities have only one distribution channel and only some product values).
Dimensions: Departments (Sales, Mkt), Time, Geographical information, Product, Distribution channels.
19. Structures Revisited
(Figure: upper-level dense blocks (S,1Q), (S,2Q), (S,3Q), (S,4Q), (M,1Q), (M,2Q), (M,3Q), (M,4Q) — one per Department × Quarter combination — each pointing to a lower-level sparse array over Geo. × Product × Dist.)
20. Allocating Memory
Define the member structure (e.g., dimensions). Select the dense dimension combinations and create the upper-level structure. Create the lower-level structure. For each input data cell: if the pointer to the data block is empty, create a new data block; else insert the data into the existing data block.
21. Problem 2: COMPUTING DATA CUBES
- Four algorithms
- PIPESORT
- PIPEHASH
- SORT-OVERLAP
- Partitioned-cube
22. Optimizations
- Smallest-parent: AB can be computed from ABC, ABD, or ABCD. Which one should we use?
- Cache-results: having computed ABC, we compute AB from it while ABC is still in memory.
- Amortize-scans: we may try to compute ABC, ACD, ABD, and BCD in one scan of ABCD.
- Share-sorts
- Share-partitions
23. PIPESORT
Input: cube lattice and cost matrix. Each edge e_ij in the lattice is annotated with two costs:
S(i,j) = cost of computing j from i when i is not sorted
A(i,j) = cost of computing j from i when i is sorted
Output: a subgraph of the lattice where each cuboid (group-by) is connected to a single parent from which it will be computed, and is associated with an attribute order in which it will be sorted. If the order is a prefix of the order of its parent, then the child can be computed without sorting the parent (cost A); otherwise the parent has to be sorted (cost S). For every parent there will be only one out-edge labeled A.
24. PIPESORT (2)
Algorithm: proceeds in levels, k = 0, …, N−1 (N = number of dimensions). For each level, it finds the best way of computing level k from level k+1 by reducing the problem to a weighted bipartite matching problem. Make k additional copies of each level-(k+1) group-by (each node then has k+1 vertices) and connect them to the same children as the original. The edges from the original have A costs, while the edges from the copies have S costs. Find the minimum-cost matching in the bipartite graph (each vertex in level k+1, original or copy, is matched with at most one vertex in level k).
25. Example
26. Transformed Lattice
(Figure: bipartite graph between level-1 vertices A, B, C and level-2 vertices with their copies — AB(2)/AB(10), AC(5)/AC(12), BC(13)/BC(20) — where the first number of each pair is the A cost on the original and the second the S cost on its copy.)
27. Explanation of Edges
Consider vertex A with parent vertices AB(2) and AB(10). The S edge with cost 10 means that we really have BA (we need to sort it to get A); the A edge with cost 2 means we have AB in sorted order (no need to sort).
28. PIPESORT Pseudo-Algorithm
Pipesort(input: lattice with A() and S() edge costs)
for level k = 0 to N−1:
  Generate_plan(k+1)
  for each cuboid g in level k+1:
    fix the sort order of g as the order of the cuboid connected to g by an A edge
29. Generate_plan
Generate_plan(k+1):
  make k additional copies of each level-(k+1) cuboid
  connect each copy to the same set of vertices as the original
  assign cost A to the original edges and cost S to the copies' edges
  find a minimum-cost matching on the transformed graph
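For a tiny instance, the minimum-cost matching can simply be brute-forced. The sketch below uses the costs from the transformed-lattice example (AB: A=2/S=10, AC: 5/12, BC: 13/20) and simplifies the copy mechanism to the constraint that matters at this size: each parent's A edge can serve at most one child, while its S copies serve the rest. A real implementation would use a proper assignment algorithm.

```python
from itertools import product

children = ["A", "B", "C"]
parents = {"AB": ("A", "B"), "AC": ("A", "C"), "BC": ("B", "C")}
A_cost = {"AB": 2, "AC": 5, "BC": 13}     # parent already in a useful sort order
S_cost = {"AB": 10, "AC": 12, "BC": 20}   # parent must be re-sorted

def plan_cost(assignment):
    """assignment: one (parent, edge_type) pair per child, in `children` order.
    Returns total cost, or None if the assignment is infeasible."""
    total, a_used = 0, set()
    for child, (parent, edge) in zip(children, assignment):
        if child not in parents[parent]:
            return None                    # this parent cannot compute this child
        if edge == "A":
            if parent in a_used:           # only one A out-edge per parent
                return None
            a_used.add(parent)
            total += A_cost[parent]
        else:
            total += S_cost[parent]
    return total

options = [(p, e) for p in parents for e in ("A", "S")]
best = min((a for a in product(options, repeat=3) if plan_cost(a) is not None),
           key=plan_cost)
print(plan_cost(best))  # 17
```

Here the optimum (cost 17) lets AB serve both A and B (one via its A edge, one via a sorted copy) and computes C from AC's A edge.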
30. Example
31. PipeHash
Input: lattice. PipeHash chooses for each vertex the parent with the smallest estimated size. The outcome is a minimum spanning tree (MST), where each vertex is a cuboid and an edge from i to j shows that i is the smallest parent of j. Available memory is usually not enough to compute all the cuboids in the MST together, so we need to decide which cuboids can be computed together (a sub-MST), when to allocate and deallocate memory for the different hash tables, and which attribute to use for partitioning the data.
32. PipeHash (2)
Input: lattice and estimated sizes of the cuboids
Initialize the worklist with the MST of the search lattice
While the worklist is not empty:
  pick a tree T from the worklist
  T' = Select-subtree(T) to be executed next
  Compute-subtree(T')
33. Select-subtree
Select-subtree(T): if the memory required by T is less than what is available, return T. Else, let S be the set of attributes in root(T). For any s ∈ S we get a subtree Ts of T, also rooted at root(T), including all cuboids that contain s. Let Ps = the maximum number of partitions of root(T) possible if partitioned on s. Choose s such that mem(Ts)/Ps < available memory and Ts is the largest such subtree over all choices from S. Remove Ts from T (put T − Ts in the worklist).
34. Compute-subtree
Compute-subtree(T): numP = mem(T) × f / available memory (f a fudge factor). Partition the root of T into numP partitions. For each partition of root(T): for each node n in T, compute all children of n in one scan; if n is cached, save it to disk and release the memory occupied by its hash table.
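The core step PipeHash relies on — computing several children of a cuboid in a single scan, one hash table per child — can be sketched as follows. The tuples and attribute positions are made up for illustration; the measure is the last field of each tuple.

```python
from collections import defaultdict

# Parent cuboid ABC: (A-value, B-value, C-value, measure).
parent_abc = [("a1", "b1", "c1", 4), ("a1", "b1", "c2", 6),
              ("a1", "b2", "c1", 1), ("a2", "b1", "c1", 3)]

def compute_children(parent, children):
    """children: list of attribute-position tuples, e.g. (0, 1) for AB.
    Builds one hash table per child group-by in a single scan of the parent."""
    tables = {c: defaultdict(int) for c in children}
    for row in parent:                      # one scan of the parent cuboid
        *key, measure = row
        for c in children:                  # update every child's hash table
            tables[c][tuple(key[i] for i in c)] += measure
    return tables

tables = compute_children(parent_abc, [(0, 1), (0, 2)])   # AB and AC from ABC
print(dict(tables[(0, 1)]))  # {('a1', 'b1'): 10, ('a1', 'b2'): 1, ('a2', 'b1'): 3}
```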
35. Example
36. Remaining Subtrees
(Figure: lattice — ABCD; ABC, BCD; AB, BC, CD, BD; A, B, C, D; all.)
37. OVERLAP
Sorted runs: consider a cuboid on j attributes A1, A2, …, Aj; we use B = (A1, A2, …, Aj) to denote the cuboid sorted in that order. Consider S = (A1, …, A(l−1), A(l+1), …, Aj), computed using the one before. A sorted run R of S in B is defined as R = π_S(Q), where Q is a maximal sequence of tuples of B such that for each tuple in Q, the first l columns have the same value.
38. Sorted Run
B = (a,1,2), (a,1,3), (a,2,2), (b,1,3), (b,3,2), (c,3,1)
S = first and third attributes: (a,2), (a,3), (b,3), (b,2), (c,1)
Sorted runs: {(a,2),(a,3)}, {(a,2)}, {(b,3)}, {(b,2)}, {(c,1)}
39. Partitions
B and S have a common prefix (A1, …, A(l−1)). A partition of the cuboid S in B is the union of sorted runs such that the first l−1 columns of all the tuples of the sorted runs have the same values:
{(a,2),(a,3)}, {(b,2),(b,3)}, {(c,1)}
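Both definitions can be reproduced on the slide's data in a few lines. Here B is sorted on (A1, A2, A3) and S drops the middle attribute, so l = 2 (1-based): sorted runs group B tuples agreeing on the first 2 columns, and partitions merge runs agreeing on the first 1.

```python
from itertools import groupby

B = [("a", 1, 2), ("a", 1, 3), ("a", 2, 2), ("b", 1, 3), ("b", 3, 2), ("c", 3, 1)]
l = 2  # 1-based index of the attribute dropped when computing S from B

# Sorted runs: project maximal groups of B tuples agreeing on the first l columns
# onto the remaining attributes (first and third).
runs = [[(t[0], t[2]) for t in g] for _, g in groupby(B, key=lambda t: t[:l])]
print(runs)  # [[('a', 2), ('a', 3)], [('a', 2)], [('b', 3)], [('b', 2)], [('c', 1)]]

# Partitions: union of sorted runs agreeing on the first l-1 columns.
partitions = [sorted({t for run in g for t in run})
              for _, g in groupby(runs, key=lambda run: run[0][:l - 1])]
print(partitions)  # [[('a', 2), ('a', 3)], [('b', 2), ('b', 3)], [('c', 1)]]
```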
40. OVERLAP
Sort the base cuboid: this forces the sorted order in which the other cuboids are computed.
41. OVERLAP (2)
If there is enough memory to hold all the cuboids, compute them all (very seldom true). Otherwise, use the partition as the unit of computation: we just need sufficient memory to hold one partition. As soon as a partition is computed, its tuples can be pipelined to compute descendant cuboids (same partition) and then written to disk. The memory is then reused to compute the next partition.
Example: XYZ → XZ
XYZ = (a,1,2), (a,1,3), (a,2,2), (b,1,3), (b,3,2), (c,3,1)
Partitions of XZ: {(a,2),(a,3)}, {(b,2),(b,3)}, {(c,1)}
42. OVERLAP (3)
Choose a parent from which to compute each cuboid (a DAG). Goal: minimize the size of the partitions of a cuboid, so less memory is needed. E.g., it is better to compute AC from ACD than from ABC (since the sort order matches and the partition size is 1). This is a hard problem. Heuristic: maximize the size of the common prefix.
43. OVERLAP (4)
Choose a set of cuboids for overlapped computation, subject to the memory constraints. To compute a cuboid in memory, we need memory equal to the size of its partition. Partition sizes can be estimated from cuboid sizes by assuming some distribution (uniform?). If this much memory can be spared, the cuboid is marked as being in Partition state. Other cuboids are allocated a single page (for temporary results); these cuboids are in SortRun state. A cuboid in Partition state can have its tuples pipelined for the computation of its descendants. A cuboid can be considered for computation if it is the root, or if its parent is marked as being in Partition state. The total memory allocated to all cuboids cannot exceed the available memory.
44. OVERLAP (5)
Again, a hard problem. Heuristic: traverse the tree in BFS manner.
45. OVERLAP (6)
Computing a cuboid from its parent.
Output: the sorted cuboid S
foreach tuple τ of B do
  if (state == Partition) then process_partition(τ)
  else process_sorted_run(τ)
46. OVERLAP (7)
Process_partition:
- If the input tuple starts a new partition, output the current partition at the end of the cuboid and start a new one.
- If the input tuple matches an existing tuple in the partition, update the aggregate.
- Else, insert the input tuple into the partition as a new aggregate.
Process_sorted_run:
- If the input tuple starts a new sorted run, flush all the pages of the current sorted run and start a new one.
- If the input tuple matches the last tuple in the sorted run, recompute the aggregate.
- Else, append the tuple to the end of the existing run.
47. Observations
In ABCD → ABC, the partition size is 1. Why? In ABCD → ABD, the partition size equals the number of distinct C values. Why? In ABCD → BCD, the partition size is the size of the cuboid BCD. Why?
48. Running Example with 25 Pages
49. Other Issues
- Iceberg cube: contains only aggregates above a certain threshold.
  - Jiawei Han's SIGMOD '01 paper