Title: Chapt. 7 Multidimensional Hierarchical Clustering
1 Chapt. 7 Multidimensional Hierarchical Clustering
Fig. 3.1 Hierarchies in the Juice and More schema
2 (b)
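3 Size of completely aggregated cube:
$$\frac{(6 \cdot 9 \cdot 20 \cdot 11)(9 \cdot 8 \cdot 3 \cdot 8)(6 \cdot 4)(4 \cdot 13)}{(5 \cdot 8 \cdot 19 \cdot 10)(8 \cdot 7 \cdot 2 \cdot 7)(5 \cdot 3)(3 \cdot 12)} = \frac{4 \cdot 6 \cdot 6 \cdot 9 \cdot 11 \cdot 13}{5 \cdot 5 \cdot 7 \cdot 7 \cdot 19} = \frac{185{,}328}{23{,}275} \approx 7.96$$
i.e. 7.96 times larger than the base cube
Base cube has 2,145,024,000 cells × 4 B ≈ 9 GB
Number of available facts: 26 million
4 Sparsity: $\frac{26 \cdot 10^6}{2.145 \cdot 10^9} \approx 0.0121$, i.e. 1.21% of the cells are occupied and $100\% - 1.21\% = 98.79\%$ sparsity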
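The size ratio of the completely aggregated cube (slide 3) can be re-derived with exact rational arithmetic. A minimal sketch, assuming that the factors on the slide are the branching degrees per hierarchy level and that complete aggregation adds one extra value (ALL) per level; the grouping into four lists and the name `FANOUTS` are choices made here, not part of the slides.

```python
from fractions import Fraction

# Branching degrees per hierarchy level, one list per dimension, as read from
# the slide (assumption: one list per parenthesized group of factors).
FANOUTS = [
    [5, 8, 19, 10],
    [8, 7, 2, 7],
    [5, 3],
    [3, 12],
]

def complete_aggregation_ratio(fanouts):
    """Size of the fully aggregated cube relative to the base cube.

    Every level with n values contributes n + 1 values (the n values plus ALL).
    """
    ratio = Fraction(1)
    for levels in fanouts:
        for n in levels:
            ratio *= Fraction(n + 1, n)
    return ratio

ratio = complete_aggregation_ratio(FANOUTS)
print(ratio, round(float(ratio), 2))
```

`Fraction` keeps the result exact, so the reduced fraction can be compared directly with the one on the slide.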
5 Hierarchically aggregated cube:
(1 + 5 + 40 + 760 + 7600) = 8406
× (1 + 8 + 56 + 112 + 784) = 961
× (1 + 5 + 15) = 21
× (1 + 3 + 24) = 28
Product: 4,749,961,608
Size of base cube: 2,145,024,000
Number of aggregate cells: 2,604,937,608
=> The Juice and More database has about 100 times more hierarchically aggregated cells than occupied base cells!
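The cell counts can be recomputed from the level sizes. A minimal sketch, assuming each parenthesized sum lists the number of members per hierarchy level, from the root ALL down to the leaves; `LEVEL_SIZES`, `FACTS` and `cube_sizes` are names introduced here for illustration.

```python
from math import prod

# Members per hierarchy level (root ALL first, leaves last), one list per dimension.
LEVEL_SIZES = [
    [1, 5, 40, 760, 7600],
    [1, 8, 56, 112, 784],
    [1, 5, 15],
    [1, 3, 24],
]
FACTS = 26_000_000  # number of available facts (occupied base cells)

def cube_sizes(level_sizes):
    """Return (all cells, base cells, aggregate cells) of the hierarchical cube."""
    total = prod(sum(levels) for levels in level_sizes)      # any level per dimension
    base = prod(levels[-1] for levels in level_sizes)        # leaf level only
    return total, base, total - base

total_cells, base_cells, aggregate_cells = cube_sizes(LEVEL_SIZES)
print(f"total cells:     {total_cells:,}")
print(f"base cells:      {base_cells:,}")
print(f"aggregate cells: {aggregate_cells:,}")
print(f"occupied share of base cube: {FACTS / base_cells:.4f}")
print(f"aggregate cells per fact:    {aggregate_cells / FACTS:.1f}")
```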
6 Star-Joins
Restrictions on several dimension tables, which are then joined with the fact table.
In addition: grouping, computation of aggregates, sorting of results.
Example:
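As a stand-in illustration (not the slide's own example), a star join can be sketched with Python's built-in `sqlite3` module. The tables `customer`, `product` and `fact`, their columns, and all rows are hypothetical; only the pattern follows the slide: restrict the dimension tables, join them with the fact table, then group, aggregate and sort.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, region TEXT, nation TEXT);
CREATE TABLE product  (prod_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact     (cust_id INTEGER, prod_id INTEGER, sales REAL);
INSERT INTO customer VALUES (1, 'Asia', 'Japan'), (2, 'Asia', 'India'), (3, 'Europe', 'Italy');
INSERT INTO product  VALUES (10, 'Apple Juice'), (11, 'Orange Juice');
INSERT INTO fact VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 10, 3.0), (3, 10, 9.0), (2, 10, 4.0);
""")

# Restrictions on two dimension tables, join with the fact table,
# then grouping, aggregation and sorting of the result.
STAR_JOIN = """
SELECT c.nation, SUM(f.sales) AS total
FROM fact f
JOIN customer c ON c.cust_id = f.cust_id
JOIN product  p ON p.prod_id = f.prod_id
WHERE c.region = 'Asia' AND p.category = 'Apple Juice'
GROUP BY c.nation
ORDER BY total DESC
"""
rows = conn.execute(STAR_JOIN).fetchall()
print(rows)
```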
8 Key Question
- How to compute star-joins efficiently?
- Secondary indexes on foreign keys of the fact table (standard B-trees), see chapter 5 for details
  - intersect result lists
  - retrieve tuples from the fact table randomly
- Bitmaps
9 Bitmap Index Intersection (figure: 5 pages with 10 tuples each)
  Page 1     Page 2     Page 3     Page 4     Page 5
bitmap for organization TM, 34% of tuples:
  1.....1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1.
bitmap for region Asia, 32% of tuples:
  11.1...... 1.11.....1 .1.1..1... 1.1.1..... .1..1.1...
result of bitmap intersection, 10% of tuples:
  1......... 1.1....... ......1... .......... ....1.....
accessed disk pages (shaded): 80% of pages
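The percentages in the figure can be checked by intersecting the two bitmaps. A minimal sketch, assuming 10 tuples per page as drawn; the helper names are introduced here, and the bitmaps are plain Python lists rather than a real bitmap index.

```python
PAGE_SIZE = 10  # tuples per page in the figure

ORG_TM      = "1.....1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1."
REGION_ASIA = "11.1...... 1.11.....1 .1.1..1... 1.1.1..... .1..1.1..."

def to_bits(text):
    """Parse the figure notation: '1' is a set bit, '.' a cleared bit."""
    return [ch == "1" for ch in text if ch in "1."]

def intersect(a, b):
    """Bitwise AND of two bitmaps of equal length."""
    return [x and y for x, y in zip(a, b)]

def pages_touched(bits, page_size=PAGE_SIZE):
    """1-based numbers of the pages that contain at least one hit."""
    return [p + 1 for p in range(len(bits) // page_size)
            if any(bits[p * page_size:(p + 1) * page_size])]

org, region = to_bits(ORG_TM), to_bits(REGION_ASIA)
hits = intersect(org, region)
pages = pages_touched(hits)

print(f"organization: {sum(org) / len(org):.0%} of tuples")
print(f"region:       {sum(region) / len(region):.0%} of tuples")
print(f"intersection: {sum(hits) / len(hits):.0%} of tuples")
print(f"pages accessed: {pages}, {len(pages) / (len(hits) // PAGE_SIZE):.0%} of pages")
```

Only 5 of 50 tuples qualify, yet 4 of the 5 pages have to be read, because the hits are scattered over the pages.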
10 Problem: for small result sets of a few %, almost all pages of the fact table must be fetched from disk if the hits in the result set are not clustered on disk.
Ex.: with 8 KB pages there are 20 to 400 tuples per page, i.e. at 0.25% to 5% hits in the result almost all pages must be fetched.
At least tuple clustering, preferably page clustering, is desirable, but how?
Goal: code hierarchies in such a way that for star-joins with the fact table we only have to join with a query box on the fact table.
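How quickly scattered hits spread over the pages can be estimated with a simple model. A minimal sketch, assuming that every tuple is a hit independently with probability equal to the selectivity (an assumption made here, not a statement from the slides); under it, a page with n tuples contains at least one hit with probability 1 - (1 - s)^n.

```python
def expected_page_fraction(selectivity, tuples_per_page):
    """Expected fraction of pages holding at least one hit, for unclustered hits."""
    return 1.0 - (1.0 - selectivity) ** tuples_per_page

for tuples_per_page in (20, 400):
    for selectivity in (0.0025, 0.05):
        share = expected_page_fraction(selectivity, tuples_per_page)
        print(f"{tuples_per_page:3d} tuples/page, {selectivity:.2%} hits: "
              f"{share:.1%} of pages fetched")
```

With one expected hit per page the model already fetches well over half of the pages, and with a few hits per page practically all of them, which is why clustering the hits matters.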
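11 Basic Idea for Multidimensional Clustering
Example Hierarchy in Member Set Representation (figure): level All with the root All Products; level Product Category with Apple Juice and Orange Juice; below them the package sizes 0.33 L, 0.5 L, 0.7 L and 1 L. The children of each node are numbered 0, 1, 2.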
12 Dimension $D$ consists of:
Value set $V = \{v_1, v_2, \dots, v_n\}$
Hierarchy $H$ of height $h$, consisting of $h+1$ hierarchy levels: $H = \{L_0, L_1, \dots, L_h\}$
Level $L_i$ is a set of sets $\{m_1^i, \dots, m_j^i\}$ with $m_k^i \subseteq V$
The $m_k^i$ get names, e.g. Orange Juice as $label(m_1^1)$, in general $label(m_k^i)$
Constraint: every $m_l^{i+1}$ must be a subset of some $m_k^i$
13 Hierarchic Relationships
The children of $m_k^i$ are all those sets $m_l^{i+1}$ of the lower level $i+1$ with the property $m_l^{i+1} \subseteq m_k^i$, formally:
$children(m_k^i) = \{ m_l^{i+1} \in L_{i+1} \mid m_l^{i+1} \subseteq m_k^i \}$
$parent(m_k^i) = \{ m_l^{i-1} \in L_{i-1} \mid m_l^{i-1} \supseteq m_k^i \}$
Principle: the children of $m$ are numbered by the bijective function $ord_m$, starting at 1 or 0
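The member set definitions translate almost literally into set operations. A minimal sketch, assuming a small hypothetical value set of juice packages (the names in `V` are illustrative); `children` and `parent` follow the subset definitions, and `ord_m` numbers children from 0 in list order, which is one possible bijection.

```python
# Hypothetical value set V: product/package combinations.
V = frozenset({"AJ 0.5L", "AJ 1L", "OJ 0.33L", "OJ 0.7L", "OJ 1L"})

# LEVELS[i] is level L_i: a list of member sets, each a subset of V.
LEVELS = [
    [V],                                              # L_0: All
    [frozenset({"AJ 0.5L", "AJ 1L"}),                 # L_1: Apple Juice
     frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L"})],    #      Orange Juice
    [frozenset({v}) for v in sorted(V)],              # L_2: single packages
]

def children(m, i):
    """Member sets of level i+1 that are subsets of m (m lies on level i)."""
    return [c for c in LEVELS[i + 1] if c <= m]

def parent(m, i):
    """Member sets of level i-1 that are supersets of m (empty for the root)."""
    return [] if i == 0 else [p for p in LEVELS[i - 1] if p >= m]

def is_hierarchy(levels):
    """Constraint: every set of level i+1 is a subset of some set of level i."""
    return all(any(c <= m for m in upper)
               for upper, lower in zip(levels, levels[1:]) for c in lower)

def ord_m(m, i):
    """Number the children of m, starting at 0."""
    return {c: n for n, c in enumerate(children(m, i))}

apple = LEVELS[1][0]
print(len(children(apple, 1)), parent(apple, 1) == [V], is_hierarchy(LEVELS))
```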
15 Enumeration and Surrogate Functions
Let $A$ be an enumeration type $A = \{a_0, a_1, \dots, a_k\}$ and
$f : A \to \{0, 1, \dots, k\}$ defined as $f(a_i) = i$;
then $i$ is called the surrogate of $a_i$
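A surrogate function is just the position of a value in its enumeration. A minimal sketch; the enumeration `A` and the helper name `make_surrogate_function` are illustrative choices, not taken from the slides.

```python
def make_surrogate_function(values):
    """Return f with f(a_i) = i and its inverse for the enumeration (a_0, ..., a_k)."""
    f = {a: i for i, a in enumerate(values)}
    if len(f) != len(values):
        raise ValueError("enumeration values must be distinct")
    return f, tuple(values)

A = ("Apple Juice", "Orange Juice")   # illustrative enumeration type
f, f_inverse = make_surrogate_function(A)
print(f["Orange Juice"], f_inverse[0])
```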
16 Hierarchies and Composite Surrogates
Basic idea: concatenate the surrogates of successive hierarchy levels (compound surrogates, cs)
Note: the root ALL of the hierarchy is not encoded
Def.: compound surrogate cs for hierarchy $H$:
$ord_m : children(m) \to \{0, 1, \dots, |children(m)| - 1\}$
$cs(H, m^i) = \begin{cases} ord_{father(m^i)}(m^i) & \text{if } i = 1 \\ cs(H, father(m^i)) \circ ord_{father(m^i)}(m^i) & \text{otherwise} \end{cases}$
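The recursive definition can be sketched directly. The hierarchy fragment below is hypothetical: the placeholder names and the order of siblings are assumptions, chosen only so that the ordinals along North America --> USA --> Retail --> Bar come out as 4, 1, 1, 2, the values used in the chapter's example. Concatenation is modeled as tuple concatenation.

```python
# Ordered children per member; list position = ord_m (starting at 0).
CHILDREN = {
    "ALL": ["Region 0", "Region 1", "Region 2", "Region 3", "North America"],
    "North America": ["Canada", "USA"],
    "USA": ["Wholesale", "Retail"],
    "Retail": ["Restaurant", "Shop", "Bar"],
}
FATHER = {child: m for m, kids in CHILDREN.items() for child in kids}

def ord_in_father(member):
    """ord_father(m)(m): position of the member among its father's children."""
    return CHILDREN[FATHER[member]].index(member)

def cs(member):
    """Compound surrogate as a tuple of ordinals; the root ALL is not encoded."""
    father = FATHER[member]
    if father == "ALL":                      # level 1: just the own ordinal
        return (ord_in_father(member),)
    return cs(father) + (ord_in_father(member),)

print(cs("Bar"))
```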
17 Example
18 Surrogates for Region and the entire Customer Hierarchy
19 Example: the path North America --> USA --> Retail --> Bar has the compound surrogate $4 \circ 1 \circ 1 \circ 2$
Next idea: for every hierarchy level determine the highest branching degree (plus a safety margin for future extensions) and code it with a fixed number of bits.
$surrogates(H, i) = \max \{ cardinality(children(H, m)) \mid m \in level(H, i-1) \}$
20 Let $l_i = \lceil \log_2 surrogates(H, i) \rceil$; then $l_i$ bits are needed for the surrogates of level $i$.
Let $\pi$ be a path $\pi = m^0 \to m^1 \to m^2 \to \dots \to m^h$ to a leaf $m^h$ of hierarchy $H$
21 $cs(H, \pi) = cs(H, m^h)$
...
22 Example: $cs(H, \text{Bar}) = 100\;001\;1\;010_2 = 538$
$l_1 = 3,\ l_2 = 3,\ l_3 = 1,\ l_4 = 3$: number of bits needed at each level
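The fixed-length encoding amounts to shifting each level's ordinal into its bit field. A minimal sketch, assuming the bit widths 3, 3, 1, 3 and the ordinals 4, 1, 1, 2 from the example; `bits_for`, `encode` and `decode` are names introduced here, and `bits_for` computes the ceiling of log2 with integer arithmetic.

```python
BITS = (3, 3, 1, 3)  # l_1 .. l_4 from the example

def bits_for(max_children):
    """Ceiling of log2(max_children) for max_children >= 1, without floats."""
    return (max_children - 1).bit_length()

def encode(ordinals, bits=BITS):
    """Pack the per-level ordinals into one integer, top level in the high bits."""
    value = 0
    for ordinal, width in zip(ordinals, bits):
        if not 0 <= ordinal < (1 << width):
            raise ValueError(f"ordinal {ordinal} does not fit into {width} bits")
        value = (value << width) | ordinal
    return value

def decode(value, bits=BITS):
    """Inverse of encode for a full-length compound surrogate."""
    ordinals = []
    for width in reversed(bits):
        ordinals.append(value & ((1 << width) - 1))
        value >>= width
    return tuple(reversed(ordinals))

bar = encode((4, 1, 1, 2))
print(bar, format(bar, "010b"), decode(bar))
```

Because the top level occupies the most significant bits, comparing the integers gives the same order as comparing the ordinal tuples lexicographically.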
23 Properties of MHC Encoding
- very compact coding of fixed length
- lexicographic order of composite keys remains, i.e. isomorphic to integer ordering
- point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates
24 Example: the path to USA is North America --> USA
$4 = 100_2$, $1 = 001_2$ leads to the range on cs
$100\;001\;0\;000_2$ to $100\;001\;1\;111_2$
and to the decimal range 528 to 543, or [528, 543]
=> a star join with restriction North America.USA leads to an interval restriction on the fact table
=> point restrictions on arbitrary hierarchy levels of several dimensions lead to query boxes on the fact table.
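The interval for a restriction on an inner hierarchy level follows from filling the remaining bits with zeros and ones. A minimal sketch, assuming the bit widths 3, 3, 1, 3 of the example; `prefix_range` and `in_query_box` are hypothetical helper names, and a query box is modeled as one interval per dimension.

```python
BITS = (3, 3, 1, 3)  # bits per hierarchy level, as in the example

def prefix_range(prefix_ordinals, bits=BITS):
    """Interval of compound surrogates that share the given path prefix."""
    prefix = 0
    for ordinal, width in zip(prefix_ordinals, bits):
        prefix = (prefix << width) | ordinal
    free_bits = sum(bits[len(prefix_ordinals):])   # levels below the restriction
    low = prefix << free_bits                      # remaining bits all 0
    high = low | ((1 << free_bits) - 1)            # remaining bits all 1
    return low, high

def in_query_box(keys, box):
    """True if every dimension key lies in that dimension's interval."""
    return all(low <= key <= high for key, (low, high) in zip(keys, box))

usa = prefix_range((4, 1))          # North America --> USA
print(usa)
print(in_query_box((538,), [usa]))  # cs(H, Bar) = 538 lies inside the interval
```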
25 Complex Hierarchies
- time with months and weeks: both restrictions lead to intervals on the level of days
- Example of Fig. 4-4
- proposal for multiple hierarchies: choose the most useful one (depending on the query profile) or consider multiple hierarchies as several independent hierarchies. Caution: this increases the number of dimensions!
- Time-variant hierarchies: extend by a time interval of validity, see example Fig. 4-5
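That both a month and a week restriction become intervals on the day level can be sketched with the standard library. The week convention is an assumption (ISO 8601 weeks, Monday to Sunday), since the slides do not fix one; `date.toordinal()` stands in for a day surrogate.

```python
import calendar
from datetime import date, timedelta

def month_interval(year, month):
    """First and last day of a calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def iso_week_interval(year, week):
    """Monday and Sunday of an ISO 8601 week (assumed week convention)."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)

def day_range(interval):
    """Map a (first day, last day) pair to an interval of integer day numbers."""
    first, last = interval
    return first.toordinal(), last.toordinal()

print(month_interval(1997, 2), day_range(month_interval(1997, 2)))
print(iso_week_interval(1997, 1), day_range(iso_week_interval(1997, 1)))
```

A week may start in one month or year and end in the next, so weeks do not nest inside months; both still map to contiguous day intervals.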
26 Fig. 4-4 Complex Hierarchy Graphs, panels (a) and (b)
Customer graph nodes: REGION, NATION, CUSTOMER TYPE, TRADE TYPE, CUSTOMER SIZE, CUSTOMER
Time graph nodes: YEAR, MONTH, WEEK, DAY
27 Fig. 4-5 Change of a hierarchy over time
Nodes: CUSTOMER; South Europe, North America, ...; USA, Canada; Retail, Wholesale; Bar, Restaurant; Joe's Sports Bar, with the validity intervals Year < 1997 and Year > 1997
29 Processing a query box in sort order with the Tetris algorithm