Title: MDL Summarization with Holes
1MDL Summarization with Holes
- Shaofeng Bu
- Laks V.S. Lakshmanan
- Raymond T. Ng
- University of British Columbia, Canada
2Introduction
- Multi-dimensional OLAP queries typically produce
data intensive answers - Often the question is how to express the large
answer set of cells that satisfy the OLAP query
conditions - Simple enumeration accurate but not necessarily
the most intuitive - Summaries not (necessarily) 100 accurate but
can be more intuitive and informative. - Summarized answers can be more easily understood
3OLAP Data Cube Example
clothes
- Each dimension is associated with a hierarchical
tree
womens
mens
womens jeans
mens jeans
formal wear
dress pants
dress skirts
jackets
blouses
tops
skirts
ties
Vancouver
Edmonton
northwest
San Jose
San Francisco
Chicago
midwest
location
Minneapolis
Boston
Summit
northeast
Albany
New York
4OLAP Data Cube Example
clothes
- Data Cell (c1,c2), c1,c2 are leaf-nodes
- in axis-trees, e.g. (Vancouver, ties)
- Data Region describes all data cells covered by
given nodes in the axis-trees, (x1, y1), e.g. - (Vancouver, ties)
- (Vancouver, womens)
- (midwest, womens)
womens
mens
womens jeans
mens jeans
formal wear
dress pants
dress skirts
jackets
blouses
tops
skirts
ties
Vancouver
Edmonton
northwest
San Jose
San Francisco
Chicago
midwest
location
Minneapolis
Boston
Summit
northeast
Albany
New York
5OLAP Data Cube Example
clothes
- Blue cells the cells that satisfy the query
conditions - How to find a summary of the blue cells in a data
cube?
womens
mens
womens jeans
mens jeans
formal wear
dress pants
dress skirts
jackets
blouses
tops
skirts
ties
Vancouver
Edmonton
northwest
San Jose
San Francisco
Chicago
midwest
location
Minneapolis
Boston
Summit
northeast
Albany
New York
6MDL Summarization
- MDL Minimum Description Length
- Use regions to cover the blue cells
- Length of an MDL description is the number of
included regions and cells - MDL is to find the description with the minimum
length.
7An Example of MDL Summarization
clothes
womens
mens
womens jeans
mens jeans
formal wear
dress pants
dress skirts
jackets
blouses
tops
skirts
ties
Vancouver
Edmonton
northwest
San Jose
San Francisco
Chicago
midwest
location
Minneapolis
Boston
Summit
northeast
Albany
New York
8A Motivating Example A New Case
clothes
womens
mens
womens jeans
mens jeans
formal wear
dress pants
dress skirts
jackets
blouses
tops
skirts
ties
Vancouver
Edmonton
northwest
San Jose
San Francisco
Chicago
midwest
location
Minneapolis
Boston
Summit
northeast
Albany
New York
9Can we do better?
- Yes!
- We present a new compression approach MDL with
Holes - Identify regions with blue cells, even if they
contain non-blue cells - Express the included blue cells by using regions
with the exception of the covered non-blue cells - Non-blue cells are called holes.
10A Motivating Example MDL with Holes
clothes
R1-(Vancouver,Skirts)
womens
mens
R3-(Vancouver,Skirts)
womens jeans
mens jeans
formal wear
dress pants
dress skirts
jackets
blouses
tops
skirts
ties
R9-(Boston,ties) -(New York, dress skirts)
Vancouver
Edmonton
northwest
San Jose
San Francisco
- MDL with Holes
- Length 63312
- MDL Approach
- Length is 18
Chicago
location
midwest
Minneapolis
Boston
Summit
northeast
Albany
New York
11Problem Statements
- MDL with Holes (MDLH) is to find a description
with holes that has the minimum length and the
maximum benefit. - In practice, we can drill down on regions to get
additional details.
12Definitions Length Benefit
- Given a set B of data cells (blue cells), an MDLH
description for B - DS H ,
- S is a set of data regions,
- H is a set of data cells, also called holes,
- D covers exactly the data cells in B.
- Length total number of the included regions and
cells in the description. - DSH
- Benefit how much shorter is the MDLH summary
than the enumeration of B. - Benefit (D) B D
- B1a, b, c
- D1 s d
- D12
- Benefit(D1) B1 - D1 1
- B2e, g
- D2 t f h
- D2 3
- Benefit(D2) B2 - D2 -1
13Related Work
- The Generalized MDL Approach for Summarization,
Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB
2002 - Reduce description length by allowing non-blue
cells to be covered in the regions - The regions are not pure.
- Concise Descriptions of Subsets of Structured
Sets, Alberto O. Mendelzon Ken Q. Pu, PODS 2003 - Allow Cartesian products to be formed
- Not purely hierarchical NP Completeness result
is less surprising - What about the pure hierarchical?
- Intelligent Rollups in Multidimensional OLAP
Data, Gayatri Sathe and Sunita Sarawagi, VLDB
2001 - Only report consistent generalization A tuple
can be generalized along a set of dimensions only
if it can be generalized along all subsets of
dimensions.
14Outline
- Introduction to MDL with Holes
- A motivating example
- 1-D Case MDLH is Tractable
- 2-D Case MDLH is NP-Complete
- Heuristics
- A Greedy Heuristic
- Dynamic Programming
- Quadratic Programming
- Experimental Results
- Summarization on Holes An Extension
- Conclusions Contributions
151-D Case MDLH is Tractable
- MDLH is Tractable the Optimal MDLH description,
which has the maximum benefit, can be generated
in polynomial time in 1-D case.
- x
- D1 x d f j
- Benefit(D1) 7 4 3
- D2(s d ) e ( u j )
- Beneift(D2) 7 5 2
- y
- D3 y m p q r
- Benefit(D3) 4 5 -1
- D4 ( v m ) o ,
- Benefit(D4) 4 3 1
- z
- D5 z d f j m p q r
- Benefit(D5) 11 8 3
- D6(x d f j)( v m o )
- Benefit(D6) 11 7 4
16Outline
- Introduction to MDL with Holes
- A motivating example
- 1-D Case MDLH is Tractable
- 2-D Case MDLH is NP-Hard
- Heuristics
- A Greedy Heuristic
- Dynamic Programming
- Quadratic Programming
- Experimental Results
- Summarization on Holes An Extension
- Conclusions Contributions
172-D Case Optimality is not Preserved Any More
8
rows length benefit
1
2
3
4
5
6
7
(f,8),(g,8) 3 2
a
b
(c,8),(d,8),(e,8) 4 0
c
(a,8),(b,8) 5 -2
d
i
e
f
columns length benefit
g
(i,1) 3 2
Optimal Solution (c,8)(d,8)(e,8)(i,2)(i,e)(
i,4) -(c,2)(c,3)(c,4)(d,2)(d,3)(d,4) (e,2
)(e,3)(e,4) (f,1)(g,1)(f,6)(g,7) Length
19 Benefit 28-19 9
(i,5) 5 -2
18MDLH is NP-Hard in 2-D Case
- It is NP-Hard to find the optimal MDLH
description in 2-D data cube - Not a Trivial Proof Details are in the paper
- Reduction Strategy
Maximum Induced Subgraph in Complete
Edge-Weighted(CEW) Bipartite Graph
MDL with Holes
19Outline
- Introduction to MDL with Holes
- A motivating example
- 1-D Case MDLH is Tractable
- 2-D Case MDLH is NP-Hard
- Heuristics
- A Greedy Heuristic
- Dynamic Programming
- Quadratic Programming
- Experimental Results
- Summarization on Holes An Extension
- Conclusions Contributions
20Heuristics for MDLH
- Greedy
- Each time, choose the row/column with the most
benefit - Dynamic Programming
- A bottom-up method to get the description of a
region from the descriptions of its children
regions - Quadratic Programming
- Using a quadratic function to represent the
benefit of a 2-d data cube
21Example for Comparison with Heuristics
- The optimal description for this example
- (e,1)-(a,1)(e,2)-(b,2)(e,3)-(b,3)(d,4)(b,5)
- (e,6)(e,8)(a,11)-(a,8)
- Length 12
- Benefit 8
22Heuristics A Greedy Heuristic
region length benefit holes
(e,6) 1 3 -
(d,10) 2 2 (d,5)
(e,1) 2 1 (a,1)
(e,2) 2 1 (b,2)
(e,3) 2 1 (b,3)
(e,8) 2 1 (a,8)
(a,11) 2 1 (a,8)
(c,10) 3 0 (c,4)(c,5)
23Greedy Why it is not optimal?
Description from Greedy
- A selection of row/column may reduce more total
benefit
24Heuristics Dynamic Programming
L The Length of a Region
t2
1 2 3 4 5 10 6 7 8 9 11 12
a 2 2 4
b 2 2 4
c 3 2 5
d 2 2 4
e 2 2 2 1 1 8 1 1 2 1 5 13
t1
S Selection of Rows Columns
1 2 3 4 5 10 6 7 8 9 11 12
a t2 g t2
b t2 t2 t2
c t2 t2 t2
d g t2 t2
e g g g t1 t1 t2 g t1 g t1 t2 t2
- (a,10) (a,2) (a,3)
- L(a,10)2, S(a,10)t2
- (e,4) (d,4)
- L(e,4)1, S(e,4)t1
- (d,10) (d,10) (d,5)
- L(d,10)2, S(d,10)g
25Heuristics Dynamic Programming(2)
S 1 2 3 4 5 10 6 7 8 9 11 12
a t2 g t2
b t2 t2 t2
c t2 t2 t2
d g t2 t2
e g g g t1 t1 t2 g t1 g t1 t2 t2
t2
t1
D(x1,x2)description for region (x1,x2)
S (e,12)t2
S (e,10)t2
S (e,11)t2
Generated Description (e,1)-(a,1)(e,2)-(b,2)(e,
3)-(b,3)(d,4)(b,5) (e,6)(a,7)(e,8)-(a,8)(a,9
) The length is 13 and the benefit is 20-13 7
26Dynamic Programming Why it is not optimal?
Description by Dynamic Programming
Optimal Description
- Misses the combination of rows and columns
27Heuristics Quadratic Programming
- Use variables to represent rows/columns for a
variable v - v1 the corresponding row/column is selected
- v0 the corresponding row/column is not
selected - f Benefit( D)
- Maximizing the benefit is to minimize the value
of f - For the previous example, quadratic programming
generates the optimal description - Optimality is not guaranteed.
28Outline
- Introduction to MDL with Holes
- A motivating example
- 1-D Case MDLH is Tractable
- 2-D Case MDLH is NP-Hard
- Heuristics
- A Greedy Heuristic
- Dynamic Programming
- Quadratic Programming
- Experimental Results
- Summarization on Holes An Extension
- Conclusions Contributions
29Experiments
- We ran a set of experiments on the TPC-H
benchmark data set - We compared the three MDLH heuristics with MDL
and GMDL.
30Experimental Results Comparison of All Methods
- Compression Ratio
- MDLH-Quadratic generates the most concise
descriptions a yardstick of quality - MDLH-Dynamic is a very close second.
31Experimental Results Compression Ratio
- The more children per parent node, the greater
the benefit
32Experimental Results Running time
- Running time Scalability
- MDLH-Greedy is the fastest
- MDLH-Dynamic runs slower than MDLH-Greedy, but it
is still scalable w.r.t. the number of cells
33Outline
- Introduction to MDL with Holes
- A motivating example
- 1-D Case MDLH is Tractable
- 2-D Case MDLH is NP-Hard
- Heuristics
- A Greedy Heuristic
- Dynamic Programming
- Quadratic Programming
- Experimental Results
- Summarization on Holes An Extension
- Conclusions Contributions
34Extension Summarization on holes
- As the blue density becomes high, a large part of
the MDLH description is made up of holes. - Can we further reduce the total length by
summarizing Holes?
- MDLH description is
- (a,11)-(a,6)(a,8)(a,9)
- (d,11)-(d,6)(d,7)(d,8) (b,6)(c,8)
- Total length is 10.
- Summarization on holes
- (a,6)(a,8)(a,9) (a,10)-(a,7)
- (d,6)(d,7)(d,8) (d,10)-(d,9)
- After summarization on holes
- (a,11) - (a,10) - (a,7)
- (d,11) - (d,10) - (d,9)
- (b,6) (c,8)
- Total length is 8.
35Conclusions Contributions
- We present a new method, MDLH, to compress the
answers of OLAP queries - We present a bottom-up algorithm for 1-d cube
- We proved the NP-Hardness of the MDLH problem
- We provided three heuristics for MDLH greedy,
dynamic programming, and quadratic programming - We extended the summarization on holes to further
reduce the total length - We did a set of experiments on the TPC-H
benchmark data to compare the heuristics.
36On going work
- Based on the summarization on blue cells and
summarization on holes, build a visualization
tool with MDLH summarization - Return summarized answers to users queries
- Provide drill down operation for users
- Browse details on blue cells
- Browse details on holes
- Design k-approximation algorithm for MDLH
- What is the best quality we can guarantee?