MDL Summarization with Holes

1
MDL Summarization with Holes
  • Shaofeng Bu
  • Laks V.S. Lakshmanan
  • Raymond T. Ng
  • University of British Columbia, Canada

2
Introduction
  • Multi-dimensional OLAP queries typically produce
    data-intensive answers
  • Often the question is how to express the large
    answer set of cells that satisfy the OLAP query
    conditions
  • Simple enumeration: accurate, but not necessarily
    the most intuitive
  • Summaries: not (necessarily) 100% accurate, but
    can be more intuitive and informative
  • Summarized answers can be more easily understood

3
OLAP Data Cube Example
  • Each dimension is associated with a hierarchical
    tree
[Figure: 2-D data cube with one axis tree per dimension.
 Clothes axis nodes: clothes, womens, mens, womens jeans, mens jeans,
 formal wear, dress pants, dress skirts, jackets, blouses, tops, skirts, ties.
 Location axis nodes: location, northwest, midwest, northeast, Vancouver,
 Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit,
 Albany, New York.]

4
OLAP Data Cube Example
  • Data cell: (c1, c2), where c1 and c2 are leaf nodes
    in the axis trees, e.g. (Vancouver, ties)
  • Data region: (x1, y1), which describes all data cells
    covered by the given nodes in the axis trees, e.g.
  • (Vancouver, ties)
  • (Vancouver, womens)
  • (midwest, womens)
[Figure: the data cube over the clothes and location axis trees]

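The cell/region definitions can be made concrete with a small sketch. The two trees below are hypothetical and heavily simplified (the real hierarchy was in the slide figure and is only partly recoverable), and the names `LOCATION`, `CLOTHES`, `leaves` and `region_cells` are illustrative, not from the presentation.

```python
from itertools import product

# Hypothetical, simplified axis trees (parent -> children).
LOCATION = {
    "location": ["northwest", "midwest", "northeast"],
    "northwest": ["Vancouver", "Edmonton"],
    "midwest": ["Chicago", "Minneapolis"],
    "northeast": ["Boston", "New York"],
}
CLOTHES = {
    "clothes": ["womens", "mens"],
    "womens": ["blouses", "skirts"],
    "mens": ["jackets", "ties"],
}

def leaves(tree, node):
    """Return the leaf nodes under `node`; a leaf is its own only leaf."""
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(tree, child)]

def region_cells(region):
    """Expand a region (x1, y1) into the data cells it covers."""
    x, y = region
    return list(product(leaves(LOCATION, x), leaves(CLOTHES, y)))

print(region_cells(("Vancouver", "ties")))       # a region that is a single cell
print(len(region_cells(("midwest", "womens"))))  # 2 cities x 2 items
```

A data cell is simply a region whose two nodes are both leaves, which is why (Vancouver, ties) appears in both lists on the slide.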
5
OLAP Data Cube Example
  • Blue cells: the cells that satisfy the query
    conditions
  • How to find a summary of the blue cells in a data
    cube?
[Figure: the data cube over the clothes and location axis trees, with the blue cells marked]

6
MDL Summarization
  • MDL: Minimum Description Length
  • Use regions to cover the blue cells
  • The length of an MDL description is the number of
    included regions and cells
  • The MDL problem is to find the description with the
    minimum length.
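As a rough illustration of hole-free MDL covering, the sketch below handles only a single axis tree (a 1-D simplification, not the 2-D cube of the slides). The tree, the blue set and the function name `mdl_cover` are assumptions made for the example.

```python
# Hypothetical 1-D axis tree (parent -> children) and blue leaves.
TREE = {
    "location": ["northwest", "midwest"],
    "northwest": ["Vancouver", "Edmonton"],
    "midwest": ["Chicago", "Minneapolis"],
}
BLUE = {"Vancouver", "Edmonton", "Chicago"}

def mdl_cover(tree, node, blue):
    """Shortest hole-free cover of the blue leaves under `node`."""
    children = tree.get(node, [])
    if not children:
        return [node] if node in blue else []
    parts = [mdl_cover(tree, child, blue) for child in children]
    # If every child is covered by itself as one region, the whole
    # subtree is blue and the parent alone can replace the children.
    if all(part == [child] for part, child in zip(parts, children)):
        return [node]
    return [name for part in parts for name in part]

cover = mdl_cover(TREE, "location", BLUE)
print(cover, "length =", len(cover))
```

Here northwest is entirely blue and collapses to one region, while midwest is not and contributes only the cell Chicago.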

7
An Example of MDL Summarization
[Figure: the data cube over the clothes and location axis trees, with the blue cells covered by regions]

8
A Motivating Example: A New Case
[Figure: the data cube over the clothes and location axis trees, with a new set of blue cells]

9
Can we do better?
  • Yes!
  • We present a new compression approach: MDL with
    Holes
  • Identify regions with blue cells, even if they
    contain non-blue cells
  • Express the included blue cells by using regions
    with the exception of the covered non-blue cells
  • Non-blue cells are called holes.

10
A Motivating Example: MDL with Holes
[Figure: the data cube over the clothes and location axis trees, annotated
 with regions and their holes:
 R1 - (Vancouver, skirts)
 R3 - (Vancouver, skirts)
 R9 - (Boston, ties) - (New York, dress skirts)]
  • MDL with Holes
  • Length: 6 + 3 + 3 = 12
  • MDL Approach
  • Length is 18

11
Problem Statements
  • The MDL with Holes (MDLH) problem is to find a
    description with holes that has the minimum length
    and, equivalently, the maximum benefit.
  • In practice, we can drill down on regions to get
    additional details.

12
Definitions: Length and Benefit
  • Given a set B of data cells (blue cells), an MDLH
    description for B is
  • D = S - H, where
  • S is a set of data regions,
  • H is a set of data cells, also called holes,
  • D covers exactly the data cells in B.
  • Length: total number of the included regions and
    cells in the description.
  • |D| = |S| + |H|
  • Benefit: how much shorter the MDLH summary is
    than the enumeration of B.
  • Benefit(D) = |B| - |D|
  • B1 = {a, b, c}
  • D1 = s - d
  • |D1| = 2
  • Benefit(D1) = |B1| - |D1| = 1
  • B2 = {e, g}
  • D2 = t - f - h
  • |D2| = 3
  • Benefit(D2) = |B2| - |D2| = -1
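The two worked examples on this slide can be checked with a few lines. This is a minimal sketch; the function names `length` and `benefit` are illustrative, and regions and holes are represented simply as sets of names.

```python
def length(regions, holes):
    """Length of a description D = S - H: number of regions plus holes."""
    return len(regions) + len(holes)

def benefit(blue, regions, holes):
    """How much shorter the description is than enumerating the blue cells."""
    return len(blue) - length(regions, holes)

# D1 = s - d describes B1 = {a, b, c}
b1 = benefit({"a", "b", "c"}, regions={"s"}, holes={"d"})
# D2 = t - f - h describes B2 = {e, g}
b2 = benefit({"e", "g"}, regions={"t"}, holes={"f", "h"})
print(b1, b2)
```

A negative benefit, as for D2, means that listing the blue cells directly is shorter than the region-with-holes form.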

13
Related Work
  • The Generalized MDL Approach for Summarization,
    Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB
    2002
  • Reduces description length by allowing non-blue
    cells to be covered in the regions
  • The regions are not pure.
  • Concise Descriptions of Subsets of Structured
    Sets, Alberto O. Mendelzon and Ken Q. Pu, PODS 2003
  • Allows Cartesian products to be formed
  • Not purely hierarchical: the NP-completeness result
    is less surprising
  • What about the purely hierarchical case?
  • Intelligent Rollups in Multidimensional OLAP
    Data, Gayatri Sathe and Sunita Sarawagi, VLDB
    2001
  • Only reports consistent generalizations: a tuple
    can be generalized along a set of dimensions only
    if it can be generalized along all subsets of
    dimensions.

14
Outline
  • Introduction to MDL with Holes
  • A motivating example
  • 1-D Case: MDLH is Tractable
  • 2-D Case: MDLH is NP-Hard
  • Heuristics
  • A Greedy Heuristic
  • Dynamic Programming
  • Quadratic Programming
  • Experimental Results
  • Summarization on Holes: An Extension
  • Conclusions and Contributions

15
1-D Case: MDLH is Tractable
  • MDLH is tractable: the optimal MDLH description,
    which has the maximum benefit, can be generated
    in polynomial time in the 1-D case.
  • x
  • D1 = x - d - f - j
  • Benefit(D1) = 7 - 4 = 3
  • D2 = (s - d) + e + (u - j)
  • Benefit(D2) = 7 - 5 = 2
  • y
  • D3 = y - m - p - q - r
  • Benefit(D3) = 4 - 5 = -1
  • D4 = (v - m) + o
  • Benefit(D4) = 4 - 3 = 1
  • z
  • D5 = z - d - f - j - m - p - q - r
  • Benefit(D5) = 11 - 8 = 3
  • D6 = (x - d - f - j) + (v - m) + o
  • Benefit(D6) = 11 - 7 = 4
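One way to see why the 1-D case is tractable is a bottom-up pass over the tree: at each node, either take the node as a single region minus its non-blue leaves, or keep the best descriptions of its children. The sketch below is an illustration under assumptions, not the authors' algorithm. The tree shape (z over x and y, x over s, t, u, and so on) is inferred from the descriptions D1 to D6 on the slide, and the names `TREE`, `best` and `length` are hypothetical.

```python
# Tree inferred from the slide's example (an assumption; the figure is lost).
TREE = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": ["a", "b", "c", "d"],
    "t": ["e", "f"],
    "u": ["g", "h", "i", "j"],
    "v": ["k", "l", "m", "n"],
    "w": ["o", "p", "q", "r"],
}
HOLES = set("dfjmpqr")                      # non-blue leaves
BLUE = set("abcdefghijklmnopqr") - HOLES    # 11 blue leaves

def leaves(tree, node):
    children = tree.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(tree, child)]

def length(description):
    """Each item is (region, holes): one unit per region plus one per hole."""
    return sum(1 + len(holes) for _, holes in description)

def best(tree, node, blue):
    """Shortest description of the blue leaves under `node`."""
    children = tree.get(node, [])
    if not children:
        return [(node, [])] if node in blue else []
    under = leaves(tree, node)
    holes = [leaf for leaf in under if leaf not in blue]
    if len(holes) == len(under):            # nothing blue below this node
        return []
    split = [item for child in children for item in best(tree, child, blue)]
    whole = [(node, holes)]
    return whole if length(whole) <= length(split) else split

description = best(TREE, "z", BLUE)
print(description, length(description), len(BLUE) - length(description))
```

On this tree the pass returns x - d - f - j, v - m and o, which is D6 with length 7 and benefit 4. Each node is visited once, and the work per node is bounded by the number of leaves below it, so the running time is polynomial in the size of the tree.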

17
2-D Case: Optimality is not Preserved Any More
[Figure: 2-D grid with rows a-g under row node i and columns 1-7 under
 column node 8; blue cells marked]

rows                  length  benefit
(f,8), (g,8)               3        2
(c,8), (d,8), (e,8)        4        0
(a,8), (b,8)               5       -2

columns               length  benefit
(i,1)                      3        2
(i,5)                      5       -2

Optimal solution: (c,8), (d,8), (e,8), (i,2), (i,3), (i,4)
 - (c,2)(c,3)(c,4)(d,2)(d,3)(d,4)(e,2)(e,3)(e,4),
 plus the cells (f,1), (g,1), (f,6), (g,7)
Length: 19. Benefit: 28 - 19 = 9.

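The numbers on this slide can be reproduced with a short check. The blue-cell layout below is not given in the transcript; it is a reconstruction that is consistent with the row and column tables and with the optimal solution's holes and leftover cells, so treat it as an assumption. The helper `describe` is hypothetical.

```python
ROWS = "abcdefg"
COLS = range(1, 8)
# Reconstructed blue cells (assumption): row -> blue columns.
BLUE_COLS = {
    "a": {2, 3, 4}, "b": {2, 3, 4},
    "c": {1, 5, 6, 7}, "d": {1, 5, 6, 7}, "e": {1, 5, 6, 7},
    "f": {1, 2, 3, 4, 6}, "g": {1, 2, 3, 4, 7},
}
BLUE = {(r, c) for r, cols in BLUE_COLS.items() for c in cols}

def describe(rows, cols):
    """Length and benefit when the given full rows and columns are used as
    regions and every blue cell they miss is listed individually."""
    covered = {(r, c) for r in rows for c in COLS}
    covered |= {(r, c) for r in ROWS for c in cols}
    holes = covered - BLUE
    leftover = BLUE - covered
    total = len(rows) + len(cols) + len(holes) + len(leftover)
    return total, len(BLUE) - total

for row in "cde":
    print(row, describe([row], []))         # each of these rows alone: benefit 0
for col in (2, 3, 4):
    print(col, describe([], [col]))         # each of these columns alone: benefit 0
print(describe("cde", [2, 3, 4]))           # together: length 19, benefit 9
```

Rows c, d, e and columns 2, 3, 4 share the same nine holes, so selecting them together pays for those holes only once. That sharing is why choices that look worthless one at a time make up the optimal 2-D description.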
18
MDLH is NP-Hard in the 2-D Case
  • It is NP-Hard to find the optimal MDLH
    description in a 2-D data cube
  • Not a trivial proof: details are in the paper
  • Reduction strategy: reduce Maximum Induced Subgraph
    in a Complete Edge-Weighted (CEW) Bipartite Graph
    to MDL with Holes

20
Heuristics for MDLH
  • Greedy
  • Each time, choose the row/column with the most
    benefit
  • Dynamic Programming
  • A bottom-up method to get the description of a
    region from the descriptions of its children
    regions
  • Quadratic Programming
  • Using a quadratic function to represent the
    benefit of a 2-d data cube

21
Example for Comparison with Heuristics
  • The optimal description for this example:
  • (e,1)-(a,1), (e,2)-(b,2), (e,3)-(b,3), (d,4), (b,5),
    (e,6), (e,8), (a,11)-(a,8)
  • Length: 12
  • Benefit: 8

22
Heuristics: A Greedy Heuristic
region   length  benefit  holes
(e,6)         1        3  -
(d,10)        2        2  (d,5)
(e,1)         2        1  (a,1)
(e,2)         2        1  (b,2)
(e,3)         2        1  (b,3)
(e,8)         2        1  (a,8)
(a,11)        2        1  (a,8)
(c,10)        3        0  (c,4)(c,5)

23
Greedy: Why is it not optimal?
[Figure: description from Greedy]
  • Selecting one row/column can reduce the benefit still
    available from the others, so the total benefit may
    end up lower
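A minimal sketch of a greedy selection, to show how it can fall short. It is simplified to a flat grid where only full rows and full columns are candidate regions (the slides work on hierarchies), it stops when no candidate has a positive marginal benefit, and it uses a blue-cell layout reconstructed from the earlier 2-D example. All of these are assumptions, as are the names `line_cells` and `greedy`.

```python
ROWS = "abcdefg"
COLS = list(range(1, 8))
# Reconstructed blue cells from the 2-D example (assumption).
BLUE_COLS = {
    "a": {2, 3, 4}, "b": {2, 3, 4},
    "c": {1, 5, 6, 7}, "d": {1, 5, 6, 7}, "e": {1, 5, 6, 7},
    "f": {1, 2, 3, 4, 6}, "g": {1, 2, 3, 4, 7},
}
BLUE = {(r, c) for r, cols in BLUE_COLS.items() for c in cols}

def line_cells(kind, key):
    """All cells of one full row or one full column."""
    if kind == "row":
        return {(key, c) for c in COLS}
    return {(r, key) for r in ROWS}

def greedy(blue):
    candidates = [("row", r) for r in ROWS] + [("col", c) for c in COLS]
    chosen, covered = [], set()
    while True:
        best, best_gain = None, 0
        for cand in candidates:
            if cand in chosen:
                continue
            new = line_cells(*cand) - covered
            # Newly covered blue cells no longer need listing; the region
            # itself and each newly covered non-blue cell (hole) cost 1.
            gain = len(new & blue) - len(new - blue) - 1
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break
        chosen.append(best)
        covered |= line_cells(*best)
    holes = covered - blue
    leftover = blue - covered
    total = len(chosen) + len(holes) + len(leftover)
    return chosen, total, len(blue) - total

print(greedy(BLUE))
```

On this grid the sketch picks rows f and g and then stops with benefit 4, while the optimal description on the 2-D slide reaches 9. Rows c, d, e and columns 2, 3, 4 never look attractive one at a time, which is the kind of miss this slide describes.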

24
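Heuristics: Dynamic Programming
t1: row tree (root e); t2: column tree (root 12); - : not shown on the slide

L: the length of a region
    1   2   3   4   5  10   6   7   8   9  11  12
a   -   -   -   -   -   2   -   -   -   -   2   4
b   -   -   -   -   -   2   -   -   -   -   2   4
c   -   -   -   -   -   3   -   -   -   -   2   5
d   -   -   -   -   -   2   -   -   -   -   2   4
e   2   2   2   1   1   8   1   1   2   1   5  13

S: selection of rows/columns
    1   2   3   4   5  10   6   7   8   9  11  12
a   -   -   -   -   -  t2   -   -   -   -   g  t2
b   -   -   -   -   -  t2   -   -   -   -  t2  t2
c   -   -   -   -   -  t2   -   -   -   -  t2  t2
d   -   -   -   -   -   g   -   -   -   -  t2  t2
e   g   g   g  t1  t1  t2   g  t1   g  t1  t2  t2

  • (a,10) = (a,2) + (a,3)
  • L(a,10) = 2, S(a,10) = t2
  • (e,4) = (d,4)
  • L(e,4) = 1, S(e,4) = t1
  • (d,10) = (d,10) - (d,5)
  • L(d,10) = 2, S(d,10) = g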

25
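Heuristics: Dynamic Programming (2)
t1: row tree; t2: column tree; - : not shown on the slide

S: selection of rows/columns
    1   2   3   4   5  10   6   7   8   9  11  12
a   -   -   -   -   -  t2   -   -   -   -   g  t2
b   -   -   -   -   -  t2   -   -   -   -  t2  t2
c   -   -   -   -   -  t2   -   -   -   -  t2  t2
d   -   -   -   -   -   g   -   -   -   -  t2  t2
e   g   g   g  t1  t1  t2   g  t1   g  t1  t2  t2

D(x1,x2): description for region (x1,x2)
S(e,12) = t2
S(e,10) = t2
S(e,11) = t2
Generated description: (e,1)-(a,1), (e,2)-(b,2), (e,3)-(b,3), (d,4), (b,5),
(e,6), (a,7), (e,8)-(a,8), (a,9)
The length is 13 and the benefit is 20 - 13 = 7.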
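The L table can be recomputed with a short memoized recursion: for each region, take the cheapest of (g) the region itself minus its holes, (t1) the sum over the children in the row tree, and (t2) the sum over the children in the column tree. This is a sketch under assumptions. The blue-cell layout is reconstructed from the descriptions and tables on the slides rather than given in the transcript, and the names `ROW_TREE`, `COL_TREE` and `dp_length` are illustrative.

```python
from functools import lru_cache

ROW_TREE = {"e": ("a", "b", "c", "d")}
COL_TREE = {12: (10, 11), 10: (1, 2, 3, 4, 5), 11: (6, 7, 8, 9)}
# Reconstructed blue cells (assumption): row -> blue columns, 20 in total.
BLUE_COLS = {
    "a": {2, 3, 6, 7, 9},
    "b": {1, 5, 6, 8},
    "c": {1, 2, 3, 6, 8},
    "d": {1, 2, 3, 4, 6, 8},
}
BLUE = {(r, c) for r, cols in BLUE_COLS.items() for c in cols}

def leaves(tree, node):
    children = tree.get(node, ())
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(tree, child)]

@lru_cache(maxsize=None)
def dp_length(x1, x2):
    """Shortest length found for region (x1, x2) by the three options."""
    cells = {(r, c) for r in leaves(ROW_TREE, x1) for c in leaves(COL_TREE, x2)}
    blue = cells & BLUE
    if not blue:
        return 0
    options = [1 + len(cells - blue)]        # g: the region minus its holes
    if x1 in ROW_TREE:                       # t1: split along the row tree
        options.append(sum(dp_length(k, x2) for k in ROW_TREE[x1]))
    if x2 in COL_TREE:                       # t2: split along the column tree
        options.append(sum(dp_length(x1, k) for k in COL_TREE[x2]))
    return min(options)

print(dp_length("e", 10), dp_length("e", 11), dp_length("e", 12))
print("benefit:", len(BLUE) - dp_length("e", 12))
```

With this layout the recursion gives 8, 5 and 13 for (e,10), (e,11) and (e,12), matching the L table and the generated description's length of 13.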
26
Dynamic Programming: Why is it not optimal?
[Figure: the description by dynamic programming compared with the optimal description]
  • Misses descriptions that combine row regions and
    column regions

27
Heuristics: Quadratic Programming
  • Use variables to represent rows/columns; for a
    variable v:
  • v = 1: the corresponding row/column is selected
  • v = 0: the corresponding row/column is not
    selected
  • f = -Benefit(D)
  • Maximizing the benefit is equivalent to minimizing
    the value of f
  • For the previous example, quadratic programming
    generates the optimal description
  • Optimality is not guaranteed.
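The slide's actual quadratic function is not preserved in the transcript. The sketch below shows one plausible form for a flat grid, as an assumption: with 0/1 variables for rows and columns, cell (i, j) is covered when r_i + c_j - r_i*c_j = 1, which makes the benefit a quadratic function of the variables. To keep the example self-contained it maximizes that function by brute force on the reconstructed 7x7 grid from the 2-D example. It does not call a quadratic programming solver, which a realistic instance would need.

```python
from itertools import product

ROWS = "abcdefg"
COLS = list(range(1, 8))
# Reconstructed blue cells from the 2-D example (assumption).
BLUE_COLS = {
    "a": {2, 3, 4}, "b": {2, 3, 4},
    "c": {1, 5, 6, 7}, "d": {1, 5, 6, 7}, "e": {1, 5, 6, 7},
    "f": {1, 2, 3, 4, 6}, "g": {1, 2, 3, 4, 7},
}
BLUE = {(r, c) for r, cols in BLUE_COLS.items() for c in cols}

def benefit(row_sel, col_sel):
    """Quadratic benefit: +1 per covered blue cell, -1 per covered
    non-blue cell (hole), -1 per selected row or column."""
    total = -sum(row_sel.values()) - sum(col_sel.values())
    for i in ROWS:
        for j in COLS:
            covered = row_sel[i] + col_sel[j] - row_sel[i] * col_sel[j]
            total += covered if (i, j) in BLUE else -covered
    return total

def brute_force():
    """Try every 0/1 assignment (2**14 here); illustration only."""
    best = None
    for bits in product((0, 1), repeat=len(ROWS) + len(COLS)):
        row_sel = dict(zip(ROWS, bits[:len(ROWS)]))
        col_sel = dict(zip(COLS, bits[len(ROWS):]))
        value = benefit(row_sel, col_sel)
        if best is None or value > best[0]:
            best = (value, row_sel, col_sel)
    return best

value, row_sel, col_sel = brute_force()
print(value, [r for r in ROWS if row_sel[r]], [c for c in COLS if col_sel[c]])
```

On this grid the maximum is 9, reached by rows c, d, e with columns 2, 3, 4, the same selection as the optimal solution on the 2-D slide. Minimizing f = -benefit is the same problem with the sign flipped.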

29
Experiments
  • We ran a set of experiments on the TPC-H
    benchmark data set
  • We compared the three MDLH heuristics with MDL
    and GMDL.

30
Experimental Results: Comparison of All Methods
  • Compression Ratio
  • MDLH-Quadratic generates the most concise
    descriptions: a yardstick of quality
  • MDLH-Dynamic is a very close second.

31
Experimental Results: Compression Ratio
  • The more children per parent node, the greater
    the benefit

32
Experimental Results: Running Time
  • Running time and scalability
  • MDLH-Greedy is the fastest
  • MDLH-Dynamic runs slower than MDLH-Greedy, but it
    is still scalable w.r.t. the number of cells

34
Extension: Summarization on Holes
  • As the blue density becomes high, a large part of
    the MDLH description is made up of holes.
  • Can we further reduce the total length by
    summarizing holes?
  • MDLH description:
  • (a,11) - (a,6)(a,8)(a,9)
  • (d,11) - (d,6)(d,7)(d,8)
  • (b,6), (c,8)
  • Total length is 10.
  • Summarization on holes:
  • (a,6)(a,8)(a,9) = (a,10) - (a,7)
  • (d,6)(d,7)(d,8) = (d,10) - (d,9)
  • After summarization on holes:
  • (a,11) - ((a,10) - (a,7))
  • (d,11) - ((d,10) - (d,9))
  • (b,6), (c,8)
  • Total length is 8.
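The two totals on this slide can be checked by counting names in a nested description. In this sketch each item is a pair (name, exceptions), where the exceptions are items of the same shape. That representation and the function name `length` are assumptions made for illustration.

```python
def length(description):
    """Count every region or cell name, including nested exceptions."""
    return sum(1 + length(exceptions) for _, exceptions in description)

before = [
    ("(a,11)", [("(a,6)", []), ("(a,8)", []), ("(a,9)", [])]),
    ("(d,11)", [("(d,6)", []), ("(d,7)", []), ("(d,8)", [])]),
    ("(b,6)", []),
    ("(c,8)", []),
]
after = [
    ("(a,11)", [("(a,10)", [("(a,7)", [])])]),   # holes summarized as (a,10) - (a,7)
    ("(d,11)", [("(d,10)", [("(d,9)", [])])]),   # holes summarized as (d,10) - (d,9)
    ("(b,6)", []),
    ("(c,8)", []),
]
print(length(before), length(after))
```

Each group of three holes is replaced by one region and one exception, saving one unit per row and taking the total from 10 to 8.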

35
Conclusions and Contributions
  • We present a new method, MDLH, to compress the
    answers of OLAP queries
  • We present a bottom-up algorithm for the 1-D cube
  • We proved the NP-Hardness of the MDLH problem in
    the 2-D case
  • We provided three heuristics for MDLH: greedy,
    dynamic programming, and quadratic programming
  • We extended MDLH with summarization on holes to
    further reduce the total length
  • We ran a set of experiments on the TPC-H
    benchmark data to compare the heuristics.

36
Ongoing Work
  • Based on the summarization on blue cells and
    summarization on holes, build a visualization
    tool with MDLH summarization
  • Return summarized answers to users' queries
  • Provide a drill-down operation for users
  • Browse details on blue cells
  • Browse details on holes
  • Design a k-approximation algorithm for MDLH
  • What is the best quality we can guarantee?