Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions - PowerPoint PPT Presentation

1 / 42
About This Presentation
Title:

Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions

Description:

Allows flexibility of drilling down on specific age groups, salary brackets ... For (age, salary) example, Pr(a, s) denotes probability density function ... – PowerPoint PPT presentation

Number of Views:59
Avg rating:3.0/5.0
Slides: 43
Provided by: jayavelsha
Category:

less

Transcript and Presenter's Notes

Title: Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions


1
Compressed Data Cubes for OLAP Aggregate Query
Approximation on Continuous Dimensions
  • Jayavel Shanmugasundaram
  • University of Wisconsin

Usama Fayyad Paul Bradley
Microsoft Research
2
Outline
  • Motivation and Problem Definition
  • Aggregation using Density Estimates
  • Density Estimation using Clustering
  • Performance Results
  • Extensions for Improved Accuracy
  • Conclusion and Future Work

3
Motivation
  • Decision Support crucial
  • Means to analyze large volumes of data
  • Provides competitive advantage
  • Data Cube Important for Decision Support
  • Provides multi-dimensional view of data
  • Shows the big picture allowing progressive
    drill-down and roll-ups

4
Data Cube Example
CustId City Age Salary
1 Madison 50 60000 2 Madison
50 58000 5 Milwaukee 30 65000
6 Redmond 27 80000 9
Redmond 30 150000 ...
.
5
Data Cube Example (Contd.)
Select d.city, d.age, avg(d.salary) From Data d
Group By
Cube(d.city, d.age)
6
Dimension Hierarchies
  • Each dimension may have many distinct values
  • In example many cities, many age groups
  • Especially continuous dimension values
  • salary as a dimension (like age)

7
Dimension Hierarchies
  • Each dimension may have many distinct values
  • In example many cities, many age groups
  • Especially continuous dimension values
  • salary as a dimension (like age)
  • Organized into hierarchies
  • Example Cities into regions, regions into states
  • Salary as 50-60K, 60K-65K, 65K-70K

8
Dimension Hierarchies (Contd.)
Select state(d.city), agegroup(d.age),
avg(d.salary) From Data d
Group By
Cube(state(d.city), agegroup(d.age))
Age
Wise
Young
All
27
30
50
All
City
Madison
Wisconsin
Milwaukee
Redmond
Washington
9
Data Cube Implementation
  • Primary technique Precomputation
  • Important (or all!) parts of data cube
    precomputed
  • Queries answered in many cases using precomputed
    data

10
Problems with Precomputation
  • Space overhead
  • Precomputed cube can be orders of magnitude
    larger than original data
  • 1M Data tuples gt 200M Cube Tuples

11
Problems with Precomputation
  • Space overhead
  • Precomputed cube can be orders of magnitude
    larger than original data
  • 1M Data tuples gt 200M Cube Tuples
  • Additional space overhead for multiple aggregate
    functions

Select d.city, d.age, avg(d.salary),
count() From Data d
Group By Cube(d.city, d.age)

12
Problems with Precomputation
Select agegrp(d.age), salgrp(d.salary),
count() From Data d
Group By
Cube(agegrp(d.age), salgrp(d.salary))
  • agegrp specifies ranges on ages
  • Example 20-22, 22-25, 25-30, 30-34, ...
  • salgrp specifies ranges on salary
  • Example 50K-55K, 55K-65K, 65K-90K, ...
  • agegrp and salgrp set dynamically
  • Allows flexibility of drilling down on specific
    age groups, salary brackets

13
Problems with Precomputation
  • Cannot efficiently answer such queries!

Select f(d.a), g(d.b), avg(d.c) From Data d
Group By Cube(f(d.a), g(d.b))
where f and g are not pre-specified
  • Dynamic dimension hierarchy
  • Cannot precompute higher aggregate cells
  • Computation from base aggregates done on-line
  • Bad Performance

14
The Problem
  • Compressed data cubes
  • addresses one part of storage problem
  • Storage independent of number of aggregate
    functions
  • addresses another part of storage problem
  • Ability to efficiently cube on dynamic dimension
    hierarchies
  • handles continuous dimensions

15
Necessary Evil
  • Approximation!
  • Loss of information due to huge compression

16
Necessary Evil
  • Approximation!
  • Loss of information due to huge compression
  • Refined Problem Statement
  • Efficient, compressed data cube for dynamic
    dimension hierarchies providing high accuracy for
    queries

17
Our Focus
  • Continuous dimensions
  • Examples time, age, salary etc.
  • Framework extends to discrete dimensions

18
Outline
  • Motivation and Problem Definition
  • Aggregation using Density Estimates
  • Density Estimation using Clustering
  • Performance Results
  • Extensions for Improved Accuracy
  • Conclusion and Future Work

19
Intuition
  • Data records are points in a multi-dimensional
    space

140000
120000
100000
Salary
80000
60000
40000
15
35
55
75
Age
20
Intuition (Contd.)
  • Key Observation
  • If we know the multi-dimensional probability
    density distribution (pdf), no need for data!
  • Aggregate values can be computed from the
    probability density function
  • Notation
  • For (age, salary) example, Pr(a, s) denotes
    probability density function

21
Computing Aggregates from PDF
  • Aggregate queries specify
  • Regions in multi-dimensional space
  • Aggregation function of interest

140000
120000
Aggregate Function Count()
100000
Salary
80000
60000
40000
15
35
55
75
Age
22
Computing Aggregates from PDF
  • Number of Records in a region count()

23
Computing Aggregates from PDF
  • Number of Records in a region count()
  • Sum of age in a region sum(d.age)

24
Computing Aggregates from PDF
  • Number of Records in a region count()
  • Sum of age in a region sum(d.age)
  • Average Sum/Count etc.

25
Taking Stock
  • Single pdf representation used to handle various
    aggregate functions (currently sum, count, avg)
  • Space saving

26
Taking Stock
  • Single pdf representation used to handle various
    aggregate functions (currently sum, count, avg)
  • Space saving
  • Main Issues
  • Compact representation of pdf
  • space saving

27
Taking Stock
  • Single pdf representation used to handle various
    aggregate functions (currently sum, count, avg)
  • Space saving
  • Main Issues
  • Compact representation of pdf
  • space saving
  • Efficient integration of pdf
  • time saving

28
Taking Stock
  • Single pdf representation used to handle various
    aggregate functions (currently sum, count, avg)
  • Space saving
  • Main Issues
  • Compact representation of pdf
  • space saving
  • Efficient integration of pdf
  • time saving
  • Efficient generation of pdf
  • Scaling to large data sizes

29
Outline
  • Motivation and Problem Definition
  • Aggregation using Density Estimates
  • Density Estimation using Clustering
  • Performance Results
  • Extensions for Improved Accuracy
  • Conclusion and Future Work

30
Clustering
  • Viewed as identifying dense regions of pdf

140000
clusters
120000
100000
Salary
outlier
80000
60000
40000
15
35
55
75
Age
  • Each cluster representation approximates the data
    points within the cluster
  • Outliers capture unusual data points

31
Virtues of Clustering
  • Compact density estimate storage
  • PDF represented as mixture of Gaussians etc.
  • Efficient Integration
  • Gaussians with diagonal co-variance matrices
  • Scales to large data sets
  • Birch Zhang et. al, Scaleable EM Bradley et.
    al
  • Trade off storage vs. accuracy
  • More clusters gt more accuracy and storage

32
Outline
  • Motivation and Problem Definition
  • Aggregation using Density Estimates
  • Density Estimation using Clustering
  • Performance Results
  • Extensions for Improved Accuracy
  • Conclusion and Future Work

33
Data Sets Used for Experiments
34
Comparison with SQL Server
35
Comparison with Sampling
36
Outline
  • Motivation and Problem Definition
  • Aggregation using Density Estimates
  • Density Estimation using Clustering
  • Performance Results
  • Extensions for Improved Accuracy
  • Conclusion and Future Work

37
High Level Algorithm
C Initial Cluster Model While (C is not
sufficiently accurate) do Grow new clusters
in C where it is not sufficiently
accurate End while C is the required cluster
model
38
Refining Accuracy of Model
  • Determining accuracy of model
  • Break interesting regions of space into tiles
  • Determine accuracy of tiles

tiles
140000
120000
100000
Salary
80000
60000
40000
15
35
55
75
Age
  • Improving accuracy of model
  • Grow clusters near offending tiles

39
Experimental Results
  • Desired 90 accuracy 90 of the time

40
Outline
  • Motivation and Problem Definition
  • Aggregation using Density Estimates
  • Density Estimation using Clustering
  • Performance Results
  • Extensions for Improved Accuracy
  • Conclusion and Future Work

41
Conclusion
  • Efficient querying over continuous dimension data
    cubes
  • Dramatic space reduction using clustering
  • Methods to reduce approximation error
  • Orders of magnitude reduction is storage and
    response time

42
Related Work
  • Concurrent Work Poosala Ganti
  • Multi-dimensional histograms
  • Separate compressed cube for each measure
  • Fixed (pre-discretized) dimension hierachies
  • Wavelets Vitter Wang
  • Quasi-Cubes Barbara Sullivan
Write a Comment
User Comments (0)
About PowerShow.com