Title: Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions
1Compressed Data Cubes for OLAP Aggregate Query
Approximation on Continuous Dimensions
- Jayavel Shanmugasundaram
- University of Wisconsin
-
Usama Fayyad Paul Bradley
Microsoft Research
2Outline
- Motivation and Problem Definition
- Aggregation using Density Estimates
- Density Estimation using Clustering
- Performance Results
- Extensions for Improved Accuracy
- Conclusion and Future Work
3Motivation
- Decision Support crucial
- Means to analyze large volumes of data
- Provides competitive advantage
- Data Cube Important for Decision Support
- Provides multi-dimensional view of data
- Shows the big picture allowing progressive
drill-down and roll-ups
4Data Cube Example
CustId City Age Salary
1 Madison 50 60000 2 Madison
50 58000 5 Milwaukee 30 65000
6 Redmond 27 80000 9
Redmond 30 150000 ...
.
5Data Cube Example (Contd.)
Select d.city, d.age, avg(d.salary) From Data d
Group By
Cube(d.city, d.age)
6Dimension Hierarchies
- Each dimension may have many distinct values
- In example many cities, many age groups
- Especially continuous dimension values
- salary as a dimension (like age)
7Dimension Hierarchies
- Each dimension may have many distinct values
- In example many cities, many age groups
- Especially continuous dimension values
- salary as a dimension (like age)
- Organized into hierarchies
- Example Cities into regions, regions into states
- Salary as 50-60K, 60K-65K, 65K-70K
8Dimension Hierarchies (Contd.)
Select state(d.city), agegroup(d.age),
avg(d.salary) From Data d
Group By
Cube(state(d.city), agegroup(d.age))
Age
Wise
Young
All
27
30
50
All
City
Madison
Wisconsin
Milwaukee
Redmond
Washington
9Data Cube Implementation
- Primary technique Precomputation
- Important (or all!) parts of data cube
precomputed - Queries answered in many cases using precomputed
data
10Problems with Precomputation
- Space overhead
- Precomputed cube can be orders of magnitude
larger than original data - 1M Data tuples gt 200M Cube Tuples
11Problems with Precomputation
- Space overhead
- Precomputed cube can be orders of magnitude
larger than original data - 1M Data tuples gt 200M Cube Tuples
- Additional space overhead for multiple aggregate
functions
Select d.city, d.age, avg(d.salary),
count() From Data d
Group By Cube(d.city, d.age)
12Problems with Precomputation
Select agegrp(d.age), salgrp(d.salary),
count() From Data d
Group By
Cube(agegrp(d.age), salgrp(d.salary))
- agegrp specifies ranges on ages
- Example 20-22, 22-25, 25-30, 30-34, ...
- salgrp specifies ranges on salary
- Example 50K-55K, 55K-65K, 65K-90K, ...
- agegrp and salgrp set dynamically
- Allows flexibility of drilling down on specific
age groups, salary brackets
13Problems with Precomputation
- Cannot efficiently answer such queries!
Select f(d.a), g(d.b), avg(d.c) From Data d
Group By Cube(f(d.a), g(d.b))
where f and g are not pre-specified
- Dynamic dimension hierarchy
- Cannot precompute higher aggregate cells
- Computation from base aggregates done on-line
- Bad Performance
14The Problem
- Compressed data cubes
- addresses one part of storage problem
- Storage independent of number of aggregate
functions - addresses another part of storage problem
- Ability to efficiently cube on dynamic dimension
hierarchies - handles continuous dimensions
15Necessary Evil
- Approximation!
- Loss of information due to huge compression
16Necessary Evil
- Approximation!
- Loss of information due to huge compression
- Refined Problem Statement
- Efficient, compressed data cube for dynamic
dimension hierarchies providing high accuracy for
queries
17Our Focus
- Continuous dimensions
- Examples time, age, salary etc.
- Framework extends to discrete dimensions
18Outline
- Motivation and Problem Definition
- Aggregation using Density Estimates
- Density Estimation using Clustering
- Performance Results
- Extensions for Improved Accuracy
- Conclusion and Future Work
19Intuition
- Data records are points in a multi-dimensional
space
140000
120000
100000
Salary
80000
60000
40000
15
35
55
75
Age
20Intuition (Contd.)
- Key Observation
- If we know the multi-dimensional probability
density distribution (pdf), no need for data! - Aggregate values can be computed from the
probability density function - Notation
- For (age, salary) example, Pr(a, s) denotes
probability density function
21Computing Aggregates from PDF
- Aggregate queries specify
- Regions in multi-dimensional space
- Aggregation function of interest
140000
120000
Aggregate Function Count()
100000
Salary
80000
60000
40000
15
35
55
75
Age
22Computing Aggregates from PDF
- Number of Records in a region count()
23Computing Aggregates from PDF
- Number of Records in a region count()
- Sum of age in a region sum(d.age)
24Computing Aggregates from PDF
- Number of Records in a region count()
- Sum of age in a region sum(d.age)
25Taking Stock
- Single pdf representation used to handle various
aggregate functions (currently sum, count, avg) - Space saving
26Taking Stock
- Single pdf representation used to handle various
aggregate functions (currently sum, count, avg) - Space saving
- Main Issues
- Compact representation of pdf
- space saving
27Taking Stock
- Single pdf representation used to handle various
aggregate functions (currently sum, count, avg) - Space saving
- Main Issues
- Compact representation of pdf
- space saving
- Efficient integration of pdf
- time saving
28Taking Stock
- Single pdf representation used to handle various
aggregate functions (currently sum, count, avg) - Space saving
- Main Issues
- Compact representation of pdf
- space saving
- Efficient integration of pdf
- time saving
- Efficient generation of pdf
- Scaling to large data sizes
29Outline
- Motivation and Problem Definition
- Aggregation using Density Estimates
- Density Estimation using Clustering
- Performance Results
- Extensions for Improved Accuracy
- Conclusion and Future Work
30Clustering
- Viewed as identifying dense regions of pdf
140000
clusters
120000
100000
Salary
outlier
80000
60000
40000
15
35
55
75
Age
- Each cluster representation approximates the data
points within the cluster - Outliers capture unusual data points
31Virtues of Clustering
- Compact density estimate storage
- PDF represented as mixture of Gaussians etc.
- Efficient Integration
- Gaussians with diagonal co-variance matrices
- Scales to large data sets
- Birch Zhang et. al, Scaleable EM Bradley et.
al - Trade off storage vs. accuracy
- More clusters gt more accuracy and storage
32Outline
- Motivation and Problem Definition
- Aggregation using Density Estimates
- Density Estimation using Clustering
- Performance Results
- Extensions for Improved Accuracy
- Conclusion and Future Work
33Data Sets Used for Experiments
34Comparison with SQL Server
35Comparison with Sampling
36Outline
- Motivation and Problem Definition
- Aggregation using Density Estimates
- Density Estimation using Clustering
- Performance Results
- Extensions for Improved Accuracy
- Conclusion and Future Work
37High Level Algorithm
C Initial Cluster Model While (C is not
sufficiently accurate) do Grow new clusters
in C where it is not sufficiently
accurate End while C is the required cluster
model
38Refining Accuracy of Model
- Determining accuracy of model
- Break interesting regions of space into tiles
- Determine accuracy of tiles
tiles
140000
120000
100000
Salary
80000
60000
40000
15
35
55
75
Age
- Improving accuracy of model
- Grow clusters near offending tiles
39Experimental Results
- Desired 90 accuracy 90 of the time
40Outline
- Motivation and Problem Definition
- Aggregation using Density Estimates
- Density Estimation using Clustering
- Performance Results
- Extensions for Improved Accuracy
- Conclusion and Future Work
41Conclusion
- Efficient querying over continuous dimension data
cubes - Dramatic space reduction using clustering
- Methods to reduce approximation error
- Orders of magnitude reduction is storage and
response time
42Related Work
- Concurrent Work Poosala Ganti
- Multi-dimensional histograms
- Separate compressed cube for each measure
- Fixed (pre-discretized) dimension hierachies
- Wavelets Vitter Wang
- Quasi-Cubes Barbara Sullivan