Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions

Provided by: jayavelsha

1
Compressed Data Cubes for OLAP Aggregate Query
Approximation on Continuous Dimensions

Jayavel Shanmugasundaram
University of Wisconsin

Usama Fayyad, Paul Bradley
Microsoft Research
2
Outline

Motivation and Problem Definition
Aggregation using Density Estimates
Density Estimation using Clustering
Performance Results
Extensions for Improved Accuracy
Conclusion and Future Work

3
Motivation

Decision Support crucial
Means to analyze large volumes of data
Provides competitive advantage
Data Cube Important for Decision Support
Provides multi-dimensional view of data
Shows the big picture allowing progressive
drill-down and roll-ups

4
Data Cube Example
CustId  City       Age  Salary
1       Madison    50   60000
2       Madison    50   58000
5       Milwaukee  30   65000
6       Redmond    27   80000
9       Redmond    30   150000
...
5
Data Cube Example (Contd.)
Select d.city, d.age, avg(d.salary)
From Data d
Group By Cube(d.city, d.age)
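
As an illustration of what the cube query above computes, here is a minimal Python sketch that averages salary over every subset of the grouping dimensions. It uses only the five rows listed on the example slide (the rows elided by "..." are left out), and the function name cube_avg and the "ALL" marker for a rolled-up dimension are assumptions made for this sketch, not part of the presentation.

```python
from collections import defaultdict
from itertools import combinations

# The five rows shown on the "Data Cube Example" slide; the "..." rows are omitted.
rows = [
    {"city": "Madison", "age": 50, "salary": 60000},
    {"city": "Madison", "age": 50, "salary": 58000},
    {"city": "Milwaukee", "age": 30, "salary": 65000},
    {"city": "Redmond", "age": 27, "salary": 80000},
    {"city": "Redmond", "age": 30, "salary": 150000},
]


def cube_avg(rows, dims, measure):
    """Average `measure` for every subset of `dims`.

    A dimension that is rolled up is reported as "ALL" in the group key.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                sums[key] += row[measure]
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


cube = cube_avg(rows, ("city", "age"), "salary")
for key in sorted(cube, key=str):
    print(key, cube[key])
```

Even for this tiny input the cube holds more cells than there are rows, which is the space overhead the later slides discuss.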
7
Dimension Hierarchies

Each dimension may have many distinct values
In the example: many cities, many age groups
Especially continuous dimension values
Salary as a dimension (like age)
Organized into hierarchies
Example: cities into regions, regions into states
Salary as 50K-60K, 60K-65K, 65K-70K

8
Dimension Hierarchies (Contd.)
Select state(d.city), agegroup(d.age), avg(d.salary)
From Data d
Group By Cube(state(d.city), agegroup(d.age))
[Figure: dimension hierarchies. Age: All > Wise, Young > 27, 30, 50. City: All > Wisconsin, Washington > Madison, Milwaukee, Redmond.]
9
Data Cube Implementation

Primary technique: precomputation
Important (or all!) parts of the data cube are precomputed
Queries are answered in many cases using precomputed data

11
Problems with Precomputation

Space overhead
Precomputed cube can be orders of magnitude larger than original data
1M data tuples => 200M cube tuples
Additional space overhead for multiple aggregate functions

Select d.city, d.age, avg(d.salary), count(*)
From Data d
Group By Cube(d.city, d.age)

12
Problems with Precomputation
Select agegrp(d.age), salgrp(d.salary), count(*)
From Data d
Group By Cube(agegrp(d.age), salgrp(d.salary))

agegrp specifies ranges on ages
Example: 20-22, 22-25, 25-30, 30-34, ...
salgrp specifies ranges on salary
Example: 50K-55K, 55K-65K, 65K-90K, ...
agegrp and salgrp are set dynamically
Allows flexibility of drilling down on specific age groups, salary brackets
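
To make "set dynamically" concrete, the following Python sketch builds the grouping functions from range boundaries supplied at query time and then counts records per (age range, salary range) cell. The boundary values, the half-open [lo, hi) reading of each range, the "other" label, and the helper name make_grouper are all assumptions for illustration; the sketch computes only the finest group-by, not the full cube.

```python
from bisect import bisect_right
from collections import Counter


def make_grouper(edges):
    """Return a function that maps a value to the label of its range [lo, hi)."""
    def group(value):
        i = bisect_right(edges, value) - 1
        if i < 0 or i >= len(edges) - 1:
            return "other"  # value falls outside all requested ranges
        return f"{edges[i]}-{edges[i + 1]}"
    return group


# (age, salary) pairs from the five rows of the earlier example table.
rows = [(50, 60000), (50, 58000), (30, 65000), (27, 80000), (30, 150000)]

# Boundaries chosen by the analyst at query time (hypothetical values).
agegrp = make_grouper([20, 25, 30, 34, 60])
salgrp = make_grouper([50000, 55000, 65000, 90000, 200000])

counts = Counter((agegrp(age), salgrp(sal)) for age, sal in rows)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Because the boundaries are only known when the query arrives, none of these cells can be stored ahead of time; every query has to scan the base data or the base aggregates.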

13
Problems with Precomputation

Cannot efficiently answer such queries!

Select f(d.a), g(d.b), avg(d.c)
From Data d
Group By Cube(f(d.a), g(d.b))
where f and g are not pre-specified

Dynamic dimension hierarchy
Cannot precompute higher aggregate cells
Computation from base aggregates done on-line
Bad Performance

14
The Problem

Compressed data cubes: address one part of the storage problem
Storage independent of the number of aggregate functions: addresses another part of the storage problem
Ability to efficiently cube on dynamic dimension hierarchies: handles continuous dimensions

16
Necessary Evil

Approximation!
Loss of information due to huge compression
Refined problem statement:
Efficient, compressed data cube for dynamic dimension hierarchies, providing high accuracy for queries

17
Our Focus

Continuous dimensions
Examples: time, age, salary, etc.
Framework extends to discrete dimensions

18
Outline

Motivation and Problem Definition
Aggregation using Density Estimates
Density Estimation using Clustering
Performance Results
Extensions for Improved Accuracy
Conclusion and Future Work

19
Intuition

Data records are points in a multi-dimensional
space

[Figure: scatter plot of data records, Age (15 to 75) on the x-axis, Salary (40000 to 140000) on the y-axis]
20
Intuition (Contd.)

Key observation:
If we know the multi-dimensional probability density function (pdf), there is no need for the data!
Aggregate values can be computed from the probability density function
Notation:
For the (age, salary) example, Pr(a, s) denotes the probability density function

21
Computing Aggregates from PDF

Aggregate queries specify
Regions in multi-dimensional space
Aggregation function of interest

[Figure: Age (15 to 75) vs. Salary (40000 to 140000) scatter plot, annotated "Aggregate Function: Count()"]
24
Computing Aggregates from PDF

Number of records in a region: count(*)

Sum of age in a region: sum(d.age)

Average: Sum/Count, etc.
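
The formulas that accompanied these bullets did not survive in the transcript. A standard way to write them, assuming N is the total number of records and R is the query region in (age, salary) space (both symbols are introduced here and are not taken from the slides), would be:

```latex
\[
\mathrm{count}(R) \;\approx\; N \int_{R} \Pr(a, s)\, da\, ds
\]
\[
\mathrm{sum}_{\mathrm{age}}(R) \;\approx\; N \int_{R} a \,\Pr(a, s)\, da\, ds
\]
\[
\mathrm{avg}_{\mathrm{age}}(R) \;=\; \frac{\mathrm{sum}_{\mathrm{age}}(R)}{\mathrm{count}(R)}
\]
```

The factor N cancels in the average, so avg depends only on the density.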

28
Taking Stock

A single pdf representation is used to handle various aggregate functions (currently sum, count, avg)
Space saving
Main issues:
Compact representation of pdf (space saving)
Efficient integration of pdf (time saving)
Efficient generation of pdf (scaling to large data sizes)

29
Outline

Motivation and Problem Definition
Aggregation using Density Estimates
Density Estimation using Clustering
Performance Results
Extensions for Improved Accuracy
Conclusion and Future Work

30
Clustering

Viewed as identifying dense regions of pdf

[Figure: Age (15 to 75) vs. Salary (40000 to 140000) scatter plot with clusters and an outlier marked]

Each cluster representation approximates the data
points within the cluster
Outliers capture unusual data points

31
Virtues of Clustering

Compact density estimate storage
PDF represented as a mixture of Gaussians, etc.
Efficient integration
Gaussians with diagonal covariance matrices
Scales to large data sets
BIRCH (Zhang et al.), Scalable EM (Bradley et al.)
Trade off storage vs. accuracy
More clusters => more accuracy and more storage
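
The "efficient integration" point can be illustrated with a short Python sketch. With diagonal covariance, a Gaussian's integral over an axis-aligned box factors into one-dimensional terms, each available in closed form through the error function. The sketch below estimates count, sum, and avg for a box-shaped region from a mixture model; the cluster tuple layout (weight, means, sigmas), the function names, and the example parameters are assumptions for illustration, not the authors' implementation.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def mass_1d(mu, sigma, lo, hi):
    """Integral of the N(mu, sigma^2) density over [lo, hi]."""
    return norm_cdf((hi - mu) / sigma) - norm_cdf((lo - mu) / sigma)


def first_moment_1d(mu, sigma, lo, hi):
    """Integral of x times the N(mu, sigma^2) density over [lo, hi]."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return mu * (norm_cdf(b) - norm_cdf(a)) - sigma * (norm_pdf(b) - norm_pdf(a))


def estimate(clusters, n_records, box, sum_dim):
    """Estimate (count, sum, avg) of dimension `sum_dim` inside `box`.

    clusters: list of (weight, means, sigmas), weights summing to 1.
    box: list of (lo, hi) bounds, one pair per dimension.
    """
    count = total = 0.0
    for weight, means, sigmas in clusters:
        masses = [mass_1d(m, s, lo, hi)
                  for m, s, (lo, hi) in zip(means, sigmas, box)]
        count += n_records * weight * math.prod(masses)
        # For the sum, swap the mass along sum_dim for the first moment.
        others = math.prod(p for i, p in enumerate(masses) if i != sum_dim)
        lo, hi = box[sum_dim]
        moment = first_moment_1d(means[sum_dim], sigmas[sum_dim], lo, hi)
        total += n_records * weight * others * moment
    avg = total / count if count > 0 else float("nan")
    return count, total, avg


# One hypothetical cluster over (age, salary), queried on a box around its mean.
clusters = [(1.0, (40.0, 60000.0), (10.0, 10000.0))]
count, total, avg = estimate(clusters, 1000,
                             [(30.0, 50.0), (50000.0, 70000.0)], sum_dim=0)
print(count, total, avg)
```

The cost of a query is proportional to the number of clusters times the number of dimensions, independent of the number of data records.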

32
Outline

Motivation and Problem Definition
Aggregation using Density Estimates
Density Estimation using Clustering
Performance Results
Extensions for Improved Accuracy
Conclusion and Future Work

33
Data Sets Used for Experiments
34
Comparison with SQL Server
35
Comparison with Sampling
36
Outline

Motivation and Problem Definition
Aggregation using Density Estimates
Density Estimation using Clustering
Performance Results
Extensions for Improved Accuracy
Conclusion and Future Work

37
High Level Algorithm
C = Initial Cluster Model
While (C is not sufficiently accurate) do
    Grow new clusters in C where it is not sufficiently accurate
End while
C is the required cluster model
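
The loop structure above can be sketched in Python with a deliberately simple stand-in. The slides do not say how accuracy is measured or how clusters are grown, so everything concrete here is an assumption: the data is one-dimensional, a "cluster" is just a center, the model counts as accurate when every point lies within tol of some center, and a new cluster is grown at the worst-covered point.

```python
def refine(points, initial_centers, tol, max_clusters=50):
    """Grow clusters until every point is within `tol` of some center.

    Toy stand-in for the high-level algorithm: the accuracy test and the
    growth rule are illustrative assumptions, not the authors' method.
    """
    centers = list(initial_centers)  # C = initial cluster model

    def error(p):
        # Distance from a point to its nearest current center.
        return min(abs(p - c) for c in centers)

    while len(centers) < max_clusters:
        worst = max(points, key=error)
        if error(worst) <= tol:      # C is sufficiently accurate
            break
        centers.append(worst)        # grow a cluster where C is inaccurate
    return centers


model = refine([1.0, 1.2, 5.0, 5.1, 9.0], [1.0], tol=0.5)
print(sorted(model))
```

The max_clusters cap reflects the storage vs. accuracy trade-off: growth stops either when the model is accurate enough or when the storage budget is spent.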
38
Refining Accuracy of Model