Mining Multidimensional Databases - PowerPoint PPT Presentation

About This Presentation

Title:

Mining Multidimensional Databases

Description:

Dimension attributes (e.g., Store, Product, Data) Measure attributes (e.g., Sale, Price) ... Store. Location. Product. Date. Sale. LA Shoes Jan. 01 $21,500 $85.99 ... – PowerPoint PPT presentation

Number of Views:95

Avg rating:3.0/5.0

Slides: 32

Provided by: cyru8

Learn more at: http://web.cs.ucla.edu

Category:

more less

Transcript and Presenter's Notes

Title: Mining Multidimensional Databases

1
Mining Multidimensional Databases
Cyrus Shahabi University of Southern
California Dept. of Computer Science Los Angeles,
CA 90089-0781 shahabi_at_usc.edu http//infolab.usc.e
du
2
Outline

Distributed Information Management Laboratory
Multidimensional Data Sets Applications
(Examples)
Focus Application On-Line Analytical Processing
(OLAP)
Traditional Solution
PROPOLYNE Progressive Evaluation of Polynomial
Range-Sum Query

Location PHE-306 (and 108)
URL http//infolab.usc.edu
Research Staff 2
Admin Staff 1
Ph.D. Students 9
M.S. Students 8
Undergraduates 2
Ph.D. Alumni 3
M.S. Alumni Many!
Sponsors

4
Multidimensional Data Sets Applications
(Examples)

Similarity search, clustering,
Stock prices ? time-series
Images ? color histograms
Shapes ? angle sequences
Web navigation ? feature vectors
Spatial temporal queries, mining queries,
GeoSpatial data ? latitude, longitude, altitude
Remote sensory data ? ltlat, long, alt, time,
temperaturegt
Immersidata ? ltx, y, z, t, vgt

5
Stock Prices
S1
Sn
6
More Similarity Search Clustering
C
Shapes ICDE99 ICME00
7
Spatial Temporal Data
Complex Queries
ACM-GIS01, VLDB01

Data types
A point ltlatitude, longitude, altitudegt or ltx,
y, zgt
A line-segment ltx1, y1, x2, y2gt
A line sequence of line-segments
A region A closed set of lines
Moving point ltx, y, tgt (e.g., car, train, )
Changing region ltregion, value, tgt (e.g.,
changing temperature of a county)

Queries
Rivers ltintersectgt Countries
Hospitals ltingt Cities
Taxi ltwithingt 5km of Home
ltin the nextgt 10 min
Experiments ltoverlapgt BrainR

Visual99
8
Immersidata and Mining Queries
CIKM01, UACHI01
9
Immersidata and Mining Queries
A dynamic sign, e.g., ASL colors
Subject-1
10
Focus Application On-Line Analytical Processing
(OLAP)
Market-Relation

Multidimensional data sets
Dimension attributes (e.g., Store, Product, Data)
Measure attributes (e.g., Sale, Price)
Range-sum queries
Average sale of shoes in CA in 2001
Number of jackets sold in Seattle in Sep. 2001
Tougher queries
Covariance of sale and price of jackets in CA in
2001 (correlation)
Variance of price of jackets in 2001 in Seattle

Store Location
Date
Sale
Product
Price
LA Shoes Jan. 01 21,500 85.99
NY Jacket June 01 28,700 45.99
. . .
. . .
. . .
. . .
. . .
Too Slow!
11
Traditional Solution Pre-computation Prefix-sum
Agrawal et. al 1997
Salary
Age
Age Salary
100k
120k
150k
40k
55k
65k

50k
55k
58k
100k
130k
57 120k

0
25
40
Age
50
Salary
60

Disadvantages
Measure attribute should be pre-selected
Aggregation function should be pre-selected
Works only for limited of aggregation
functions
Updates are expensive (need re-computation)

80
Result I II III IV
12
PROPOLYNE Progressive Evaluationof Polynomial
Range-Sum Query (w/ Rolfe Schmidt)

Overview of PROPOLYNE
Features of PROPOLYNE
Polynomial Range-Sum Queries as Vector Queries
Naive Evaluation of Vector Queries Using Wavelets
Fast Evaluation of Vector Queries Using Wavelets
Progressive/Approximate Evaluation of Vector
Queries Using Wavelets
Related Work
Performance Results
Conclusion

13
Overview of PROPOLYNE

Define range-sum query as vector product of query
vector and data vector
Offline Multidimensional wavelet transform of
data
At the query time lazy wavelet transform of
query vector (very fast)
Dot product of query and data vectors in the
transformed domain ? exact result in O(2 log N)d
Choose high-energy query coefficients only ? fast
approximate result (90 accuracy by retrieving lt
10 of data)
Choose query coefficients in order of energy ?
progressive result

14
PROPOLYNE Features

All attributes can be treated as either
dimension or measure attributes
Function can be any polynomial on any
combination of attributes, i.e., not only SUM,
AVERAGE and COUNT but also COVARIANCE, VARIANCE
and SUMSQUARE
Independent from how well the data set can be
compressed/approximated by wavelet
Because We show range-sum queries can always
be approximated well by wavelets (not always HAAR
though!)
Low update cost O(logd N)
Can be used for exact, approximate and
progressive range-sum query evaluation

15
Polynomial Range-Sum Queries

Polynomial range-sum queries Q(R,f,I)
I is a finite instance of schema F
R SubSetOf Dom(F), is the range
f Dom(F) ? R is a polynomial of degree d

16
Polynomial Range-Sum Queries as Vector Queries

The data frequency distribution of I is the
function DI Dom(F) ? Z that maps a
point x to the number of times it occurs in I
To emphasize the fact that a query is an
operator on the data frequency distribution, we
write
Example D(25,50)D(28,55)D(57,120)1 and
D(x)0 otherwise.

17
Overview of Wavelets
H operator computes a local average of array a
at every other point to produce an array of
summary coefficients Ha Example (Haar)
h1/2,1/2
G operator measures how much values in the array
a vary inside each of the summarized blocks to
compute an array of detail coefficients
Ga Example (Haar) g1/2,-12
aka wavelet coefficients of a
18
Naive Evaluation of Vector Queries Using Wavelets

Hence, vector queries can be computed in the
wavelet-transformed space as
Algorithm
Off-line transformation of data vector (or data
distribution function, i.e., D, to be exact)
O (IldlogdN) for sparse data, O (I) Nd for
dense data
Real-time transformation of the query vector at
the query evaluation time
O (ldlogdN)
Sum-up the products of the corresponding elements
of data and query vectors
Retrieving elements of data vector O (Nd)

19
Fast Evaluation of Vector Queries Using Wavelets

Main intuitions
query vector can be transformed quickly because
most of the coefficients are known in advance
Transformed query vector has a large number of
negligible (e.g., zero) values (independent on
how well data can be approximated by wavelet)
Example Haar filter COUNT function on R5,12
on the domain of integers from 0 to 15

GH3a
H4a
At each step, you know the zeros
20
The Lazy Wavelet Transform
All summary coefficients computed in CONSTANT
time!
The only interesting activity happens on the
boundary.
Summary coefficient array looks almost exactly
like original array.
21
The Lazy Wavelet Transform
All detail coefficients computed in CONSTANT time!
The only interesting activity happens on the
boundary.
All but 2 detail coefficients at each level are
equal to zero!
22
Fast Evaluation of Vector Queries Using Wavelets

Technical Requirements
Wavelets must satisfy a moment condition
Wavelets should have small support (i.e., the
shorter the filter, the better)
Supports any Polynomial Range-Sum up to a degree
determined by the choice of wavelets
E.g., Haar can only support degree 0 (e.g.,
COUNT), while db4 can support up to degree 1
(e.g., SUM), and db6 for degree 2 (e.g.,
VARIANCE)
Standard DWT O (N)
Our lazy wavelet for transforming query function
O (l log N) where l is the length of the filter

23
Exact Evaluation of Vector Queries
Query SUM(salary) when (25 lt age lt 40) (55k
lt salary lt 150k)
of Wavelet Coefficients 1250
24
Approximate Evaluation of Vector Queries
25
Progressive Evaluation of Vector Queries
26
(No Transcript)
27
Experimental Setup

PETROL Data Set
Petroleum sales volume
56504 tuples
Five dimensions
ltlocation, product, year, month, volumegt
Sparseness 0.16
Traditional data approximation works well

GPS Data Set
Sensor readings from GPS ground stations in CA
3358 tuples
Four dimensions
ltlat, long, t, velocitygt
Velocity of upward movement of the station
Sparseness 0.01
Data approximation works poorly

250 range queries generated
Randomly from all possible ranges with the
uniform dist.
Ranges which select fewer than 100 tuples were
discarded
Median Relative error,

28
Performance Results
PETROL

Compact Data Cube (as a representative of data
approximation techniques) under 10 error after
using less than 10 of wavelet coefficients
(wavelet coefficients sorted in the order of
energy)

29
Performance Results
GPS

CDC needs 5 times as many coefficients as there
were tuples in the original table before
providing a median relative error of 10 (because
data cannot be compressed well)

30
(No Transcript)
31
Conclusion

A novel MOLAP pre-aggregation strategy
Supports conventional aggregates COUNT, SUM and
beyond COVARIANCE
First pre-aggregation technique that does not
require measures be specified a priori
Measures treated as functions of the attributes
at the query time
Provides a data independent progressive and
approximate query answering technique
With provably poly-logarithmic worst-case query
and update costs
And storage cost comparable or better than other
pre-aggregation methods

32
Future

PROPOLYNE future plans
Use synopsis information about query workloads or
data distribution for better sorting of
coefficients
Improve random access behavior of PROPOLYNE to
data by clustering related coeffiecents
More complex queries OLAP drill-down, general
relational algebra queries,
Multidimensional mining research directions
Efficient ways of finding trends (e.g.,
correlation between dimensions/attributes)
Efficient ways of finding surprises/outliers
Mining sequence data sets (e.g., genome
databases)