Title: Mining Multidimensional Databases
1 Mining Multidimensional Databases
Cyrus Shahabi University of Southern
California Dept. of Computer Science Los Angeles,
CA 90089-0781 shahabi_at_usc.edu http//infolab.usc.e
du
2Outline
- Distributed Information Management Laboratory
- Multidimensional Data Sets Applications
(Examples) - Focus Application On-Line Analytical Processing
(OLAP) - Traditional Solution
- PROPOLYNE Progressive Evaluation of Polynomial
Range-Sum Query
3- Location PHE-306 (and 108)
- URL http//infolab.usc.edu
- Research Staff 2
- Admin Staff 1
- Ph.D. Students 9
- M.S. Students 8
- Undergraduates 2
- Ph.D. Alumni 3
- M.S. Alumni Many!
- Sponsors
4Multidimensional Data Sets Applications
(Examples)
- Similarity search, clustering,
- Stock prices ? time-series
- Images ? color histograms
- Shapes ? angle sequences
- Web navigation ? feature vectors
- Spatial temporal queries, mining queries,
- GeoSpatial data ? latitude, longitude, altitude
- Remote sensory data ? ltlat, long, alt, time,
temperaturegt - Immersidata ? ltx, y, z, t, vgt
5Stock Prices
S1
Sn
6More Similarity Search Clustering
C
Shapes ICDE99 ICME00
7Spatial Temporal Data
Complex Queries
ACM-GIS01, VLDB01
- Data types
- A point ltlatitude, longitude, altitudegt or ltx,
y, zgt - A line-segment ltx1, y1, x2, y2gt
- A line sequence of line-segments
- A region A closed set of lines
- Moving point ltx, y, tgt (e.g., car, train, )
- Changing region ltregion, value, tgt (e.g.,
changing temperature of a county)
- Queries
- Rivers ltintersectgt Countries
- Hospitals ltingt Cities
- Taxi ltwithingt 5km of Home
- ltin the nextgt 10 min
- Experiments ltoverlapgt BrainR
Visual99
8Immersidata and Mining Queries
CIKM01, UACHI01
9Immersidata and Mining Queries
A dynamic sign, e.g., ASL colors
Subject-1
10Focus Application On-Line Analytical Processing
(OLAP)
Market-Relation
- Multidimensional data sets
- Dimension attributes (e.g., Store, Product, Data)
- Measure attributes (e.g., Sale, Price)
- Range-sum queries
- Average sale of shoes in CA in 2001
- Number of jackets sold in Seattle in Sep. 2001
- Tougher queries
- Covariance of sale and price of jackets in CA in
2001 (correlation) - Variance of price of jackets in 2001 in Seattle
Store Location
Date
Sale
Product
Price
LA Shoes Jan. 01 21,500 85.99
NY Jacket June 01 28,700 45.99
. . .
. . .
. . .
. . .
. . .
Too Slow!
11Traditional Solution Pre-computation Prefix-sum
Agrawal et. al 1997
Salary
Age
Age Salary
100k
120k
150k
40k
55k
65k
- 50k
- 55k
- 58k
- 100k
- 130k
- 57 120k
0
25
40
Age
50
Salary
60
- Disadvantages
- Measure attribute should be pre-selected
- Aggregation function should be pre-selected
- Works only for limited of aggregation
functions - Updates are expensive (need re-computation)
80
Result I II III IV
12PROPOLYNE Progressive Evaluationof Polynomial
Range-Sum Query (w/ Rolfe Schmidt)
- Overview of PROPOLYNE
- Features of PROPOLYNE
- Polynomial Range-Sum Queries as Vector Queries
- Naive Evaluation of Vector Queries Using Wavelets
- Fast Evaluation of Vector Queries Using Wavelets
- Progressive/Approximate Evaluation of Vector
Queries Using Wavelets - Related Work
- Performance Results
- Conclusion
13Overview of PROPOLYNE
- Define range-sum query as vector product of query
vector and data vector - Offline Multidimensional wavelet transform of
data - At the query time lazy wavelet transform of
query vector (very fast) - Dot product of query and data vectors in the
transformed domain ? exact result in O(2 log N)d - Choose high-energy query coefficients only ? fast
approximate result (90 accuracy by retrieving lt
10 of data) - Choose query coefficients in order of energy ?
progressive result
14PROPOLYNE Features
- All attributes can be treated as either
dimension or measure attributes - Function can be any polynomial on any
combination of attributes, i.e., not only SUM,
AVERAGE and COUNT but also COVARIANCE, VARIANCE
and SUMSQUARE - Independent from how well the data set can be
compressed/approximated by wavelet - Because We show range-sum queries can always
be approximated well by wavelets (not always HAAR
though!) - Low update cost O(logd N)
- Can be used for exact, approximate and
progressive range-sum query evaluation
15Polynomial Range-Sum Queries
- Polynomial range-sum queries Q(R,f,I)
- I is a finite instance of schema F
- R SubSetOf Dom(F), is the range
- f Dom(F) ? R is a polynomial of degree d
16Polynomial Range-Sum Queries as Vector Queries
- The data frequency distribution of I is the
function DI Dom(F) ? Z that maps a
point x to the number of times it occurs in I - To emphasize the fact that a query is an
operator on the data frequency distribution, we
write - Example D(25,50)D(28,55)D(57,120)1 and
D(x)0 otherwise. -
17Overview of Wavelets
H operator computes a local average of array a
at every other point to produce an array of
summary coefficients Ha Example (Haar)
h1/2,1/2
G operator measures how much values in the array
a vary inside each of the summarized blocks to
compute an array of detail coefficients
Ga Example (Haar) g1/2,-12
aka wavelet coefficients of a
18Naive Evaluation of Vector Queries Using Wavelets
- Hence, vector queries can be computed in the
wavelet-transformed space as - Algorithm
- Off-line transformation of data vector (or data
distribution function, i.e., D, to be exact) - O (IldlogdN) for sparse data, O (I) Nd for
dense data - Real-time transformation of the query vector at
the query evaluation time - O (ldlogdN)
- Sum-up the products of the corresponding elements
of data and query vectors - Retrieving elements of data vector O (Nd)
19Fast Evaluation of Vector Queries Using Wavelets
- Main intuitions
- query vector can be transformed quickly because
most of the coefficients are known in advance - Transformed query vector has a large number of
negligible (e.g., zero) values (independent on
how well data can be approximated by wavelet) - Example Haar filter COUNT function on R5,12
on the domain of integers from 0 to 15
GH3a
H4a
At each step, you know the zeros
20The Lazy Wavelet Transform
All summary coefficients computed in CONSTANT
time!
The only interesting activity happens on the
boundary.
Summary coefficient array looks almost exactly
like original array.
21The Lazy Wavelet Transform
All detail coefficients computed in CONSTANT time!
The only interesting activity happens on the
boundary.
All but 2 detail coefficients at each level are
equal to zero!
22Fast Evaluation of Vector Queries Using Wavelets
- Technical Requirements
- Wavelets must satisfy a moment condition
- Wavelets should have small support (i.e., the
shorter the filter, the better) - Supports any Polynomial Range-Sum up to a degree
determined by the choice of wavelets - E.g., Haar can only support degree 0 (e.g.,
COUNT), while db4 can support up to degree 1
(e.g., SUM), and db6 for degree 2 (e.g.,
VARIANCE) - Standard DWT O (N)
- Our lazy wavelet for transforming query function
O (l log N) where l is the length of the filter
23Exact Evaluation of Vector Queries
Query SUM(salary) when (25 lt age lt 40) (55k
lt salary lt 150k)
of Wavelet Coefficients 1250
24Approximate Evaluation of Vector Queries
25Progressive Evaluation of Vector Queries
26(No Transcript)
27Experimental Setup
- PETROL Data Set
- Petroleum sales volume
- 56504 tuples
- Five dimensions
- ltlocation, product, year, month, volumegt
- Sparseness 0.16
- Traditional data approximation works well
- GPS Data Set
- Sensor readings from GPS ground stations in CA
- 3358 tuples
- Four dimensions
- ltlat, long, t, velocitygt
- Velocity of upward movement of the station
- Sparseness 0.01
- Data approximation works poorly
- 250 range queries generated
- Randomly from all possible ranges with the
uniform dist. - Ranges which select fewer than 100 tuples were
discarded - Median Relative error,
28Performance Results
PETROL
- Compact Data Cube (as a representative of data
approximation techniques) under 10 error after
using less than 10 of wavelet coefficients
(wavelet coefficients sorted in the order of
energy)
29Performance Results
GPS
- CDC needs 5 times as many coefficients as there
were tuples in the original table before
providing a median relative error of 10 (because
data cannot be compressed well)
30(No Transcript)
31Conclusion
- A novel MOLAP pre-aggregation strategy
- Supports conventional aggregates COUNT, SUM and
beyond COVARIANCE - First pre-aggregation technique that does not
require measures be specified a priori - Measures treated as functions of the attributes
at the query time - Provides a data independent progressive and
approximate query answering technique - With provably poly-logarithmic worst-case query
and update costs - And storage cost comparable or better than other
pre-aggregation methods
32Future
- PROPOLYNE future plans
- Use synopsis information about query workloads or
data distribution for better sorting of
coefficients - Improve random access behavior of PROPOLYNE to
data by clustering related coeffiecents - More complex queries OLAP drill-down, general
relational algebra queries, - Multidimensional mining research directions
- Efficient ways of finding trends (e.g.,
correlation between dimensions/attributes) - Efficient ways of finding surprises/outliers
- Mining sequence data sets (e.g., genome
databases)