Mining Multidimensional Databases - PowerPoint PPT Presentation

About This Presentation
Title:

Mining Multidimensional Databases

Description:

Dimension attributes (e.g., Store, Product, Data) Measure attributes (e.g., Sale, Price) ... Store. Location. Product. Date. Sale. LA Shoes Jan. 01 $21,500 $85.99 ... – PowerPoint PPT presentation

Number of Views:95
Avg rating:3.0/5.0
Slides: 32
Provided by: cyru8
Learn more at: http://web.cs.ucla.edu
Category:

less

Transcript and Presenter's Notes

Title: Mining Multidimensional Databases


1
Mining Multidimensional Databases
Cyrus Shahabi University of Southern
California Dept. of Computer Science Los Angeles,
CA 90089-0781 shahabi_at_usc.edu http//infolab.usc.e
du
2
Outline
  • Distributed Information Management Laboratory
  • Multidimensional Data Sets Applications
    (Examples)
  • Focus Application On-Line Analytical Processing
    (OLAP)
  • Traditional Solution
  • PROPOLYNE Progressive Evaluation of Polynomial
    Range-Sum Query

3
  • Location PHE-306 (and 108)
  • URL http//infolab.usc.edu
  • Research Staff 2
  • Admin Staff 1
  • Ph.D. Students 9
  • M.S. Students 8
  • Undergraduates 2
  • Ph.D. Alumni 3
  • M.S. Alumni Many!
  • Sponsors

4
Multidimensional Data Sets Applications
(Examples)
  • Similarity search, clustering,
  • Stock prices ? time-series
  • Images ? color histograms
  • Shapes ? angle sequences
  • Web navigation ? feature vectors
  • Spatial temporal queries, mining queries,
  • GeoSpatial data ? latitude, longitude, altitude
  • Remote sensory data ? ltlat, long, alt, time,
    temperaturegt
  • Immersidata ? ltx, y, z, t, vgt

5
Stock Prices
S1
Sn
6
More Similarity Search Clustering
C
Shapes ICDE99 ICME00
7
Spatial Temporal Data
Complex Queries
ACM-GIS01, VLDB01
  • Data types
  • A point ltlatitude, longitude, altitudegt or ltx,
    y, zgt
  • A line-segment ltx1, y1, x2, y2gt
  • A line sequence of line-segments
  • A region A closed set of lines
  • Moving point ltx, y, tgt (e.g., car, train, )
  • Changing region ltregion, value, tgt (e.g.,
    changing temperature of a county)
  • Queries
  • Rivers ltintersectgt Countries
  • Hospitals ltingt Cities
  • Taxi ltwithingt 5km of Home
  • ltin the nextgt 10 min
  • Experiments ltoverlapgt BrainR

Visual99
8
Immersidata and Mining Queries
CIKM01, UACHI01
9
Immersidata and Mining Queries
A dynamic sign, e.g., ASL colors
Subject-1
10
Focus Application On-Line Analytical Processing
(OLAP)
Market-Relation
  • Multidimensional data sets
  • Dimension attributes (e.g., Store, Product, Data)
  • Measure attributes (e.g., Sale, Price)
  • Range-sum queries
  • Average sale of shoes in CA in 2001
  • Number of jackets sold in Seattle in Sep. 2001
  • Tougher queries
  • Covariance of sale and price of jackets in CA in
    2001 (correlation)
  • Variance of price of jackets in 2001 in Seattle

Store Location
Date
Sale
Product
Price
LA Shoes Jan. 01 21,500 85.99
NY Jacket June 01 28,700 45.99
. . .
. . .
. . .
. . .
. . .
Too Slow!
11
Traditional Solution Pre-computation Prefix-sum
Agrawal et. al 1997
Salary
Age
Age Salary
100k
120k
150k
40k
55k
65k
  • 50k
  • 55k
  • 58k
  • 100k
  • 130k
  • 57 120k

0
25
40
Age
50
Salary
60
  • Disadvantages
  • Measure attribute should be pre-selected
  • Aggregation function should be pre-selected
  • Works only for limited of aggregation
    functions
  • Updates are expensive (need re-computation)

80
Result I II III IV
12
PROPOLYNE Progressive Evaluationof Polynomial
Range-Sum Query (w/ Rolfe Schmidt)
  • Overview of PROPOLYNE
  • Features of PROPOLYNE
  • Polynomial Range-Sum Queries as Vector Queries
  • Naive Evaluation of Vector Queries Using Wavelets
  • Fast Evaluation of Vector Queries Using Wavelets
  • Progressive/Approximate Evaluation of Vector
    Queries Using Wavelets
  • Related Work
  • Performance Results
  • Conclusion

13
Overview of PROPOLYNE
  • Define range-sum query as vector product of query
    vector and data vector
  • Offline Multidimensional wavelet transform of
    data
  • At the query time lazy wavelet transform of
    query vector (very fast)
  • Dot product of query and data vectors in the
    transformed domain ? exact result in O(2 log N)d
  • Choose high-energy query coefficients only ? fast
    approximate result (90 accuracy by retrieving lt
    10 of data)
  • Choose query coefficients in order of energy ?
    progressive result

14
PROPOLYNE Features
  • All attributes can be treated as either
    dimension or measure attributes
  • Function can be any polynomial on any
    combination of attributes, i.e., not only SUM,
    AVERAGE and COUNT but also COVARIANCE, VARIANCE
    and SUMSQUARE
  • Independent from how well the data set can be
    compressed/approximated by wavelet
  • Because We show range-sum queries can always
    be approximated well by wavelets (not always HAAR
    though!)
  • Low update cost O(logd N)
  • Can be used for exact, approximate and
    progressive range-sum query evaluation

15
Polynomial Range-Sum Queries
  • Polynomial range-sum queries Q(R,f,I)
  • I is a finite instance of schema F
  • R SubSetOf Dom(F), is the range
  • f Dom(F) ? R is a polynomial of degree d

16
Polynomial Range-Sum Queries as Vector Queries
  • The data frequency distribution of I is the
    function DI Dom(F) ? Z that maps a
    point x to the number of times it occurs in I
  • To emphasize the fact that a query is an
    operator on the data frequency distribution, we
    write
  • Example D(25,50)D(28,55)D(57,120)1 and
    D(x)0 otherwise.

17
Overview of Wavelets
H operator computes a local average of array a
at every other point to produce an array of
summary coefficients Ha Example (Haar)
h1/2,1/2
G operator measures how much values in the array
a vary inside each of the summarized blocks to
compute an array of detail coefficients
Ga Example (Haar) g1/2,-12
aka wavelet coefficients of a
18
Naive Evaluation of Vector Queries Using Wavelets
  • Hence, vector queries can be computed in the
    wavelet-transformed space as
  • Algorithm
  • Off-line transformation of data vector (or data
    distribution function, i.e., D, to be exact)
  • O (IldlogdN) for sparse data, O (I) Nd for
    dense data
  • Real-time transformation of the query vector at
    the query evaluation time
  • O (ldlogdN)
  • Sum-up the products of the corresponding elements
    of data and query vectors
  • Retrieving elements of data vector O (Nd)

19
Fast Evaluation of Vector Queries Using Wavelets
  • Main intuitions
  • query vector can be transformed quickly because
    most of the coefficients are known in advance
  • Transformed query vector has a large number of
    negligible (e.g., zero) values (independent on
    how well data can be approximated by wavelet)
  • Example Haar filter COUNT function on R5,12
    on the domain of integers from 0 to 15

GH3a
H4a
At each step, you know the zeros
20
The Lazy Wavelet Transform
All summary coefficients computed in CONSTANT
time!
The only interesting activity happens on the
boundary.
Summary coefficient array looks almost exactly
like original array.
21
The Lazy Wavelet Transform
All detail coefficients computed in CONSTANT time!
The only interesting activity happens on the
boundary.
All but 2 detail coefficients at each level are
equal to zero!
22
Fast Evaluation of Vector Queries Using Wavelets
  • Technical Requirements
  • Wavelets must satisfy a moment condition
  • Wavelets should have small support (i.e., the
    shorter the filter, the better)
  • Supports any Polynomial Range-Sum up to a degree
    determined by the choice of wavelets
  • E.g., Haar can only support degree 0 (e.g.,
    COUNT), while db4 can support up to degree 1
    (e.g., SUM), and db6 for degree 2 (e.g.,
    VARIANCE)
  • Standard DWT O (N)
  • Our lazy wavelet for transforming query function
    O (l log N) where l is the length of the filter

23
Exact Evaluation of Vector Queries
Query SUM(salary) when (25 lt age lt 40) (55k
lt salary lt 150k)
of Wavelet Coefficients 1250
24
Approximate Evaluation of Vector Queries
25
Progressive Evaluation of Vector Queries
26
(No Transcript)
27
Experimental Setup
  • PETROL Data Set
  • Petroleum sales volume
  • 56504 tuples
  • Five dimensions
  • ltlocation, product, year, month, volumegt
  • Sparseness 0.16
  • Traditional data approximation works well
  • GPS Data Set
  • Sensor readings from GPS ground stations in CA
  • 3358 tuples
  • Four dimensions
  • ltlat, long, t, velocitygt
  • Velocity of upward movement of the station
  • Sparseness 0.01
  • Data approximation works poorly
  • 250 range queries generated
  • Randomly from all possible ranges with the
    uniform dist.
  • Ranges which select fewer than 100 tuples were
    discarded
  • Median Relative error,

28
Performance Results
PETROL
  • Compact Data Cube (as a representative of data
    approximation techniques) under 10 error after
    using less than 10 of wavelet coefficients
    (wavelet coefficients sorted in the order of
    energy)

29
Performance Results
GPS
  • CDC needs 5 times as many coefficients as there
    were tuples in the original table before
    providing a median relative error of 10 (because
    data cannot be compressed well)

30
(No Transcript)
31
Conclusion
  • A novel MOLAP pre-aggregation strategy
  • Supports conventional aggregates COUNT, SUM and
    beyond COVARIANCE
  • First pre-aggregation technique that does not
    require measures be specified a priori
  • Measures treated as functions of the attributes
    at the query time
  • Provides a data independent progressive and
    approximate query answering technique
  • With provably poly-logarithmic worst-case query
    and update costs
  • And storage cost comparable or better than other
    pre-aggregation methods

32
Future
  • PROPOLYNE future plans
  • Use synopsis information about query workloads or
    data distribution for better sorting of
    coefficients
  • Improve random access behavior of PROPOLYNE to
    data by clustering related coeffiecents
  • More complex queries OLAP drill-down, general
    relational algebra queries,
  • Multidimensional mining research directions
  • Efficient ways of finding trends (e.g.,
    correlation between dimensions/attributes)
  • Efficient ways of finding surprises/outliers
  • Mining sequence data sets (e.g., genome
    databases)
Write a Comment
User Comments (0)
About PowerShow.com