Title: Mining Unusual Patterns in Data Streams in Multi-Dimensional Space
1Mining Unusual Patterns in Data Streams in
Multi-Dimensional Space
- Jiawei Han
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- www.cs.uiuc.edu/hanj
2Outline
- Characteristics of data streams
- Mining unusual patterns in data streams
- Multi-dimensional regression analysis of data streams
- Stream cubing and stream OLAP methods
- Mining other kinds of patterns in data streams
- Research problems
3Data Streams
- Data Streams
- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
- Huge volumes of continuous data, possibly infinite
- Fast changing; requires fast, real-time response
- Data streams nicely capture today's data processing needs
- Random access is expensive: single linear scan algorithms (can only have one look)
- Store only a summary of the data seen so far
- Most stream data are at a fairly low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
4Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering and industrial processes: power supply and manufacturing
- Sensor, monitoring and surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even if saved, random access is too expensive)
5Challenges of Stream Data Mining
- Multiple, continuous, rapid, time-varying, ordered streams
- Main memory computation
- Mining queries are either continuous or ad-hoc
- Mining queries are often complex
- Involving multiple streams, large amounts of data, and history
- Finding patterns, models, anomalies, differences, ...
- Mining dynamics (changes, trends and evolutions) of data streams
- Multi-level/multi-dimensional processing and data mining
- Most stream data are at a fairly low level or multi-dimensional in nature
6Stream Data Mining Tasks
- Multi-dimensional (on-line) analysis of streams
- Clustering data streams
- Classification of data streams
- Mining frequent patterns in data streams
- Mining sequential patterns in data streams
- Mining partial periodicity in data streams
- Mining notable gradients in data streams
- Mining outliers and unusual patterns in data
streams
7Multi-Dimensional Stream Analysis Examples
- Analysis of Web click streams
- Raw data at low levels: seconds, web page addresses, user IP addresses, ...
- Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
- E.g., average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.
- Analysis of power consumption streams
- Raw data: power consumption flow for every household, every minute
- Patterns one may find: average hourly power consumption for manufacturing companies in Chicago in the last 2 hours today is up 30% compared with the same day a week ago
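The 15-minute versus 24-hour comparison can be sketched with two time-based windows over per-minute counts. This is only an illustration, not a method from the slides: `WindowAverage` and `percent_higher` are hypothetical names, and the click counts are made up.

```python
from collections import deque

class WindowAverage:
    """Average of the values whose timestamp lies within the last `span` minutes."""

    def __init__(self, span):
        self.span = span
        self.items = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, ts, value):
        self.items.append((ts, value))
        self.total += value
        # Evict readings that have fallen out of the window.
        while self.items and self.items[0][0] <= ts - self.span:
            _, old = self.items.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.items) if self.items else 0.0

def percent_higher(recent, baseline):
    """How much higher `recent` is than `baseline`, in percent."""
    return 100.0 * (recent - baseline) / baseline

short = WindowAverage(span=15)        # last 15 minutes
long_ = WindowAverage(span=24 * 60)   # last 24 hours
for minute in range(24 * 60):
    clicks = 140 if minute >= 24 * 60 - 15 else 100   # made-up traffic
    short.add(minute, clicks)
    long_.add(minute, clicks)

change = percent_higher(short.average(), long_.average())
```

Note that the sketch keeps one count per minute rather than raw click records; a full 24-hour window at that granularity is still 1440 values per cell, which is what motivates the compression discussed next.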
8A Key Step: Stream Data Reduction
- Challenges of OLAPing stream data
- Raw data cannot be stored
- Simple aggregates are not powerful enough
- History shape and patterns at different levels are desirable: multi-dimensional regression analysis
- Proposal
- A scalable multi-dimensional stream data cube that can aggregate regression models of stream data efficiently without accessing the raw data
- Stream data compression
- Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
9Regression Cube for Time-Series
- Initially, one time series per base cell
- Too costly to store all these time series
- Too costly to compute regression in multi-dimensional space
- Regression cube
- Base cube: only store the regression parameters of base cells (e.g., 2 points vs. 1000 points)
- All the upper-level cuboids can be computed precisely for linear regression on both standard dimensions and time dimensions
- For quadratic regression, we need 5 points
- In general, we need
- where k = 2 for quadratic.
10Basics of General Linear Regression
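- n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed
- For sample i, a vector of k user-defined predictors $u_i$
- The linear regression model
- $E(y_i \mid u_i) = \eta^T u_i = \eta_0 u_{i0} + \eta_1 u_{i1} + \cdots + \eta_{k-1} u_{i,k-1}$
- where $\eta = (\eta_0, \ldots, \eta_{k-1})^T$ is a $k \times 1$ vector of regression parameters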
11Theory of General Linear Regression
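- Collect $u_1, \ldots, u_n$ into the $n \times k$ model matrix $U$, whose i-th row is $u_i^T$
- The ordinary least squares (OLS) estimate $\hat{\eta}$ of $\eta$ is the argument that minimizes the residual sum of squares function $RSS(\eta) = (y - U\eta)^T (y - U\eta)$
- Main theorem to determine the OLS regression parameters: $\hat{\eta} = (U^T U)^{-1} U^T y$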
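The OLS estimate can be computed from the normal equations $(U^T U)\,\hat{\eta} = U^T y$. The block below is a minimal pure-Python sketch under that standard formulation; `solve` and `ols` are hypothetical helper names, and the data is made up so that the exact answer is known.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(U, y):
    """Return eta_hat solving (U^T U) eta = U^T y for an n-by-k model matrix U."""
    k = len(U[0])
    utu = [[sum(row[i] * row[j] for row in U) for j in range(k)] for i in range(k)]
    uty = [sum(row[i] * yi for row, yi in zip(U, y)) for i in range(k)]
    return solve(utu, uty)

# Predictors (1, t) give simple linear regression on a time series.
U = [[1.0, float(t)] for t in range(5)]
y = [2.0 + 3.0 * t for t in range(5)]   # made-up data on an exact line
eta_hat = ols(U, y)
```

Only the k-by-k matrix $U^T U$ and the k-vector $U^T y$ enter the solve, which is why a cell can be summarized without keeping its n tuples.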
12Linearly Compressed Representation (LCR)
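- Stream data compression for multi-dimensional regression analysis
- Define, for $i, j = 0, \ldots, k-1$: $\theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj}$
- The linearly compressed representation (LCR) of one cell: $\{\hat{\eta}_i \mid i = 0, \ldots, k-1\} \cup \{\theta_{ij} \mid i, j = 0, \ldots, k-1,\ i \le j\}$
- Size of the LCR of one cell: $(k^2 + 3k)/2$
- Quadratic in k, independent of the number of tuples n in one cell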
13Matrix Form of LCR
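- LCR consists of $\hat{\eta}$ and $\Theta$, where $\hat{\eta} = (\hat{\eta}_0, \ldots, \hat{\eta}_{k-1})^T$ and $\Theta = U^T U$ is the $k \times k$ matrix with entries $\theta_{ij} = \sum_{h=1}^{n} u_{hi} u_{hj}$
- $\hat{\eta}$ provides the OLS regression parameters essential for regression analysis
- $\Theta$ is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment
- $\Theta$ is symmetric, so LCR only stores the upper triangle of $\Theta$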
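To see why the compressed size depends only on k, the sketch below builds the upper triangle of $U^T U$ for cells of different sizes. It assumes the auxiliary matrix is the Gram matrix $U^T U$ of the model matrix; `theta_upper`, `lcr_size` and `quadratic_rows` are hypothetical names.

```python
def theta_upper(U):
    """Upper triangle of U^T U as a dict {(i, j): value} with i <= j."""
    k = len(U[0])
    return {(i, j): sum(row[i] * row[j] for row in U)
            for i in range(k) for j in range(i, k)}

def lcr_size(k):
    """k regression parameters plus the k(k+1)/2 upper-triangle entries."""
    return k + k * (k + 1) // 2

def quadratic_rows(n):
    """Model matrix with predictors (1, t, t^2) for t = 0..n-1."""
    return [[1.0, float(t), float(t * t)] for t in range(n)]

small = theta_upper(quadratic_rows(10))     # cell with 10 tuples
large = theta_upper(quadratic_rows(1000))   # cell with 1000 tuples
```

Both `small` and `large` hold six entries for k = 3, so together with the three regression parameters each cell costs nine stored numbers however many tuples it summarizes.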
14Aggregation in Standard Dimensions
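- Given the LCRs of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension?
- $LCR_i = (\hat{\eta}^{(i)}, \Theta^{(i)})$, $i = 1, \ldots, m$, for the m base cells
- $LCR_a = (\hat{\eta}_a, \Theta_a)$ for the aggregated cell
- The lossless aggregation formula: $\hat{\eta}_a = \sum_{i=1}^{m} \hat{\eta}^{(i)}$, $\Theta_a = \Theta^{(i)}$ for any $i = 1, \ldots, m$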
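A minimal sketch of aggregation in a standard dimension, assuming the cells share the same time points and the aggregated measure is the sum of the cells' measures. Under that assumption the regression parameters simply add and $U^T U$ is unchanged. The model is simple linear regression $y \approx \eta_0 + \eta_1 t$; `fit` and `aggregate_standard` are hypothetical names and the prices are made up.

```python
def fit(ts, ys):
    """Compress one cell: (eta_hat, upper triangle of U^T U) for y ~ eta0 + eta1 * t."""
    a, b, d = float(len(ts)), float(sum(ts)), float(sum(t * t for t in ts))
    r0, r1 = sum(ys), sum(t * y for t, y in zip(ts, ys))   # entries of U^T y
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det), (a, b, d)

def aggregate_standard(lcrs):
    """Merge cells that share the same time points and whose measures add up."""
    theta = lcrs[0][1]
    if any(th != theta for _, th in lcrs):
        raise ValueError("cells must share the same predictor values")
    eta = tuple(sum(parts) for parts in zip(*(e for e, _ in lcrs)))
    return eta, theta

ts = list(range(10))
company_a = [20.0 + 0.5 * t + (t % 3) for t in ts]   # made-up prices
company_b = [35.0 - 0.2 * t + (t % 2) for t in ts]

merged = aggregate_standard([fit(ts, company_a), fit(ts, company_b)])
direct = fit(ts, [p + q for p, q in zip(company_a, company_b)])
```

`merged` is computed from the two compressed cells only, while `direct` refits the summed raw series; the two agree up to floating-point rounding.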
15Stock Price Example: Aggregation in Standard Dimensions
- Simple linear regression on time series data
- Cells of two companies
- After aggregation
16Aggregation in Regression Dimensions
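- Given the LCRs of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension?
- $LCR_i = (\hat{\eta}^{(i)}, \Theta^{(i)})$, $i = 1, \ldots, m$, for the m base cells
- $LCR_a = (\hat{\eta}_a, \Theta_a)$ for the aggregated cell
- The lossless aggregation formula: $\Theta_a = \sum_{i=1}^{m} \Theta^{(i)}$, $\hat{\eta}_a = \Theta_a^{-1} \sum_{i=1}^{m} \Theta^{(i)} \hat{\eta}^{(i)}$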
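A minimal sketch of aggregation in a regression (here: time) dimension, assuming the aggregated cell is the union of the tuples of the base cells. Since $U^T U\,\hat{\eta} = U^T y$ holds in each cell, summing the $U^T U$ matrices and their products with $\hat{\eta}$ recovers the normal equations of the merged cell. The model is $y \approx \eta_0 + \eta_1 t$; `fit` and `aggregate_time` are hypothetical names and the prices are made up.

```python
def fit(ts, ys):
    """Compress one cell: (eta_hat, upper triangle of U^T U) for y ~ eta0 + eta1 * t."""
    a, b, d = float(len(ts)), float(sum(ts)), float(sum(t * t for t in ts))
    r0, r1 = sum(ys), sum(t * y for t, y in zip(ts, ys))   # entries of U^T y
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det), (a, b, d)

def aggregate_time(lcrs):
    """Merge cells covering disjoint sets of time points, without raw data."""
    a = b = d = 0.0      # upper triangle of the summed U^T U
    r0 = r1 = 0.0        # sum over cells of (U^T U) * eta_hat, i.e. U^T y
    for (e0, e1), (t00, t01, t11) in lcrs:
        a, b, d = a + t00, b + t01, d + t11
        r0 += t00 * e0 + t01 * e1
        r1 += t01 * e0 + t11 * e1
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det), (a, b, d)

prices = [20.0 + 0.3 * t + (t % 4) for t in range(20)]   # made-up prices
early = fit(list(range(0, 10)), prices[:10])     # first time interval
late = fit(list(range(10, 20)), prices[10:])     # adjacent time interval

merged = aggregate_time([early, late])
direct = fit(list(range(20)), prices)
```

As in the standard-dimension case, `merged` uses only the compressed cells and matches the direct fit on all 20 points up to rounding.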
17Stock Price Example: Aggregation in Time Dimension
- Cells of two adjacent time intervals
- After aggregation
18Feasibility of Stream Regression Analysis
- Efficient storage and scalable (independent of the number of tuples in data cells)
- Lossless aggregation without accessing the raw data
- Fast aggregation: computationally efficient
- Regression models of data cells at all levels
- General results: cover a large and the most popular class of regression
- Including quadratic, polynomial, and nonlinear models
19A Stream Cube Architecture
- A tilted time frame
- Different time granularities
- second, minute, quarter, hour, day, week, ...
- Critical layers
- Minimum interest layer (m-layer)
- Observation layer (o-layer)
- User watches at the o-layer and occasionally needs to drill down to the m-layer
- Partial materialization of stream cubes
- Full materialization: too space- and time-consuming
- No materialization: slow response at query time
- Partial materialization: what do we mean by "partial"?
20A Tilted Time-Frame Model
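[Figure: tilted time frames drawn on a Time axis ending at Now. Annotations: "Up to 7 days", "Up to a year", and a logarithmic (exponential) scale with slots labeled 1t, 2t, 4t, 8t, 16t.]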
21Benefits of Tilted Time-Frame Model
- Each cell stores the measures according to the tilted time frame
- Limited memory space: impossible to store the history in full scale
- Emphasis more on recent data
- Most applications emphasize recent data (sliding window)
- Natural partition on different time granularities
- Putting different weights on remote data
- Useful even for uniform weight
- Tilted time frame forms a new time dimension
- for mining changes and evolutions
- Essential for mining unusual patterns or outliers
- Finding those with dramatic changes
- E.g., exceptional stocks: not following the trends
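One way to picture a logarithmic tilted time frame is as a stack of levels whose slots double in width. The class below is a simplified sketch for an additive measure, not the exact structure used in the talk: `LogTiltedFrame`, its `cap` parameter, and the merge rule are assumptions made for illustration.

```python
class LogTiltedFrame:
    """Logarithmic tilted time frame for an additive measure (here: a sum).

    Level i holds slots that each cover 2**i base time units. A level keeps
    at most `cap` slots; on overflow its two oldest slots are merged into
    one slot of the next coarser level.
    """

    def __init__(self, cap=2):
        self.cap = cap
        self.levels = [[]]   # levels[i]: list of sums, oldest first

    def add(self, value):
        """Record the measure of the newest base time unit."""
        self.levels[0].append(value)
        i = 0
        while len(self.levels[i]) > self.cap:
            merged = self.levels[i].pop(0) + self.levels[i].pop(0)
            if i + 1 == len(self.levels):
                self.levels.append([])
            self.levels[i + 1].append(merged)
            i += 1

    def slots(self):
        """(width in base units, sum) pairs, most recent first."""
        out = []
        for i, level in enumerate(self.levels):
            out.extend((2 ** i, s) for s in reversed(level))
        return out

frame = LogTiltedFrame(cap=2)
for _ in range(16):
    frame.add(1)          # one event per base time unit
recent_first = frame.slots()
```

After 16 updates the frame holds slots of width 1, 1, 2, 4 and 8: recent history stays fine-grained, older history is coarse, and the number of slots grows only logarithmically with the length of the stream.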
22Two Critical Layers in the Stream Cube
- o-layer (observation): (*, theme, quarter)
- m-layer (minimal interest): (user-group, URL-group, minute)
- (primitive) stream data layer: (individual-user, URL, second)
23On-Line Materialization vs. On-Line Computation
- On-line materialization
- Materialization takes precious resources and time
- Only incremental materialization (with sliding window)
- Only materialize cuboids of the critical layers?
- Some intermediate cells should also be materialized
- Popular path approach vs. exception cell approach
- Materialize intermediate cells along the popular paths
- Exception cells: how to set up exception thresholds?
- Notice: exceptions do not have monotonic behaviour
- Computation problem
- How to compute and store stream cubes efficiently?
- How to discover unusual cells between the critical layers?
24Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cuboids from the m-layer (A2, B2, C2) up to the o-layer (A1, *, C1), through intermediate cuboids including (A1, *, C2), (A2, *, C2), (A2, B1, C1), (A1, B1, C2), (A1, B2, C1), (A2, B1, C2), (A2, B2, C1) and (A1, B2, C2).]
25Stream Cube Computation
- Cube structure from m-layer to o-layer
- Three approaches
- All cuboids approach
- Materializing all cells (too much in both space and time)
- Exceptional cells approach
- Materializing only exceptional cells (saves space but not computation time, and the definition of exception is not flexible)
- Popular path approach
- Computing and materializing cells only along a popular path
- Using an H-tree structure to store computed cells (which form the stream cube: a selectively materialized cube)
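The popular path idea can be sketched as a chain of roll-ups in which each cuboid is computed from the one before it rather than from the m-layer. This is a toy illustration only: the `THEME` hierarchy, the cell counts, and the names `roll_up` and `popular_path` are hypothetical, and the time dimension is left out.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: URL group -> theme
THEME = {"nba": "sports", "nfl": "sports", "senate": "politics"}

def roll_up(cells, parent_key):
    """Aggregate a cuboid into a coarser one; parent_key maps a cell key to its parent."""
    out = defaultdict(int)
    for key, count in cells.items():
        out[parent_key(key)] += count
    return dict(out)

def popular_path(m_layer, steps):
    """Materialize only the cuboids on one roll-up path, each from its predecessor."""
    cuboids = [m_layer]
    for step in steps:
        cuboids.append(roll_up(cuboids[-1], step))
    return cuboids

# m-layer cells keyed by (user-group, URL-group), with made-up click counts
m_layer = {("uiuc", "nba"): 3, ("uiuc", "nfl"): 2,
           ("uic", "nba"): 4, ("uiuc", "senate"): 1}
steps = [
    lambda key: (key[0], THEME[key[1]]),   # -> (user-group, theme)
    lambda key: ("*", key[1]),             # -> (*, theme), the o-layer
]
by_group_theme, o_layer = popular_path(m_layer, steps)[1:]
```

Cuboids off the path, such as (*, URL-group), are not stored here and would be computed on demand from the nearest materialized cuboid below them.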
26An H-Tree Cubing Structure
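[Figure: an H-tree. Root; theme nodes sports, politics, entertainment; below them nodes uiuc and uic; below those leaves jeff, Jim, mary, each with Q.I. attached. The levels are annotated "Observation layer" and "Minimal int. layer".]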
27Partial Materialization Using H-Tree
- H-tree
- Introduced for computing data cubes and iceberg cubes
- J. Han, J. Pei, G. Dong, and K. Wang, Efficient Computation of Iceberg Cubes with Complex Measures, SIGMOD'01
- Compressed database, fast cubing, and space preserving in cube computation
- Using H-tree for partial stream cubing
- Space preserving
- Intermediate aggregates can be computed incrementally and saved in tree nodes
- Facilitates computing other cells and multi-dimensional analysis
- An H-tree with computed cells can be viewed as a stream cube
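The point about saving intermediate aggregates in tree nodes can be sketched with a plain prefix tree. This is a simplification that omits the header tables and side links of the H-tree in the SIGMOD'01 paper; `HNode`, `HTree`, the attribute order, and the counts are all hypothetical.

```python
class HNode:
    def __init__(self):
        self.total = 0        # aggregate stored in the node (a count here)
        self.children = {}    # attribute value -> child node

class HTree:
    def __init__(self):
        self.root = HNode()

    def insert(self, path, value):
        """Add `value` for one cell; `path` lists its attribute values
        from the coarsest level down to the finest."""
        node = self.root
        node.total += value
        for label in path:
            node = node.children.setdefault(label, HNode())
            node.total += value   # every ancestor is updated incrementally

    def lookup(self, prefix):
        """Aggregate of the cell identified by a path prefix."""
        node = self.root
        for label in prefix:
            node = node.children[label]
        return node.total

tree = HTree()
for path, clicks in [(("sports", "uiuc", "jeff"), 3),
                     (("sports", "uiuc", "mary"), 2),
                     (("sports", "uic", "jim"), 4),
                     (("politics", "uiuc", "jeff"), 1)]:
    tree.insert(path, clicks)
```

With the attributes ordered along the popular path, each tree level corresponds to one cuboid on that path, so `tree.lookup(("sports",))` or `tree.lookup(("sports", "uiuc"))` answers a roll-up query without touching the leaves.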
28Time and Space vs. Number of Tuples at the
m-Layer (Dataset D3L3C10T400K)
a) Time vs. m-layer size
b) Space vs. m-layer size
29Time and Space vs. the Number of Levels
a) Time vs. levels
b) Space vs. levels
30Mining Other Unusual Patterns in Stream Data
- Clustering and outlier analysis for stream mining
- Clustering data streams (Guha, Motwani et al. 2000-2002)
- History-sensitive, high-quality incremental clustering
- Classification of stream data
- Evolution of decision trees: Domingos et al. (2000, 2001)
- Incremental integration of new streams in decision-tree induction
- Frequent pattern analysis
- Approximate frequent patterns (Manku and Motwani, VLDB'02)
- Evolution and dramatic changes of frequent patterns
31Conclusions
- Stream data mining: a rich and largely unexplored field
- Current research focus in database community
- DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
- Powerful tools for finding general and unusual patterns
- Effectiveness, efficiency and scalability: lots of open problems
- Our philosophy
- A multi-dimensional stream analysis framework
- Time is a special dimension: tilted time frame
- What to compute and what to save? Critical layers
- Very partial materialization/precomputation: popular path approach
- Mining dynamics of stream data
32References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Models and issues in data stream systems, PODS'02 (tutorial).
- S. Babu and J. Widom, Continuous queries over data streams, SIGMOD Record, 30(3):109--120, 2001.
- Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, Online analytical processing stream data: Is it feasible?, DMKD'02.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-dimensional regression analysis of time-series data streams, VLDB'02.
- P. Domingos and G. Hulten, Mining high-speed data streams, KDD'00.
- M. Garofalakis, J. Gehrke, and R. Rastogi, Querying and mining data streams: You only get one look, SIGMOD'02 (tutorial).
- J. Gehrke, F. Korn, and D. Srivastava, On computing correlated aggregates over continuous data streams, SIGMOD'01.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, Clustering data streams, FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos, Mining time-changing data streams, KDD'01.
33www.cs.uiuc.edu/hanj