cs412slides

Slides: 58

Provided by: jiaw186

1
(No Transcript)
2
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Characteristics and challenges of data streams
Stream data cubing
Stream data clustering
Stream classification and anomaly analysis
Debugging sensor network systems by data mining

3
Characteristics of Data Streams

Data Streams
Data streams: continuous, ordered, changing, fast,
huge amount
Traditional DBMS: data stored in finite,
persistent data sets
Characteristics of data streams
Huge volumes of continuous data, possibly
infinite
Fast changing and requires fast, real-time
response
Data streams nicely capture today's data
processing needs
Random access is expensive: single-scan algorithms
(can only have one look)
Store only a summary of the data seen thus far
Most stream data are low-level or
multi-dimensional in nature and need multi-level
and multi-dimensional processing

4
Stream Data Applications

Telecommunication: calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering and industrial processes: power supply
and manufacturing
Sensor, monitoring, and surveillance: video streams,
RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even when saved, random access
is too expensive)

5
Architecture for Stream Query/Mining Processing
[Figure: architecture diagram with components User/Application, SDMS (Stream Data Management System), Stream Query Processor, Scratch Space (main memory and/or disk), multiple input streams, and results]
6
Challenges for Mining Dynamics in Data Streams

Most stream data are low-level or
multi-dimensional in nature: they need multi-level
and multi-dimensional (ML/MD) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multiple
dimensions/levels
Fast, real-time detection and response
Comparison with data cubes: similarities and
differences
Stream (data) cube or stream OLAP: is this
feasible?
Can we implement it efficiently?

7
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Characteristics and challenges of data streams
Stream data cubing
Stream data clustering
Stream classification and anomaly analysis
Debugging sensor network systems by data mining

8
Multi-Dimensional Stream Analysis Examples

Analysis of Web click streams
Raw data at low levels: seconds, web page
addresses, user IP addresses, ...
Analysts want changes, trends, and unusual patterns
at reasonable levels of detail
E.g., average clicking traffic in North America
on sports in the last 15 minutes is 40% higher
than that in the last 24 hours.
Analysis of power consumption streams
Raw data: power consumption flow for every
household, every minute
Patterns one may find: average hourly power
consumption of manufacturing companies in Chicago
in the last 2 hours today is 30% higher than
that of the same day a week ago

9
A Stream Cube Architecture

A tilted time frame
Different time granularities
second, minute, quarter, hour, day, week, ...
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User watches at o-layer and occasionally needs
to drill down to m-layer
Partial materialization of stream cubes
Full materialization: too space- and time-consuming
No materialization: slow response at query time
Partial materialization: what do we mean by
"partial"?

10
Cube: A Lattice of Cuboids
time,item
time,item,location
time, item, location, supplier
11
Time Dimension: A Tilted Time Model

Natural tilted time frame
Example: minimal unit is a quarter, then 4 quarters → 1
hour, 24 hours → 1 day, ...
Logarithmic tilted time frame
Example: minimal unit is 1 minute, then 1, 2, 4, 8, 16,
32, ...
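
To see why the logarithmic frame is compact, here is a minimal sketch that lists the slot widths needed to cover a given history length. The function name `logarithmic_slots` is hypothetical, and the sketch assumes each slot is exactly twice as wide as the previous one, starting at one time unit.

```python
def logarithmic_slots(total_units):
    """Return slot widths 1, 2, 4, ... until they cover total_units time units."""
    if total_units < 0:
        raise ValueError("total_units must be non-negative")
    widths = []
    width, covered = 1, 0
    while covered < total_units:
        widths.append(width)   # one more slot at the current granularity
        covered += width
        width *= 2             # each older slot is twice as coarse
    return widths


# A year of one-minute readings fits in a handful of slots.
minutes_per_year = 60 * 24 * 365
print(len(logarithmic_slots(minutes_per_year)))
```

The number of slots grows with the logarithm of the history length, which is what makes this frame attractive when only a summary can be stored.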

12
A Tilted Time Model (2)

Pyramidal tilted time frame
Example: Suppose there are 5 frames and each
holds at most 3 snapshots
Given a snapshot number N, if N mod 2^d = 0,
insert it into frame number d. If a frame then holds
more than 3 snapshots, kick out the oldest one.
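
The insertion rule can be sketched as follows. The slide does not say which frame to use when several values of d divide N, so this sketch assumes the largest such d, capped at a maximum frame index. The names `build_frames`, `max_frame`, and `capacity` are hypothetical.

```python
from collections import deque


def frame_index(snapshot_number, max_frame):
    """Largest d with snapshot_number mod 2^d == 0, capped at max_frame (assumption)."""
    d = 0
    while snapshot_number % 2 == 0 and d < max_frame:
        snapshot_number //= 2
        d += 1
    return d


def build_frames(last_snapshot, max_frame=5, capacity=3):
    """Insert snapshots 1..last_snapshot; each frame keeps only its newest `capacity`."""
    frames = {d: deque(maxlen=capacity) for d in range(max_frame + 1)}
    for n in range(1, last_snapshot + 1):
        # deque(maxlen=...) drops the oldest snapshot automatically when full
        frames[frame_index(n, max_frame)].append(n)
    return frames


frames = build_frames(70)
for d, snapshots in frames.items():
    print(d, list(snapshots))
```

Recent snapshots are kept at fine granularity in the low frames, while the high frames retain a few widely spaced older ones.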

13
Two Critical Layers in the Stream Cube
14
OLAP Operation and Cube Materialization

OLAP (Online Analytical Processing) operations
Roll up (drill-up): summarize data
by climbing up a hierarchy or by dimension
reduction
Drill down (roll down): reverse of roll-up,
from higher-level summary to lower-level summary
or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization,
3D to series of 2D planes
Cube partial materialization
Store some pre-computed cuboids for fast online
processing
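
As a small illustration of roll-up by dimension reduction, the sketch below aggregates a (time, item, location) cuboid into coarser cuboids. The data, the `roll_up` helper, and the choice of a summed measure are all assumptions made for the example.

```python
from collections import defaultdict


def roll_up(cells, keep):
    """Sum the measure over every dimension whose index is not listed in `keep`."""
    result = defaultdict(int)
    for key, measure in cells.items():
        result[tuple(key[i] for i in keep)] += measure
    return dict(result)


# Hypothetical base cuboid keyed by (time, item, location)
base = {
    ("Q1", "tv", "Chicago"): 5,
    ("Q1", "tv", "Boston"): 3,
    ("Q1", "radio", "Chicago"): 2,
    ("Q2", "tv", "Chicago"): 4,
}

time_item = roll_up(base, keep=(0, 1))      # drop location
time_only = roll_up(time_item, keep=(0,))   # then drop item
print(time_item)
print(time_only)
```

Computing `time_only` from `time_item` rather than from `base` is the point of storing pre-computed cuboids: a coarser cuboid can be derived from any finer one that has already been materialized.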

15
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Characteristics and challenges of data streams
Stream data cubing
Stream data clustering
Stream classification and anomaly analysis
Debugging sensor network systems by data mining

16
On-Line Partial Materialization vs. OLAP
Processing

On-line materialization
Materialization takes precious space and time
Only incremental materialization (with tilted
time frame)
Only materialize cuboids of the critical
layers?
Online computation may take too much time
Preferred solution
Popular-path approach: materialize the cuboids along
the popular drilling paths
H-tree structure: such cuboids can be computed
and stored efficiently using the H-tree structure
Online aggregation vs. query-based computation
Online computing while streaming: aggregating
stream cubes
Query-based computation: using computed cuboids

17
Stream Cube Structure: From m-layer to o-layer
18
An H-Tree Cubing Structure
Minimal int. layer
19
Benefits of H-Tree and H-Cubing

H-tree and H-cubing
Developed for computing data cubes and iceberg
cubes
J. Han, J. Pei, G. Dong, and K. Wang, "Efficient
Computation of Iceberg Cubes with Complex
Measures", SIGMOD'01
Fast cubing, space preserving in cube computation
Using H-tree for stream cubing
Space preserving
Intermediate aggregates can be computed
incrementally and saved in tree nodes
Facilitate computing other cells and
multi-dimensional analysis
H-tree with computed cells can be viewed as
stream cube

20
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Characteristics and challenges of data streams
Stream data cubing
Stream data clustering
Stream classification and anomaly analysis
Debugging sensor network systems by data mining

21

Stream Clustering: A K-Median Approach

O'Callaghan et al., "Streaming-Data Algorithms
for High-Quality Clustering" (ICDE'02)
Based on the k-median method
Data stream points come from a metric space
Find k clusters in the stream s.t. the sum of
distances from data points to their closest
center is minimized
Constant-factor approximation algorithm
In small space, a simple two-step algorithm
For each set of M records, S_i, find O(k) centers
in S_1, ..., S_l
Local clustering: assign each point in S_i to its
closest center
Let S' be the centers for S_1, ..., S_l with each center
weighted by the number of points assigned to it
Cluster S' to find k centers
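
The two-step scheme can be sketched on one-dimensional data as below. This is not the approximation algorithm from the paper: the sketch swaps in a brute-force discrete k-median (centers chosen from the input points) as the clustering subroutine, and uses exactly k local centers per chunk instead of O(k). The names `weighted_k_median` and `stream_k_median` are hypothetical.

```python
from itertools import combinations


def weighted_k_median(points, weights, k):
    """Exact discrete k-median by exhaustive search (stand-in subroutine, small inputs only)."""
    candidates = sorted(set(points))
    k = min(k, len(candidates))
    best_cost, best_centers = None, None
    for centers in combinations(candidates, k):
        cost = sum(w * min(abs(p - c) for c in centers)
                   for p, w in zip(points, weights))
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    return list(best_centers)


def stream_k_median(stream, chunk_size, k):
    """Cluster each chunk locally, then cluster the weighted local centers."""
    centers, center_weights = [], []
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        local = weighted_k_median(chunk, [1] * len(chunk), k)
        counts = {c: 0 for c in local}
        for p in chunk:
            # local clustering: assign each point to its closest center
            counts[min(local, key=lambda c: abs(p - c))] += 1
        for c in local:
            centers.append(c)
            center_weights.append(counts[c])   # weight = points assigned
    return weighted_k_median(centers, center_weights, k)


stream = [10, 12, 8, 100, 102, 98, 11, 9, 10, 101, 99, 100]
print(stream_k_median(stream, chunk_size=6, k=2))
```

Only the weighted centers survive from each chunk, so memory use depends on the number of chunks and k rather than on the length of the stream.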

22

Hierarchical Clustering Tree

level-(i+1) medians
level-i medians
data points
23
Hierarchical Tree and Drawbacks

Method
Maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1)
medians of weight equal to the sum of the weights
of the intermediate medians assigned to them
Drawbacks
Low quality for evolving data streams (register
only k centers)
Limited functionality in discovering and
exploring clusters over different portions of the
stream over time

24
Clustering for Mining Stream Dynamics

Network intrusion detection: one example
Detect bursts of activities or abrupt changes in
real time by online clustering
Our methodology (C. Aggarwal, J. Han, J. Wang,
P.S. Yu, VLDB'03)
Tilted time framework: otherwise, dynamic changes
cannot be found
Micro-clustering: better quality than
k-means/k-median
(incremental, online processing and maintenance)
Two stages: micro-clustering and macro-clustering
With limited overhead, achieves high
efficiency, scalability, quality of results, and
power of evolution/change detection

25
CluStream: A Framework for Clustering Evolving
Data Streams

Design goal
High quality for clustering evolving data streams
with greater functionality
While keeping the stream mining requirements in mind
One pass over the original stream data
Limited space usage and high efficiency
CluStream: A framework for clustering evolving
data streams
Divide the clustering process into online and
offline components
Online component: periodically stores summary
statistics about the stream data
Offline component: answers various user questions
based on the stored summary statistics

26
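BIRCH: A Micro-Clustering Approach
Clustering Feature: $CF = (N, LS, SS)$, where $N$ is the number of
data points, $LS = \sum_{i=1}^{N} X_i$ is the linear sum, and $SS = \sum_{i=1}^{N} X_i^2$ is the square sum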
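
The clustering feature is useful because it is additive: two CFs can be merged by summing their components, and the centroid and radius can be recovered without the original points. The sketch below is a minimal illustration. The class and method names are hypothetical, and it assumes the square sum is kept per dimension.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int        # number of points summarized
    ls: tuple     # per-dimension linear sum
    ss: tuple     # per-dimension square sum (assumption: kept per dimension)

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), tuple(x * x for x in point))

    def __add__(self, other):
        # Additivity: the CF of a union is the component-wise sum
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  tuple(a + b for a, b in zip(self.ss, other.ss)))

    def centroid(self):
        return tuple(s / self.n for s in self.ls)

    def radius(self):
        # Root-mean-square distance of the summarized points from the centroid
        variance = sum(ss / self.n - (ls / self.n) ** 2
                       for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(variance, 0.0))


merged = CF.from_point((1.0, 0.0)) + CF.from_point((2.0, 0.0)) + CF.from_point((3.0, 0.0))
print(merged.centroid(), merged.radius())
```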

27
The CluStream Framework

Micro-cluster
Statistical information about data locality
Temporal extension of the cluster-feature vector
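Multi-dimensional points $\bar{X}_1, \ldots, \bar{X}_k, \ldots$ with time
stamps $T_1, \ldots, T_k, \ldots$
Each point contains d dimensions, i.e., $\bar{X}_i = (x_i^1, \ldots, x_i^d)$
A micro-cluster for n points is defined as a $(2 \cdot d + 3)$
tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$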
Pyramidal time frame
Decide at what moments the snapshots of the
statistical information are stored away on disk

28
CluStream: Pyramidal Time Frame

Pyramidal time frame
Snapshots of a set of micro-clusters are stored
following the pyramidal pattern
They are stored at differing levels of
granularity depending on the recency
Snapshots are classified into different orders
varying from 1 to log(T)
The i-th order snapshots occur at intervals of $\alpha^i$,
where $\alpha \geq 1$
Only the last $(\alpha + 1)$ snapshots of each order are stored

29
CluStream: Clustering Online Streams

Online micro-cluster maintenance
Initial creation of q micro-clusters
q is usually significantly larger than the number
of natural clusters
Online incremental update of micro-clusters
If a new point is within the max-boundary, insert it into
the micro-cluster
Otherwise, create a new micro-cluster
May delete an obsolete micro-cluster or merge the two
closest ones
Query-based macro-clustering
Based on a user-specified time horizon h and the
number of macro-clusters k, compute macro-clusters
using the k-means algorithm
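
The online update step can be sketched as follows. This is a simplified illustration, not the CluStream implementation: it assumes a fixed `boundary` radius for every micro-cluster in place of a boundary derived from each cluster's own spread, it omits time stamps and the deletion of obsolete micro-clusters, and it always frees space by merging the two closest centroids. All names are hypothetical.

```python
import math


class MicroCluster:
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                 # linear sum per dimension
        self.ss = [x * x for x in point]      # square sum per dimension

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]


def update(clusters, point, boundary, q):
    """Insert one point, keeping at most q micro-clusters."""
    if clusters:
        nearest = min(clusters, key=lambda c: math.dist(c.centroid(), point))
        if math.dist(nearest.centroid(), point) <= boundary:
            nearest.absorb(point)             # within the max-boundary
            return
    clusters.append(MicroCluster(point))      # otherwise start a new micro-cluster
    if len(clusters) > q:
        # Over budget: merge the two micro-clusters with the closest centroids
        _, i, j = min((math.dist(a.centroid(), b.centroid()), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        clusters[i].merge(clusters[j])
        del clusters[j]


clusters = []
for p in [(0, 0), (0.5, 0), (10, 10), (10.5, 10), (50, 50)]:
    update(clusters, p, boundary=1.0, q=2)
print([(c.n, c.centroid()) for c in clusters])
```

Because each micro-cluster stores only sums, absorbing a point and merging two clusters are both constant-size updates, which is what keeps the online component one-pass.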

30
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Characteristics and challenges of data streams
Stream data cubing
Stream data clustering
Stream classification and anomaly analysis
Debugging sensor network systems by data mining

31
Stream Classification and Concept Drifts

Stream Classification
Construct a classification model based on past
records
Use the model to predict labels for new data
Help decision making
Concept drifts
Define and analyze concept drifts in data streams
Show that expected error is not directly related
to concept drifts
Classify data stream with skewed distribution
(i.e., rare events)
Employ both sampling and ensemble techniques
Results indicate the proposed method reduces
classification errors on the minority class

32
Concept Drifts

Changes in P(x, y), where x is the feature vector and y is the class label
P(x, y) = P(y|x) P(x)
Four possibilities
No change: P(y|x) and P(x) remain unchanged
Feature change: only P(x) changes
Conditional change: only P(y|x) changes
Dual change: both P(y|x) and P(x) change
Expected error
No matter how concept changes, the expected error
could increase, decrease, or remain unchanged
Training on the most recent data could help
reduce expected error
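
The four cases can be made concrete on small discrete distributions. The sketch below factors two joint distributions into P(x) and P(y|x) and reports which part changed. The function names and the tolerance `tol` are hypothetical, and real streams would need estimated rather than exact probabilities.

```python
def factor(joint):
    """Split a joint distribution {(x, y): p} into P(x) and P(y|x)."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    py_given_x = {(x, y): p / px[x] for (x, y), p in joint.items() if px[x] > 0}
    return px, py_given_x


def differs(a, b, tol):
    return any(abs(a.get(k, 0.0) - b.get(k, 0.0)) > tol for k in set(a) | set(b))


def drift_type(old, new, tol=1e-9):
    px_old, cond_old = factor(old)
    px_new, cond_new = factor(new)
    feature = differs(px_old, px_new, tol)          # did P(x) change?
    conditional = differs(cond_old, cond_new, tol)  # did P(y|x) change?
    if feature and conditional:
        return "dual change"
    if feature:
        return "feature change"
    if conditional:
        return "conditional change"
    return "no change"


old = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Same P(y|x) as `old`, but P(x=0) moves from 0.5 to 0.8
shifted_features = {(0, 0): 0.64, (0, 1): 0.16, (1, 0): 0.04, (1, 1): 0.16}
print(drift_type(old, shifted_features))
```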

33
Issues in stream classification

Descriptive model vs. generative model
Generative models assume data follows some
distribution while descriptive models make no
assumptions
Distribution of stream data is unknown and may
evolve, so descriptive model is better
Label prediction vs. probability estimation
Classify test examples into one class or estimate
P(y|x) for each y
Probability estimation is better
Stream applications may be stochastic (an example
could be assigned to several classes with
different probabilities)
Probability estimates provide confidence
information and could be used in post processing

34
Mining Skewed Data Stream

Skewed distribution
Seen in many stream applications where positive
examples are much rarer than negative
ones.
Credit card fraud detection, network intrusion
detection
Existing stream classification methods
Evaluate their methods on data with balanced
class distribution
Problems of these methods on skewed data
Tend to ignore positive examples due to the small
number
The cost of misclassifying positive examples is
usually huge, e.g., misclassifying credit card
fraud as normal

35
Stream Ensemble Approach (1)
[Figure: stream chunks S_1, S_2, ..., S_m, S_{m+1} and a classification model]
S_m as training data? Positive examples not
sufficient!
36
Stream Ensemble Approach (2)
[Figure: sampling from stream chunks S_1, S_2, ..., S_m to train an ensemble of classifiers C_1, C_2, ..., C_k]
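
One plausible reading of the sampling step is sketched below. The slides only name "sampling" and "ensemble", so the details are assumptions: positives are carried forward from every chunk seen so far, and the negatives of the latest chunk are split into k disjoint subsets, giving one balanced training set per ensemble member. Training the classifiers and averaging their outputs are left out, and the function name is hypothetical.

```python
import random


def build_training_sets(chunks, k, seed=0):
    """chunks: list of chunks, each a list of (features, label) with label 1 = positive."""
    positives = [ex for chunk in chunks for ex in chunk if ex[1] == 1]
    negatives = [ex for ex in chunks[-1] if ex[1] == 0]
    rng = random.Random(seed)
    rng.shuffle(negatives)
    # Each ensemble member sees all positives plus its own slice of the negatives
    return [positives + negatives[i::k] for i in range(k)]


# Three hypothetical chunks, each with 1 positive and 6 negative examples
chunks = [[((c, i), 1 if i == 0 else 0) for i in range(7)] for c in range(3)]
training_sets = build_training_sets(chunks, k=3)
print([len(s) for s in training_sets])
```

Under these assumptions each training set holds three positives and two negatives, a far less skewed mix than the 1:6 ratio inside any single chunk.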
37
Analysis

Error Reduction
Sampling
Ensemble
Efficiency Analysis
Single model
Ensemble
Ensemble is more efficient

38
Experiments(1)

Test on concept-drift streams

39
Experiments(2)

Test on real data

40
Experiments(3)

Model accuracy

41
Experiments (4)

Training time

42
Summary: Stream Data Mining

Stream data mining: a rich and ongoing research
field
Current research focus in database community
DSMS system architecture, continuous query
processing, supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual
patterns
Effectiveness, efficiency, and scalability: lots
of open problems
Our philosophy on stream data analysis and mining
A multi-dimensional stream analysis framework
Time is a special dimension: tilted time frame
What to compute and what to save? Critical layers
Partial materialization and precomputation
Mining dynamics of stream data

43
References on Stream Mining

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
"Multi-Dimensional Regression Analysis of
Time-Series Data Streams", Proc. 2002 Int. Conf.
on Very Large Data Bases (VLDB'02), Hong Kong,
China, Aug. 2002.
C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu,
"Mining Frequent Patterns in Data Streams at
Multiple Time Granularities", in H. Kargupta, A.
Joshi, K. Sivakumar, and Y. Yesha (eds.), Next
Generation Data Mining, 2003.
H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining
Concept-Drifting Data Streams using Ensemble
Classifiers", Proc. 2003 ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining (KDD'03),
Washington, D.C., Aug. 2003.
C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A
Framework for Clustering Evolving Data Streams",
Proc. 2003 Int. Conf. on Very Large Data Bases
(VLDB'03), Berlin, Germany, Sept. 2003.
Y. D. Cai, D. Clutter, G. Pape, J. Han, M. Welge,
and L. Auvil, "MAIDS: Mining Alarming Incidents
from Data Streams" (system demonstration), Proc.
2004 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'04), Paris, France, June 2004.
C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "On
Demand Classification of Data Streams", Proc.
2004 Int. Conf. on Knowledge Discovery and Data
Mining (KDD'04), Seattle, WA, Aug. 2004.
C. Aggarwal, J. Han, J. Wang, and P. S. Yu,
"A Framework for Projected Clustering of High
Dimensional Data Streams", Proc. 2004 Int. Conf.
on Very Large Data Bases (VLDB'04), Toronto,
Canada, Aug. 2004.
Jiawei Han, Yixin Chen, Guozhu Dong, Jian Pei,
Benjamin W. Wah, Jianyong Wang, and Y. Dora Cai,
"Stream Cube: An Architecture for
Multi-Dimensional Analysis of Data Streams",
Distributed and Parallel Databases, 18(2):
173-197, 2005.
J. Yang, X. Yan, J. Han, and W. Wang,
"Discovering Evolutionary Classifier over High
Speed Non-static Stream", in S. Bandyopadhyay et
al. (eds.), Advanced Methods for Knowledge
Discovery from Complex Data, Springer Verlag,
2005.
Hongyan Liu, Ying Lu, Jiawei Han, and Jun He,
"Error-Adaptive and Time-Aware Maintenance of
Frequency Counts over Data Streams", in Proc.
2006 Int. Conf. on Web-Age Information Management
(WAIM'06), Hong Kong, China, June 2006.
Jing Gao, Wei Fan, and Jiawei Han, "A General
Framework for Mining Concept-Drifting Data
Streams with Skewed Distributions", in Proc. 2007
SIAM Int. Conf. on Data Mining (SDM'07),
Minneapolis, MN, April 2007.
Jing Gao, Wei Fan, and Jiawei Han, "On
Appropriate Assumptions to Mine Data Streams:
Analysis and Practice", Proc. 2007 Int. Conf. on
Data Mining (ICDM'07), Omaha, NE, Oct. 2007.

44
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Debugging sensor network systems by data mining
Software bug mining
Bug mining for sensor networks

45
Data Mining for Software Engineering and Computer
System Analysis

Software bug localization and failure proximity
C. Liu, Z. Lian, and J. Han, "How Bayesians
Debug", ICDM'06.
C. Liu and J. Han, "Failure Proximity: A
Fault Localization-Based Approach", FSE'06.
C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff,
"SOBER: Statistical Model-based Bug
Localization", FSE 2005.
Detection of software plagiarism
C. Liu, C. Chen, J. Han, and P. S. Yu, "GPLAG:
Detection of Software Plagiarism by Procedure
Dependency Graph Analysis", KDD'06.
Mining copy-paste bugs in operating systems
Z. Li, S. Lu, S. Myagmar, and Y. Zhou, "CP-Miner:
A Tool for Finding Copy-paste and Related Bugs in
Operating System Code", OSDI'04.

46
SOBER: Bug Localization Based on Classification
of Statistical Distribution of Statement Execution

[Figure: executions labeled Failing and Passing]

C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff,
"SOBER: Statistical Model-based Bug
Localization", FSE 2005.

47
A Comparison with Other Approaches
48
Failure Clustering Based on Likely Fault Roots

Failure indexing
Identify failures likely due to the same bug

C. Liu and J. Han, "Failure Proximity: A Fault
Localization-Based Approach", FSE'06.

49
D1. Data Mining in Data Stream and Sensor Networks

Mining data streams
Debugging sensor network systems by data mining
Software bug mining
Bug mining for sensor networks

50
Challenges in Developing Robust Sensor Network
Systems

Developing robust sensor network systems is
tricky and frustrating
Bugs in networked sensor systems are often caused
by complex interactions between multiple,
often individually non-faulty components
Bugs are often not repeatable: the particular
sequence of events that invokes a bug may not
be easy to reconstruct
Current status: most of the development time is
spent debugging and troubleshooting the current
code → greatly reduces productivity

51
DustMiner: Troubleshooting Interactive Complexity
Bugs in Sensor Networks

DustMiner: mine sequences of events that may be
responsible for faulty behavior, as opposed to
localized bugs in one module
M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J.
Han, "DustMiner: Troubleshooting Interactive
Complexity Bugs in Sensor Networks", Proc. 2008
ACM Int. Conf. on Embedded Networked Sensor
Systems (SenSys'08), Raleigh, NC, Nov. 2008
Architecture
Front end: collects runtime logs of the system being
debugged
Offline back end: frequent, discriminative pattern
mining to uncover likely causes of failure

52
Major Difficulties of SNTS

Our (Tarek Abdelzaher et al.'s) previous sensor
network debugging system, SNTS (DCOSS 2007),
extracts conditions on current network state
correlated with failure
It mines frequent patterns (those that occur frequently when
the bugs manifest); however, the cause of a
problem is often an infrequent pattern
DustMiner: automated discriminative sequence
mining, containing two phases
Identifies frequent patterns that correlate with
failures, as before
Focuses on those patterns, correlating them with
(infrequent) events that may have caused them,
hence uncovering the true root of the problem

53
Architecture of DustMiner
54
Preventing False Frequent Patterns Using a Dynamic
Search Window

Two sample sequences, with different behaviors
S1: <a, b, c, d, a, b, c, d>
S2: <a, b, c, d, a, c, b, d>
The system fails when <a> is followed by <c>
before <b>
How to detect <a, c, b> as a discriminative
pattern?
Apriori will not be able to detect it
Solution: use a dynamic search window scheme
Suppose the search window is 1, 4, 4, 8 in
both sequences
Then <a, c, b> will only be found in sequence S2
The dynamic search window scheme will also
speed up the search significantly
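
The effect of restricting the search to a window can be checked on the two sample sequences. The sketch below assumes fixed, non-overlapping windows of four events (positions 1 to 4 and 5 to 8), a simplification of the dynamic scheme. The function names are hypothetical.

```python
def is_subsequence(pattern, events):
    """True if pattern occurs in events in order (not necessarily contiguously)."""
    remaining = iter(events)
    return all(item in remaining for item in pattern)


def found_in_windows(pattern, sequence, window_size):
    """Search each fixed-size window separately instead of the whole sequence."""
    return any(is_subsequence(pattern, sequence[start:start + window_size])
               for start in range(0, len(sequence), window_size))


s1 = list("abcdabcd")
s2 = list("abcdacbd")
pattern = list("acb")

# Without a window, the pattern is falsely matched in S1 across the two rounds
print(is_subsequence(pattern, s1), is_subsequence(pattern, s2))
# With a window of 4 events, only S2 contains it
print(found_in_windows(pattern, s1, 4), found_in_windows(pattern, s2, 4))
```

Searching inside windows also shrinks the number of candidate matches, which is consistent with the speed-up claimed for the window scheme.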

55
Suppressing Redundant Subsequences

Two sample sequences, with different behaviors
S1: <a, b, c, d, a, b, c, d>
S2: <a, b, c, d, a, b, d, c>
The system has to have <a> followed by <c> before
<d>
E.g., <enableRadio>, <messageSent>, <ackR>,
<disableRadio>
How to detect <a, d, c> as an error pattern?
Apriori will not be able to detect it
Solution: use a sequence compression scheme
Remove sequence Si if it is a subsequence of Sj
with the same support
<a, b, d> will be removed from S1 but retained in
S2
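
The compression rule (drop a pattern when a longer pattern with the same support contains it) can be sketched as below. The input format, a dict from pattern tuples to support counts, and the sample supports are assumptions for illustration.

```python
def is_subsequence(short, long):
    """True if `short` occurs in `long` in order (not necessarily contiguously)."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def compress(patterns):
    """Keep only patterns not subsumed by another pattern with equal support."""
    kept = {}
    for pattern, support in patterns.items():
        redundant = any(other != pattern
                        and other_support == support
                        and is_subsequence(pattern, other)
                        for other, other_support in patterns.items())
        if not redundant:
            kept[pattern] = support
    return kept


# Hypothetical mined patterns with their support counts
patterns = {
    ("a", "b"): 5,
    ("a", "b", "c"): 5,
    ("a", "c"): 7,
    ("b", "c"): 5,
}
print(compress(patterns))
```

A pattern with a different support than its super-pattern is kept, since its count carries information the longer pattern does not.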

56
Two-Stage Mining for Infrequent Events

In debugging, sometimes less frequent patterns
could be more indicative
E.g., a single node reboot event can cause a large
number of message losses
Frequent pattern (FP) mining will miss the real
cause!
Observation: much computation in sensor networks
is recurrent
Two-stage mining
Catch such recurrent symptoms (such as multiple
subsequent message losses or false alarms) by FP
mining
Narrow down the search space and correlate these
symptoms with other, less frequent preceding event
occurrences

57
Experiment I LiteOSBug

Troubleshoot a simple data collection application
where several sensors monitor light and report it
to a sink node
Discriminative patterns found only on good logs
Discriminative patterns found only on bad logs

58
References on Trouble-Shooting in Software and
Networked Sensor Systems

M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J.
Han, "DustMiner: Troubleshooting Interactive
Complexity Bugs in Sensor Networks", Proc. 2008
ACM Int. Conf. on Embedded Networked Sensor
Systems (SenSys'08), Raleigh, NC, Nov. 2008
M. K. Ramanathan, A. Grama, and S. Jagannathan,
"Path-Sensitive Inference of Function Precedence
Protocols", ICSE 2007
Z. Li and Y. Zhou, "PR-Miner: Automatically
Extracting Implicit Programming Rules and
Detecting Violations in Large Software Code",
ESEC/FSE 2005
B. Livshits and T. Zimmermann, "DynaMine: Finding
Common Error Patterns by Mining Software Revision
Histories", ESEC/FSE 2005
D. Andrzejewski, A. Mulhern, B. Liblit, and X.
Zhu, "Statistical Debugging Using Latent Topic
Models", ECML 2007
B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M.
I. Jordan, "Scalable Statistical Bug Isolation",
PLDI 2005
C. Liu, L. Fei, X. Yan, J. Han, and S. Midkiff,
"Statistical Debugging: A Hypothesis Testing-Based
Approach", IEEE TSE 2006
C. Liu, Z. Lian, and J. Han, "How Bayesians Debug",
ICDM 2006
C. Liu, X. Yan, and J. Han, "Mining Control Flow
Abnormality for Logic Error Isolation", SDM 2006
C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff,
"SOBER: Statistical Model-Based Bug Localization",
ESEC/FSE 2005
C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu,
"Mining Behavior Graphs for 'Backtrace' of
Noncrashing Bugs", SDM 2005
A. X. Zheng, M. I. Jordan, B. Liblit, M. Naik,
and A. Aiken, "Statistical Debugging: Simultaneous
Identification of Multiple Bugs", ICML 2006
I. Cohen, M. Goldszmidt, T. Kelly, and J. Symons,
"Correlating instrumentation data to system
states: A building block for automated diagnosis
and control", OSDI 2004.
J. Platt, E. Kiciman, and D. Maltz, "Fast
Variational Inference for Large-scale Internet
Diagnosis", NIPS 2007.
Rob Powers, Ira Cohen, and Moises Goldszmidt,
"Short term performance forecasting in enterprise
systems", KDD 2005.
Y.-M. Wang, C. Verbowski, J. Dunagan, Y. Chen, H.
J. Wang, C. Yuan, and Z. Zhang, "STRIDER: A
Black-box, State-based Approach to Change and
Configuration Management and Support", Usenix
LISA 2003.