Title: cs412slides
2D1. Data Mining in Data Stream and Sensor Networks
- Mining data streams
- Characteristics and challenges of data streams
- Stream data cubing
- Stream data clustering
- Stream classification and anomaly analysis
- Debugging sensor network systems by data mining
3Characteristics of Data Streams
- Data streams
- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics of data streams
- Huge volumes of continuous data, possibly infinite
- Fast changing and requires fast, real-time response
- Data stream captures nicely our data processing needs of today
- Random access is expensive: single scan algorithm (can only have one look)
- Store only the summary of the data seen thus far
- Most stream data are low-level or multi-dimensional in nature and need multi-level and multi-dimensional processing
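The single-scan, summary-only constraint can be illustrated with a small sketch. It assumes a numeric stream and uses Welford's running mean/variance update as one possible summary; the class name `RunningSummary` and the sample readings are hypothetical and not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-space summary of a numeric stream (count, mean, variance)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's update: each element is looked at once, then discarded.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


summary = RunningSummary()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up readings
    summary.update(reading)
```

Only three numbers are kept no matter how long the stream runs, which is the point of the "store only the summary" bullet.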
4Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering and industrial processes: power supply and manufacturing
- Sensor, monitoring and surveillance: video streams, RFIDs
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even saved, but random access is too expensive)
5Architecture for Stream Query/Mining Processing
[Figure: architecture diagram with components User/Application, Results, Multiple streams, SDMS (Stream Data Management System), Stream Query Processor, and Scratch Space (main memory and/or disk)]
6Challenges for Mining Dynamics in Data Streams
- Most stream data are low-level or multi-dimensional in nature: needs ML/MD (multi-level, multi-dimensional) processing
- Analysis requirements
- Multi-dimensional trends and unusual patterns
- Capturing important changes at multiple dimensions/levels
- Fast, real-time detection and response
- Comparing with data cube: similarity and differences
- Stream (data) cube or stream OLAP: Is this feasible?
- Can we implement it efficiently?
8Multi-Dimensional Stream Analysis Examples
- Analysis of Web click streams
- Raw data at low levels: seconds, web page addresses, user IP addresses, ...
- Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
- E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
- Analysis of power consumption streams
- Raw data: power consumption flow for every household, every minute
- Patterns one may find: average hourly power consumption for manufacturing companies in Chicago in the last 2 hours today is 30% higher than that of the same day a week ago
9A Stream Cube Architecture
- A tilted time frame
- Different time granularities
- second, minute, quarter, hour, day, week, ...
- Critical layers
- Minimum interest layer (m-layer)
- Observation layer (o-layer)
- User watches at o-layer and occasionally needs to drill down to m-layer
- Partial materialization of stream cubes
- Full materialization: too space and time consuming
- No materialization: slow response at query time
- Partial materialization: what do we mean by "partial"?
10Cube: A Lattice of Cuboids
[Figure: lattice of cuboids, with example cuboids (time, item), (time, item, location), and (time, item, location, supplier)]
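As a rough illustration of the lattice, the sketch below enumerates every cuboid for the four dimensions named on this slide. It assumes no concept hierarchies within a dimension, so n dimensions give 2^n cuboids; the function name `cuboid_lattice` is made up for this example.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group all cuboids (subsets of the dimensions) by number of dimensions."""
    return {
        size: [tuple(c) for c in combinations(dimensions, size)]
        for size in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
total_cuboids = sum(len(level) for level in lattice.values())
```

Level 0 holds the apex cuboid (no dimensions, a single grand total) and level 4 holds the base cuboid (time, item, location, supplier).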
11Time Dimension: A Tilted Time Model
- Natural tilted time frame
- Example: Minimal: quarter, then 4 quarters → 1 hour, 24 hours → day, ...
- Logarithmic tilted time frame
- Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, ...
12A Tilted Time Model (2)
- Pyramidal tilted time frame
- Example: Suppose there are 5 frames and each takes at most 3 snapshots
- Given a snapshot number N, if N mod 2^d = 0, insert into the frame number d. If there are more than 3 snapshots, kick out the oldest one.
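The pyramidal insertion rule can be sketched as follows. Since N mod 2^d = 0 holds for several values of d, this sketch assumes a snapshot goes into the frame with the largest such d, capped at the last frame; that tie-breaking choice and the class name `PyramidalTimeFrame` are assumptions, not something the slide states.

```python
from collections import deque


class PyramidalTimeFrame:
    """Keeps at most `max_snapshots` snapshot numbers in each of `num_frames` frames."""

    def __init__(self, num_frames=5, max_snapshots=3):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.frames = [deque(maxlen=max_snapshots) for _ in range(num_frames)]

    def frame_for(self, n):
        # Largest d with n mod 2^d == 0, capped at the last frame (assumption).
        d = 0
        while d + 1 < len(self.frames) and n % (2 ** (d + 1)) == 0:
            d += 1
        return d

    def insert(self, n):
        self.frames[self.frame_for(n)].append(n)


frame = PyramidalTimeFrame(num_frames=5, max_snapshots=3)
for snapshot_number in range(1, 71):
    frame.insert(snapshot_number)
stored = [list(f) for f in frame.frames]
```

After 70 snapshots only 15 snapshot numbers remain, dense for recent time and sparse for older time.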
13Two Critical Layers in the Stream Cube
14OLAP Operation and Cube Materialization
- OLAP (Online Analytical Processing) operations
- Roll up (drill-up): summarize data
- by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
- from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
- Cube partial materialization
- Store some pre-computed cuboids for fast online processing
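The roll-up and slice operations can be illustrated on a tiny in-memory cube. This is only a sketch: the cells, the city-to-country hierarchy, and the function names are hypothetical, and a real OLAP engine would work over materialized cuboids rather than a Python dict.

```python
from collections import defaultdict

# Hypothetical base cells keyed by (time, item, city), with a count measure.
CELLS = {
    ("Q1", "tv", "Chicago"): 10,
    ("Q1", "tv", "Urbana"): 5,
    ("Q1", "tv", "Toronto"): 7,
    ("Q2", "tv", "Chicago"): 3,
}
CITY_TO_COUNTRY = {"Chicago": "USA", "Urbana": "USA", "Toronto": "Canada"}


def roll_up_location(cells, hierarchy):
    """Roll up by climbing the location hierarchy (city -> country)."""
    out = defaultdict(int)
    for (time, item, city), value in cells.items():
        out[(time, item, hierarchy[city])] += value
    return dict(out)


def roll_up_drop(cells, index):
    """Roll up by dimension reduction: remove the dimension at `index`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:index] + key[index + 1:]] += value
    return dict(out)


def slice_cells(cells, index, wanted):
    """Slice: select the cells whose dimension at `index` equals `wanted`."""
    return {key: value for key, value in cells.items() if key[index] == wanted}


by_country = roll_up_location(CELLS, CITY_TO_COUNTRY)
without_location = roll_up_drop(CELLS, 2)
q1_only = slice_cells(CELLS, 0, "Q1")
```

Drill-down is the reverse direction: going from `by_country` back to `CELLS` requires the finer-grained cells to have been kept.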
16On-Line Partial Materialization vs. OLAP Processing
- On-line materialization
- Materialization takes precious space and time
- Only incremental materialization (with tilted time frame)
- Only materialize cuboids of the critical layers?
- Online computation may take too much time
- Preferred solution
- Popular-path approach: materializing those along the popular drilling paths
- H-tree structure: such cuboids can be computed and stored efficiently using the H-tree structure
- Online aggregation vs. query-based computation
- Online computing while streaming: aggregating stream cubes
- Query-based computation: using computed cuboids
17Stream Cube Structure: From m-layer to o-layer
18An H-Tree Cubing Structure
[Figure: H-tree cubing structure, with the minimal interest layer labeled]
19Benefits of H-Tree and H-Cubing
- H-tree and H-cubing
- Developed for computing data cubes and iceberg cubes
- J. Han, J. Pei, G. Dong, and K. Wang, "Efficient Computation of Iceberg Cubes with Complex Measures", SIGMOD'01
- Fast cubing, space preserving in cube computation
- Using H-tree for stream cubing
- Space preserving
- Intermediate aggregates can be computed incrementally and saved in tree nodes
- Facilitate computing other cells and multi-dimensional analysis
- H-tree with computed cells can be viewed as stream cube
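The idea of saving incrementally computed aggregates in tree nodes can be sketched with a plain prefix tree. This is a simplification for illustration, not the H-tree of the cited paper (it has no header tables or side links); the class names and the sample tuples are hypothetical.

```python
class AggregateNode:
    """One node of the prefix tree, holding running aggregates."""

    def __init__(self):
        self.children = {}
        self.count = 0
        self.total = 0.0


class AggregateTree:
    """Prefix tree over dimension values; each node aggregates its subtree."""

    def __init__(self):
        self.root = AggregateNode()

    def insert(self, path, measure):
        # Update every node on the path, so aggregates stay current per tuple.
        node = self.root
        node.count += 1
        node.total += measure
        for value in path:
            node = node.children.setdefault(value, AggregateNode())
            node.count += 1
            node.total += measure

    def lookup(self, prefix):
        # Return (count, sum) for the cell identified by a path prefix.
        node = self.root
        for value in prefix:
            node = node.children.get(value)
            if node is None:
                return (0, 0.0)
        return (node.count, node.total)


tree = AggregateTree()
tree.insert(("USA", "Chicago", "sports"), 3.0)
tree.insert(("USA", "Chicago", "news"), 2.0)
tree.insert(("USA", "Urbana", "sports"), 1.0)
```

Each arriving tuple touches one root-to-leaf path, so cells along that path are maintained in a single pass.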
21 Stream Clustering: A K-Median Approach
- O'Callaghan et al., "Streaming-Data Algorithms for High-Quality Clustering" (ICDE'02)
- Based on the k-median method
- Data stream points from metric space
- Find k clusters in the stream s.t. the sum of distances from data points to their closest center is minimized
- Constant factor approximation algorithm
- In small space, a simple two-step algorithm
- For each set of M records, S_i, find O(k) centers in S_1, ..., S_l
- Local clustering: assign each point in S_i to its closest center
- Let S' be the centers for S_1, ..., S_l, with each center weighted by the number of points assigned to it
- Cluster S' to find k centers
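A minimal sketch of the two-step scheme on 1-D data follows. The local clustering routine is a simple Lloyd-style weighted k-median heuristic swapped in for illustration; it is not the constant-factor approximation algorithm of the cited paper, and it keeps exactly k centers per chunk rather than O(k). All function names and the sample stream are hypothetical.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def nearest_index(point, centers):
    return min(range(len(centers)), key=lambda j: abs(point - centers[j]))


def k_median_1d(points, weights, k, iterations=20):
    """Heuristic weighted k-median on 1-D points; returns (centers, weights)."""
    distinct = sorted(set(points))
    k = min(k, len(distinct))
    # Deterministic start: k evenly spread values from the sorted points.
    centers = [distinct[(2 * i + 1) * len(distinct) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p, w in zip(points, weights):
            groups[nearest_index(p, centers)].append((p, w))
        centers = [
            weighted_median([p for p, _ in g], [w for _, w in g]) if g else c
            for g, c in zip(groups, centers)
        ]
    totals = [0.0] * len(centers)
    for p, w in zip(points, weights):
        totals[nearest_index(p, centers)] += w
    return centers, totals


def stream_k_median(stream, k, chunk_size):
    """Step 1: cluster each chunk; step 2: cluster the weighted chunk centers."""
    kept_centers, kept_weights, chunk = [], [], []
    for x in stream:
        chunk.append(x)
        if len(chunk) == chunk_size:
            centers, totals = k_median_1d(chunk, [1.0] * len(chunk), k)
            kept_centers += centers
            kept_weights += totals
            chunk = []
    if chunk:  # leftover partial chunk
        centers, totals = k_median_1d(chunk, [1.0] * len(chunk), k)
        kept_centers += centers
        kept_weights += totals
    final_centers, _ = k_median_1d(kept_centers, kept_weights, k)
    return sorted(final_centers)


result = stream_k_median([1, 2, 3, 101, 102, 103, 2, 3, 4, 100, 101, 102], k=2, chunk_size=6)
```

Only the weighted centers of past chunks are retained, so memory depends on the number of chunks and k, not on the stream length.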
22 Hierarchical Clustering Tree
[Figure: hierarchical clustering tree with three labeled levels: data points, level-i medians, and level-(i+1) medians]
23Hierarchical Tree and Drawbacks
- Method
- Maintain at most m level-i medians
- On seeing m of them, generate O(k) level-(i+1) medians of weight equal to the sum of the weights of the intermediate medians assigned to them
- Drawbacks
- Low quality for evolving data streams (register only k centers)
- Limited functionality in discovering and exploring clusters over different portions of the stream over time
24Clustering for Mining Stream Dynamics
- Network intrusion detection: one example
- Detect bursts of activities or abrupt changes in real time by on-line clustering
- Our methodology (C. Aggarwal, J. Han, J. Wang, P.S. Yu, VLDB'03)
- Tilted time framework: otherwise dynamic changes cannot be found
- Micro-clustering: better quality than k-means/k-median
- (incremental, online processing and maintenance)
- Two stages: micro-clustering and macro-clustering
- With limited overhead to achieve high efficiency, scalability, quality of results and power of evolution/change detection
25CluStream: A Framework for Clustering Evolving Data Streams
- Design goal
- High quality for clustering evolving data streams with greater functionality
- While keeping the stream mining requirements in mind
- One pass over the original stream data
- Limited space usage and high efficiency
- CluStream: A framework for clustering evolving data streams
- Divide the clustering process into online and offline components
- Online component: periodically stores summary statistics about the stream data
- Offline component: answers various user questions based on the stored summary statistics
26BIRCH: A Micro-Clustering Approach
Clustering Feature: CF = (N, LS, SS), where N is the number of data points, $LS = \sum_{i=1}^{N} X_i$, and $SS = \sum_{i=1}^{N} X_i^2$
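A small sketch of the clustering feature shows why it suits incremental processing: two CFs merge by component-wise addition, and centroid and radius follow from the three components alone. Here SS is stored as a single scalar (the sum of squared norms), which is one common convention and an assumption of this sketch; the class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    n: int        # N: number of points
    ls: tuple     # LS: per-dimension linear sum
    ss: float     # SS: sum of squared norms (scalar convention, assumed)

    @classmethod
    def from_point(cls, point):
        return cls(1, tuple(point), sum(v * v for v in point))

    def merge(self, other):
        # CF additivity: summaries combine without revisiting the raw points.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        # Root-mean-square distance of the points from the centroid.
        centroid_sq = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - centroid_sq, 0.0))


cf = ClusteringFeature.from_point((1.0, 1.0)).merge(ClusteringFeature.from_point((3.0, 3.0)))
```

For the two sample points the merged feature is (2, (4, 4), 20), giving centroid (2, 2).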
27The CluStream Framework
- Micro-cluster
- Statistical information about data locality
- Temporal extension of the cluster-feature vector
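- Multi-dimensional points $X_1, \ldots, X_k, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$
- Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$
- A micro-cluster for n points is defined as a (2·d + 3) tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$
- Pyramidal time frame
- Decide at what moments the snapshots of the statistical information are stored away on disk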
28CluStream: Pyramidal Time Frame
- Pyramidal time frame
- Snapshots of a set of micro-clusters are stored following the pyramidal pattern
- They are stored at differing levels of granularity depending on the recency
- Snapshots are classified into different orders varying from 1 to log(T)
- The i-th order snapshots occur at intervals of α^i, where α ≥ 1
- Only the last (α + 1) snapshots are stored
29CluStream: Clustering On-line Streams
- Online micro-cluster maintenance
- Initial creation of q micro-clusters
- q is usually significantly larger than the number of natural clusters
- Online incremental update of micro-clusters
- If new point is within max-boundary, insert into the micro-cluster
- Otherwise, create a new cluster
- May delete obsolete micro-cluster or merge two closest ones
- Query-based macro-clustering
- Based on a user-specified time-horizon h and the number of macro-clusters k, compute macro-clusters using the k-means algorithm
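The online maintenance loop can be sketched as below. Several simplifications are assumptions of this sketch rather than the CluStream paper's rules: the max-boundary is a fixed factor times the RMS deviation, singleton clusters use a fixed `default_radius`, and when the budget q is exceeded the two closest micro-clusters are always merged (obsolete-cluster deletion is omitted). It needs Python 3.8+ for `math.dist`.

```python
import math


class MicroCluster:
    """Temporal cluster feature: (CF2x, CF1x, CF2t, CF1t, n)."""

    def __init__(self, point, t):
        self.cf1x = list(point)
        self.cf2x = [v * v for v in point]
        self.cf1t = t
        self.cf2t = t * t
        self.n = 1

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_deviation(self):
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s1, s2 in zip(self.cf1x, self.cf2x))
        return math.sqrt(max(var, 0.0))

    def absorb(self, point, t):
        for i, v in enumerate(point):
            self.cf1x[i] += v
            self.cf2x[i] += v * v
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def merge(self, other):
        # Additivity: every component of the tuple is a plain sum.
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t
        self.n += other.n


class OnlineMicroClustering:
    def __init__(self, q, boundary_factor=2.0, default_radius=1.0):
        self.q = q
        self.boundary_factor = boundary_factor
        self.default_radius = default_radius
        self.clusters = []

    def insert(self, point, t):
        if self.clusters:
            nearest = min(self.clusters,
                          key=lambda c: math.dist(c.centroid(), point))
            radius = nearest.rms_deviation() if nearest.n > 1 else self.default_radius
            if math.dist(nearest.centroid(), point) <= self.boundary_factor * radius:
                nearest.absorb(point, t)
                return
        # Outside the max-boundary: start a new micro-cluster.
        self.clusters.append(MicroCluster(point, t))
        if len(self.clusters) > self.q:
            self._merge_closest_pair()

    def _merge_closest_pair(self):
        pairs = [(i, j) for i in range(len(self.clusters))
                 for j in range(i + 1, len(self.clusters))]
        i, j = min(pairs, key=lambda p: math.dist(self.clusters[p[0]].centroid(),
                                                   self.clusters[p[1]].centroid()))
        self.clusters[i].merge(self.clusters[j])
        del self.clusters[j]


model = OnlineMicroClustering(q=2)
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (30.0, 30.0)]
for t, point in enumerate(points, start=1):
    model.insert(point, t)
```

The offline macro-clustering step would then run k-means over the micro-cluster centroids for the requested time horizon.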
31Stream Classification and Concept Drifts
- Stream Classification
- Construct a classification model based on past records
- Use the model to predict labels for new data
- Help decision making
- Concept drifts
- Define and analyze concept drifts in data streams
- Show that expected error is not directly related to concept drifts
- Classify data stream with skewed distribution (i.e., rare events)
- Employ both sampling and ensemble techniques
- Results indicate the proposed method reduces classification errors on the minority class
32Concept Drifts
- Changes in P(x, y), where x is the feature vector, y is the class label, and P(x, y) = P(y|x) P(x)
- Four possibilities
- No change: P(y|x) and P(x) remain unchanged
- Feature change: only P(x) changes
- Conditional change: only P(y|x) changes
- Dual change: both P(y|x) and P(x) change
- Expected error
- No matter how the concept changes, the expected error could increase, decrease, or remain unchanged
- Training on the most recent data could help reduce expected error
33Issues in stream classification
- Descriptive model vs. generative model
- Generative models assume data follows some distribution while descriptive models make no assumptions
- Distribution of stream data is unknown and may evolve, so descriptive model is better
- Label prediction vs. probability estimation
- Classify test examples into one class or estimate P(y|x) for each y
- Probability estimation is better
- Stream applications may be stochastic (an example could be assigned to several classes with different probabilities)
- Probability estimates provide confidence information and could be used in post processing
34Mining Skewed Data Stream
- Skewed distribution
- Seen in many stream applications where positive examples are much rarer than negative ones
- Credit card fraud detection, network intrusion detection
- Existing stream classification methods
- Evaluate their methods on data with balanced class distribution
- Problems of these methods on skewed data
- Tend to ignore positive examples due to the small number
- The cost of misclassifying positive examples is usually huge, e.g., misclassifying credit card fraud as normal
35Stream Ensemble Approach (1)
[Figure: stream chunks S1, S2, ..., Sm, the incoming unlabeled chunk Sm+1, and a classification model]
Sm as training data? Positive examples not sufficient!
36Stream Ensemble Approach (2)
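[Figure: sampling from chunks S1, S2, ..., Sm to build classifiers C1, C2, ..., Ck, which form an ensemble for the incoming unlabeled chunk]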
37Analysis
- Error Reduction
- Sampling
- Ensemble
- Efficiency Analysis
- Single model
- Ensemble
- Ensemble is more efficient
38Experiments(1)
- Test on concept-drift streams
39Experiments(2)
40Experiments(3)
41Experiments (4)
42Summary: Stream Data Mining
- Stream data mining: A rich and on-going research field
- Current research focus in database community
- DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
- Powerful tools for finding general and unusual patterns
- Effectiveness, efficiency and scalability: lots of open problems
- Our philosophy on stream data analysis and mining
- A multi-dimensional stream analysis framework
- Time is a special dimension: tilted time frame
- What to compute and what to save? Critical layers
- Partial materialization and precomputation
- Mining dynamics of stream data
43References on Stream Mining
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, "Multi-Dimensional Regression Analysis of Time-Series Data Streams", Proc. 2002 Int. Conf. on Very Large Data Bases (VLDB'02), Hong Kong, China, Aug. 2002.
- C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, "Mining Frequent Patterns in Data Streams at Multiple Time Granularities", H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), Next Generation Data Mining, 2003.
- H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept-Drifting Data Streams using Ensemble Classifiers", Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'03), Washington, D.C., Aug. 2003.
- C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A Framework for Clustering Evolving Data Streams", Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03), Berlin, Germany, Sept. 2003.
- Y. D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, and L. Auvil, "MAIDS: Mining Alarming Incidents from Data Streams" (system demonstration), Proc. 2004 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'04), Paris, France, June 2004.
- C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "On Demand Classification of Data Streams", Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, Aug. 2004.
- C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A Framework for Projected Clustering of High Dimensional Data Streams", Proc. 2004 Int. Conf. on Very Large Data Bases (VLDB'04), Toronto, Canada, Aug. 2004.
- Jiawei Han, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, Jianyong Wang, and Y. Dora Cai, "Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams", Distributed and Parallel Databases, 18(2): 173-197, 2005.
- J. Yang, X. Yan, J. Han, and W. Wang, "Discovering Evolutionary Classifier over High Speed Non-Static Stream", in S. Bandyopadhyay et al. (eds.), Advanced Methods for Knowledge Discovery from Complex Data, Springer Verlag, 2005.
- Hongyan Liu, Ying Lu, Jiawei Han, and Jun He, "Error-Adaptive and Time-Aware Maintenance of Frequency Counts over Data Streams", in Proc. 2006 Int. Conf. on Web-Age Information Management (WAIM'06), Hong Kong, China, June 2006.
- Jing Gao, Wei Fan, and Jiawei Han, "A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions", in Proc. 2007 SIAM Int. Conf. on Data Mining (SDM'07), Minneapolis, MN, April 2007.
- Jing Gao, Wei Fan, and Jiawei Han, "On Appropriate Assumptions to Mine Data Streams: Analysis and Practice", Proc. 2007 Int. Conf. on Data Mining (ICDM'07), Omaha, NE, Oct. 2007.
44D1. Data Mining in Data Stream and Sensor Networks
- Mining data streams
- Debugging sensor network systems by data mining
- Software bug mining
- Bug mining for sensor networks
45Data Mining for Software Engineering and Computer System Analysis
- Software bug localization and failure proximity
- C. Liu, Z. Lian, and J. Han, "How Bayesians Debug", ICDM'06.
- Chao Liu and Jiawei Han, "Failure Proximity: A Fault Localization-Based Approach", FSE'06.
- C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff, "SOBER: Statistical Model-based Bug Localization", FSE 2005.
- Detection of software plagiarism
- C. Liu, C. Chen, J. Han, and P.S. Yu, "GPLAG: Detection of Software Plagiarism by Procedure Dependency Graph Analysis", KDD'06.
- Mining copy-paste bugs in operating systems
- Z. Li, S. Lu, S. Myagmar and Y. Zhou, "CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code", OSDI'04.
46SOBER: Bug Localization based on Classification of Statistical Distribution of Statement Execution
[Figure: failing vs. passing executions]
- C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff, "SOBER: Statistical Model-based Bug Localization", FSE 2005.
47A Comparison with Other Approaches
48Failure Clustering Based on Likely Fault Roots
- Failure indexing
- Identify failures likely due to the same bug
- C. Liu and J. Han, "Failure Proximity: A Fault Localization-Based Approach", FSE'06.
[Figure: plot with axes X and Y]
50Challenges in Developing Robust Sensor Network Systems
- Developing robust sensor network systems is tricky and frustrating
- Bugs in networked sensor systems are often caused by complex interactions between multiple, often individually non-faulty components
- Bugs are often not repeatable: the particular sequence of events that invokes a bug may not be easy to reconstruct
- Current status: Most of the development time is spent debugging and troubleshooting the current code → greatly reduces productivity
51DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks
- DustMiner: Mine sequences of events that may be responsible for faulty behavior, as opposed to localized bugs in one module
- M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han, "DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks", Proc. 2008 ACM Int. Conf. on Embedded Networked Sensor Systems (SenSys'08), Raleigh, NC, Nov. 2008
- Architecture
- Front-end: collects runtime logs from the system being debugged
- Offline back-end: frequent, discriminative pattern mining to uncover likely causes of failure
52Major Difficulties of SNTS
- Our (Tarek Abdelzaher et al.'s) previous sensor network debugging system, SNTS (DCOSS, 2007), extracts conditions on current network state correlated with failure
- It mines frequent patterns (those that occur frequently when the bugs manifest); however, the cause of a problem is often an infrequent pattern
- DustMiner: Automated discriminative sequence mining, containing two phases
- Identifies frequent patterns that correlate to failures, as before
- Focuses on those patterns, correlating them with (infrequent) events that may have caused them, hence uncovering the true root of the problem
53Architecture of DustMiner
54Preventing False Freq. Patterns Using Dynamic Search Window
- Two sample sequences, with different behaviors
- S1: <a, b, c, d, a, b, c, d>
- S2: <a, b, c, d, a, c, b, d>
- The system fails when <a> is followed by <c> before <b>
- How to detect <a, c, b> as a discriminative pattern?
- Using Apriori will not be able to detect it
- Solution: using dynamic search window scheme
- Suppose the search window is [1, 4], [4, 8] in both sequences
- Then <a, c, b> will only be found in sequence S2
- The dynamic search window scheme will also speed up the search significantly
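The effect of a search window on the two sample sequences can be sketched as follows. For simplicity the sketch uses fixed, non-overlapping windows of four events, which is an assumption; the actual scheme chooses windows dynamically. The function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the events of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def split_windows(sequence, size):
    """Cut a sequence into consecutive, non-overlapping windows."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def found_within_window(pattern, sequence, size):
    """True if the pattern occurs entirely inside at least one window."""
    return any(is_subsequence(pattern, w) for w in split_windows(sequence, size))


S1 = list("abcdabcd")
S2 = list("abcdacbd")
PATTERN = list("acb")

unbounded = (is_subsequence(PATTERN, S1), is_subsequence(PATTERN, S2))
windowed = (found_within_window(PATTERN, S1, 4), found_within_window(PATTERN, S2, 4))
```

Without a window, <a, c, b> matches both sequences because it can stitch together events from two separate rounds in S1, so it does not discriminate. With the window it matches only S2.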
55Suppressing Redundant Subsequences
- Two sample sequences, with different behaviors
- S1: <a, b, c, d, a, b, c, d>
- S2: <a, b, c, d, a, b, d, c>
- The system has to have <a> followed by <c> before <d>
- E.g., <enableRadio>, <messageSent>, <ackR>, <disableRadio>
- How to detect <a, d, c> as an error pattern?
- Using Apriori will not be able to detect it
- Solution: using sequence compression scheme
- Remove sequence Si if it is a subsequence of Sj with same support
- <a, b, d> will be removed from S1 but retained in S2
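The compression rule (drop a pattern when a longer pattern with the same support contains it) can be sketched like this. The support values below are assigned by hand for illustration, counting one occurrence per four-event round; they are assumptions, not output of the actual tool.

```python
def is_subsequence(pattern, sequence):
    """True if the events of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def compress(patterns):
    """Keep a pattern unless a different pattern with equal support contains it."""
    kept = {}
    for pattern, support in patterns.items():
        redundant = any(
            other != pattern
            and other_support == support
            and is_subsequence(pattern, other)
            for other, other_support in patterns.items()
        )
        if not redundant:
            kept[pattern] = support
    return kept


# Hypothetical supports: S1 repeats <a, b, c, d> twice; S2 has one good round
# and one round where <d> comes before <c>.
patterns_s1 = {("a", "b", "c", "d"): 2, ("a", "b", "d"): 2}
patterns_s2 = {("a", "b", "c", "d"): 1, ("a", "b", "d"): 2, ("a", "b", "d", "c"): 1}

kept_s1 = compress(patterns_s1)
kept_s2 = compress(patterns_s2)
```

In S1, <a, b, d> carries no information beyond <a, b, c, d> and is dropped. In S2 its support differs from every pattern that contains it, so it survives and can be contrasted with S1.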
56Two-Stage Mining for Infrequent Events
- In debugging, sometimes less frequent patterns could be more indicative
- E.g., a single node reboot event can cause a large number of message losses
- Freq. pattern (FP) mining will miss the real cause!
- Observation: Much computation in sensor networks is recurrent
- Two-stage mining
- Catch such recurrent symptoms (such as multiple subsequent message losses or false alarms) by FP mining
- Narrow down the search space and correlate these symptoms with other less frequent preceding event occurrences
57Experiment I: LiteOS Bug
- Troubleshoot a simple data collection application where several sensors monitor light and report it to a sink node
- Discriminative patterns found only on good logs
- Discriminative patterns found only on bad logs
58References on Trouble-Shooting in Software and Networked Sensor Systems
- M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han, "DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks", Proc. 2008 ACM Int. Conf. on Embedded Networked Sensor Systems (SenSys'08), Raleigh, NC, Nov. 2008
- M. K. Ramanathan, A. Grama, and S. Jagannathan, "Path-Sensitive Inference of Function Precedence Protocols", ICSE 2007
- Z. Li and Y. Zhou, "PR-Miner: Automatically Extracting Implicit Programming Rules and Detecting Violations in Large Software Code", ESEC/FSE 2005
- B. Livshits and T. Zimmermann, "DynaMine: Finding Common Error Patterns by Mining Software Revision Histories", ESEC/FSE 2005
- D. Andrzejewski, A. Mulhern, B. Liblit, and X. Zhu, "Statistical Debugging Using Latent Topic Models", ECML 2007
- B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, "Scalable Statistical Bug Isolation", PLDI 2005
- C. Liu, L. Fei, X. Yan, J. Han and S. Midkiff, "Statistical Debugging: A Hypothesis Testing-Based Approach", IEEE TSE 2006
- C. Liu, Z. Lian and J. Han, "How Bayesians Debug", ICDM 2006
- C. Liu, X. Yan and J. Han, "Mining Control Flow Abnormality for Logic Error Isolation", SDM 2006
- C. Liu, X. Yan, L. Fei, J. Han and S. Midkiff, "SOBER: Statistical Model-Based Bug Localization", ESEC/FSE 2005
- C. Liu, X. Yan, H. Yu, J. Han and P. S. Yu, "Mining Behavior Graphs for 'Backtrace' of Noncrashing Bugs", SDM 2005
- A. X. Zheng, M. I. Jordan, B. Liblit, M. Naik, and A. Aiken, "Statistical Debugging: Simultaneous Identification of Multiple Bugs", ICML 2006
- I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, "Correlating instrumentation data to system states: A building block for automated diagnosis and control", OSDI 2004.
- J. Platt, E. Kiciman and D. Maltz, "Fast Variational Inference for Large-scale Internet Diagnosis", NIPS 2007.
- Rob Powers, Ira Cohen, and Moises Goldszmidt, "Short term performance forecasting in enterprise systems", KDD 2005.
- Y.-M. Wang, C. Verbowski, J. Dunagan, Y. Chen, H. J. Wang, C. Yuan, and Z. Zhang, "STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support", Usenix LISA 2003.