Title: Data Stream Management Systems Checkpoint
1. Data Stream Management Systems: Checkpoint
- CS240B Notes
- by
- Carlo Zaniolo
- UCLA CSD
- With slides from a KDD04 tutorial by
- Haixun Wang, Jian Pei, and Philip Yu
2. Mining Data Streams: Challenges
- On-line response (NB), limited memory, most recent windows only
- Fast and Light algorithms are needed
  - Must minimize usage of memory and CPU
  - Require only one (or a few) passes through the data
- Concept shift/drift: the statistics of the mined data set change
  - This renders previously learned models inaccurate or invalid
  - Robustness and adaptability: quickly recover/adjust after concept changes
- Popular machine learning algorithms are no longer effective
  - Neural nets: slow learners requiring many passes
  - Support Vector Machines (SVM): computationally expensive
  - Apriori: many passes and expensive (association rule mining is difficult on data streams)
3. The Decision Tree Classifier
- Learning (Training)
  - Input: a data set of pairs (a, b), where a is a vector and b a class label
  - Output: a model (decision tree)
- Testing
  - Input: a test sample (x, ?)
  - Output: a class label prediction for x (a minimal example follows)
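For concreteness, here is a minimal batch train/test sketch (not from the original slides) using scikit-learn's DecisionTreeClassifier as a stand-in for a batch learner such as C4.5:

    # Minimal sketch: batch training and testing of a decision tree.
    from sklearn.tree import DecisionTreeClassifier

    # Training: a data set of (a, b) pairs, where a is a vector, b a class label.
    X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y_train = [0, 1, 1, 0]

    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)      # Learning: (a, b) pairs -> model (tree)

    # Testing: a sample (x, ?) -> a class label prediction for x.
    print(clf.predict([[1, 0]]))   # -> [1]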
4. Decision Tree Classifiers
- A divide-and-conquer approach
- Simple algorithm, intuitive model
- Typically a decision tree grows one level for each scan of the data, so multiple scans are required
- But if we can use small samples, this problem disappears
- However, the tree structure is not stable
  - Subtle changes in the data can cause global changes in the tree structure
5. Stable Trees Using Samples
- How many samples do we need to build, in constant time, a tree that is nearly identical to the one a batch learner (C4.5, Sprint, ...) would build?
- Nearly identical?
  - Categorical attributes: with high probability, the attribute we choose for the split is the same attribute a batch learner would choose -> an identical decision tree
  - Continuous attributes: discretize them into categorical ones
- ... Forget concept changes for now
6. Hoeffding Trees
- The Hoeffding bound is applied to the information gain
  - The error decreases as n (the number of samples) increases
  - At each node, we accumulate enough samples (n) before we make a split (see the sketch after this list)
- Scales better than traditional decision-tree algorithms
  - Incremental: nodes are created incrementally as new samples stream in
  - Sub-linear with sampling
  - Small memory requirement
- Cons
  - Only the top 2 attributes are considered
  - Tie breaking takes time
  - Growing a deep tree takes time
  - Discrete attributes only
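To make the split rule concrete: for n samples of a variable with range R (for information gain with c classes, R = log2(c)), the Hoeffding bound guarantees that, with probability 1 - delta, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean. A sketch of the resulting split test (my own; the tie-breaking threshold value is illustrative):

    import math

    def hoeffding_bound(value_range, delta, n):
        # epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
        # the observed mean of n samples is within epsilon of the true mean.
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(gains, value_range, delta, n, tie_threshold=0.05):
        # gains: observed information gain of each attribute at this node.
        best, second = sorted(gains.values(), reverse=True)[:2]
        eps = hoeffding_bound(value_range, delta, n)
        # Split when the best attribute beats the runner-up by more than
        # epsilon, or when epsilon is small enough to call it a tie.
        return (best - second > eps) or (eps < tie_threshold)

    # Example: 2 classes -> R = log2(2) = 1.0
    print(hoeffding_bound(1.0, 1e-6, 1000))                        # ~0.083
    print(should_split({"A": 0.30, "B": 0.18}, 1.0, 1e-6, 1000))   # True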
7. VFDT
- Very Fast Decision Tree [Domingos, Hulten 2000]
- Several improvements: faster and less memory
- Concept changes? A naïve approach:
  - Place a sliding window on the stream
  - Reapply C4.5 or VFDT whenever the window moves
  - Time consuming!
8. CVFDT
- Concept-adapting VFDT [Hulten, Spencer, Domingos, 2001]
- Goal
  - Classifying concept-drifting data streams
- Approach (sketched after this list)
  - Make use of the Hoeffding bound
  - Incorporate windowing
  - Monitor changes in the information gain of the attributes
  - If a change reaches the threshold, generate an alternate subtree with the new best attribute, but keep it in the background
  - Replace the old subtree if the new one becomes more accurate
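A highly simplified sketch of the alternate-subtree idea (my own; the threshold value and bookkeeping are illustrative, and the real CVFDT maintains full per-node statistics over the window):

    class Node:
        def __init__(self, split_attr):
            self.split_attr = split_attr
            self.alternate = None     # background subtree, if any
            self.acc = 0.0            # recent accuracy of the installed subtree
            self.alt_acc = 0.0        # recent accuracy of the alternate subtree

    def monitor(node, gains, threshold=0.05):
        # gains: information gain per attribute over the current window.
        best_attr = max(gains, key=gains.get)
        if node.alternate is None and gains[best_attr] - gains[node.split_attr] > threshold:
            # Change detected: grow an alternate subtree on the new best
            # attribute, but keep classifying with the old one for now.
            node.alternate = Node(best_attr)
        elif node.alternate is not None and node.alt_acc > node.acc:
            # Replace the installed subtree once the alternate is more accurate.
            node.split_attr, node.alternate = node.alternate.split_attr, None

    root = Node("A")
    monitor(root, {"A": 0.10, "B": 0.22})   # drift: alternate subtree on "B" starts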
9. Classifiers for Data Streams
- Fast and Light Classifiers
  - Naïve Bayesian: one pass to count occurrences (see the sketch after this list)
  - Sliding windows: tumbles and slides
  - Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
- Ensembles of Classifiers (decision trees or others)
  - Bagging ensembles
  - Boosting ensembles
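To illustrate the "one pass to count occurrences" point, a sketch (not the slides' code; the add-one smoothing and the binary-value denominator are illustrative assumptions) of a streaming naïve Bayesian classifier whose update cost is a few O(1) counter increments per tuple:

    from collections import defaultdict

    class StreamingNaiveBayes:
        def __init__(self):
            self.class_counts = defaultdict(int)
            self.feature_counts = defaultdict(int)   # keyed by (class, attr, value)
            self.n = 0

        def update(self, x, label):
            # One pass over the stream: each tuple only increments counters.
            self.n += 1
            self.class_counts[label] += 1
            for i, v in enumerate(x):
                self.feature_counts[(label, i, v)] += 1

        def predict(self, x):
            def score(c):
                # Unnormalized P(c) * prod_i P(x_i | c), add-one smoothing;
                # the +2 denominator assumes binary attribute values.
                p = self.class_counts[c] / self.n
                for i, v in enumerate(x):
                    p *= (self.feature_counts[(c, i, v)] + 1) / (self.class_counts[c] + 2)
                return p
            return max(self.class_counts, key=score)

    nb = StreamingNaiveBayes()
    nb.update([1, 0], "spam"); nb.update([0, 1], "ham")
    print(nb.predict([1, 0]))    # -> "spam"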
10. Basic Ideas
- The stream is partitioned into sequential chunks
- A classifier is trained from each chunk
- The accuracy of voting ensembles is normally better than that of a single classifier (see the sketch after this list)
- Method 1: Bagging
  - Weighted voting: weights are assigned to classifiers based on their recent performance on the current test examples
  - Only the top K classifiers are used
- Method 2: Boosting
  - Majority voting
  - Classifiers are retired by age
  - Boosting is used in training
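A minimal sketch of the chunk-trained, weighted-voting ensemble (illustrative: the tree depth, K, and the use of accuracy on the newest chunk as the weight are my assumptions; [KDD04] derives weights from estimated errors):

    from sklearn.tree import DecisionTreeClassifier

    def train_on_chunks(chunks, k=5):
        # chunks: iterable of (X, y) blocks taken from the stream in order.
        ensemble = []                                  # list of (classifier, weight)
        for X, y in chunks:
            clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
            # Re-weight every member by its accuracy on the newest chunk,
            # a proxy for performance on the current test examples.
            ensemble = [(c, c.score(X, y)) for c, _ in ensemble] + [(clf, clf.score(X, y))]
            ensemble = sorted(ensemble, key=lambda cw: cw[1], reverse=True)[:k]
        return ensemble

    def predict(ensemble, x):
        votes = {}
        for clf, w in ensemble:
            label = clf.predict([x])[0]
            votes[label] = votes.get(label, 0.0) + w   # weighted voting
        return max(votes, key=votes.get)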
11. Bagging Ensemble Method
12. Mining Streams with Concept Changes
- Changes are detected by a drop in accuracy or by other methods
- Build new classifiers on new windows
- Search among the old classifiers for those that have now become accurate again
13. Boosting Ensembles for Adaptive Mining of Data Streams
- Andrea Fang Chu, Carlo Zaniolo
- PAKDD 2004
14. Mining Data Streams: Desiderata
- Fast learning (preferably in one pass over the data)
- Light requirements (low time complexity, low memory usage)
- Adaptation (the model always reflects the time-changing concept)
15. Adaptive Boosting Ensembles
- The training stream is split into blocks (i.e., windows)
- Each individual classifier is learned from one block
- A boosting ensemble of (7-19) members is maintained over time
- Decisions are taken by simple majority voting
- As the (N+1)th classifier is built, boost the weights of the tuples misclassified by the first N classifiers (see the sketch after this list)
- Change detection is exploited to achieve adaptation
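A sketch of the weight-boosting step (my own; integer class labels and the 2x up-weighting factor are illustrative assumptions, and the paper's re-weighting scheme is more refined):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost_on_block(ensemble, X, y, max_members=10):
        # y: class labels as non-negative integers (needed by bincount).
        if ensemble:
            votes = np.array([clf.predict(X) for clf in ensemble])
            majority = np.apply_along_axis(
                lambda col: np.bincount(col).argmax(), 0, votes)
            # Boost the weight of tuples the first N classifiers got wrong.
            weights = np.where(majority != y, 2.0, 1.0)
        else:
            weights = np.ones(len(y))
        clf = DecisionTreeClassifier(max_depth=2)      # shallow weak learner
        clf.fit(X, y, sample_weight=weights)
        ensemble.append(clf)
        return ensemble[-max_members:]                 # retire members by age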
16. Fast and Light
- Experiments show that boosting ensembles of weak learners provide accurate prediction
- Weak learners
  - An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
  - Trained on a small set of examples (this means light memory requirements!)
17. Adaptation
- Detect changes that cause significant drops in ensemble performance
  - Gradual changes: concept drift
  - Abrupt changes: concept shift
18. Adaptability
- The error rate is viewed as a random variable
- When it deviates significantly from its recent average (i.e., accuracy drops), the whole ensemble is dropped
- And a new one is quickly re-learned (a sketch follows)
- The cost/performance of boosting ensembles is better than that of bagging ensembles [KDD04]
- BUT ???
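One way to make the "significant drop" test concrete (a sketch; the k = 3 standard-deviation rule is an illustrative assumption, while the paper treats the error rate as a random variable more formally):

    import statistics

    def change_detected(recent_errors, new_error, k=3.0):
        # recent_errors: per-block error rates of the current ensemble.
        # Flag a concept change when the new block's error exceeds the
        # recent average by more than k standard deviations.
        mu = statistics.mean(recent_errors)
        sigma = statistics.pstdev(recent_errors) or 1e-9
        return new_error > mu + k * sigma

    errors = [0.10, 0.12, 0.11, 0.09, 0.10]
    print(change_detected(errors, 0.13))   # False: normal fluctuation
    print(change_detected(errors, 0.35))   # True: drop ensemble, relearn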
19. References
- Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept-Drifting Data Streams using Ensemble Classifiers. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003.
- Pedro Domingos, Geoff Hulten. Mining High-Speed Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.
- Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2001.
- Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams. SIAM International Conference on Data Mining (SDM), 2004.
- Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data Streams. 4th IEEE International Conference on Data Mining (ICDM), 2004.
- Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams. PAKDD 2004, 282-292.
- Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. ECML/PKDD 2005, Porto, Portugal, October 3-7, 2005.