Title: Data Stream Management Systems Checkpoint
1. Data Stream Management Systems: Checkpoint
- CS240B Notes
- by
- Carlo Zaniolo
- UCLA CSD
- With slides from a KDD04 tutorial by
- Haixun Wang, Jian Pei, and Philip Yu
2. Mining Data Streams: Challenges
- On-line response (NB), limited memory, most recent windows only
- Fast and Light algorithms are needed
  - Must minimize usage of memory and CPU
  - Require only one (or a few) passes through the data
- Concept shift/drift: the statistics of the mined data set change
  - This renders previously learned models inaccurate or invalid
  - Robustness and adaptability: quickly recover/adjust after concept changes
- Popular machine learning algorithms are no longer effective
  - Neural nets: slow learners requiring many passes
  - Support Vector Machines (SVM): computationally expensive
  - Apriori: many passes and expensive (association rule mining is difficult on data streams)
3. The Decision Tree Classifier
- Learning (Training)
  - Input: a data set of pairs (a, b), where a is a vector and b a class label
  - Output: a model (decision tree)
- Testing
  - Input: a test sample (x, ?)
  - Output: a class label prediction for x (a minimal example follows)
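For concreteness, here is a minimal batch train/test sketch (not from the original slides) using scikit-learn's DecisionTreeClassifier as a stand-in for a batch learner such as C4.5:

    # Minimal sketch: batch training and testing of a decision tree.
    from sklearn.tree import DecisionTreeClassifier

    # Training: a data set of (a, b) pairs, where a is a vector, b a class label.
    X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y_train = [0, 1, 1, 0]

    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)      # Learning: (a, b) pairs -> model (tree)

    # Testing: a sample (x, ?) -> a class label prediction for x.
    print(clf.predict([[1, 0]]))   # -> [1]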
4. Decision Tree Classifiers
- A divide-and-conquer approach
- Simple algorithm, intuitive model
- Typically a decision tree grows one level for each scan of the data, so multiple scans are required
- But if we can use small samples, this problem disappears
- However, the tree structure is not stable
  - Subtle changes in the data can cause global changes in the tree structure
5. Stable Trees Using Samples
- How many samples do we need to build, in constant time, a tree that is nearly identical to the one a batch learner (C4.5, Sprint, ...) would build?
- Nearly identical?
  - Categorical attributes: with high probability, the attribute we choose for the split is the same attribute a batch learner would choose -> an identical decision tree
  - Continuous attributes: discretize them into categorical ones
- ... Forget concept changes for now
6. Hoeffding Trees
- The Hoeffding bound is applied to the information gain
  - The error decreases as n (the number of samples) increases
  - At each node, we accumulate enough samples (n) before we make a split (see the sketch after this list)
- Scales better than traditional decision-tree algorithms
  - Incremental: nodes are created incrementally as new samples stream in
  - Sub-linear with sampling
  - Small memory requirement
- Cons
  - Only the top 2 attributes are considered
  - Tie breaking takes time
  - Growing a deep tree takes time
  - Discrete attributes only
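To make the split rule concrete: for n samples of a variable with range R (for information gain with c classes, R = log2(c)), the Hoeffding bound guarantees that, with probability 1 - delta, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean. A sketch of the resulting split test (my own; the tie-breaking threshold value is illustrative):

    import math

    def hoeffding_bound(value_range, delta, n):
        # epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
        # the observed mean of n samples is within epsilon of the true mean.
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(gains, value_range, delta, n, tie_threshold=0.05):
        # gains: observed information gain of each attribute at this node.
        best, second = sorted(gains.values(), reverse=True)[:2]
        eps = hoeffding_bound(value_range, delta, n)
        # Split when the best attribute beats the runner-up by more than
        # epsilon, or when epsilon is small enough to call it a tie.
        return (best - second > eps) or (eps < tie_threshold)

    # Example: 2 classes -> R = log2(2) = 1.0
    print(hoeffding_bound(1.0, 1e-6, 1000))                        # ~0.083
    print(should_split({"A": 0.30, "B": 0.18}, 1.0, 1e-6, 1000))   # True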
7. VFDT
- Very Fast Decision Tree [Domingos, Hulten 2000]
- Several improvements: faster and less memory
- Concept changes? A naïve approach:
  - Place a sliding window on the stream
  - Reapply C4.5 or VFDT whenever the window moves
  - Time consuming!
8. CVFDT
- Concept-adapting VFDT [Hulten, Spencer, Domingos, 2001]
- Goal
  - Classifying concept-drifting data streams
- Approach (sketched after this list)
  - Make use of the Hoeffding bound
  - Incorporate windowing
  - Monitor changes in the information gain of the attributes
  - If a change reaches the threshold, generate an alternate subtree with the new best attribute, but keep it in the background
  - Replace the old subtree if the new one becomes more accurate
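A highly simplified sketch of the alternate-subtree idea (my own; the threshold value and bookkeeping are illustrative, and the real CVFDT maintains full per-node statistics over the window):

    class Node:
        def __init__(self, split_attr):
            self.split_attr = split_attr
            self.alternate = None     # background subtree, if any
            self.acc = 0.0            # recent accuracy of the installed subtree
            self.alt_acc = 0.0        # recent accuracy of the alternate subtree

    def monitor(node, gains, threshold=0.05):
        # gains: information gain per attribute over the current window.
        best_attr = max(gains, key=gains.get)
        if node.alternate is None and gains[best_attr] - gains[node.split_attr] > threshold:
            # Change detected: grow an alternate subtree on the new best
            # attribute, but keep classifying with the old one for now.
            node.alternate = Node(best_attr)
        elif node.alternate is not None and node.alt_acc > node.acc:
            # Replace the installed subtree once the alternate is more accurate.
            node.split_attr, node.alternate = node.alternate.split_attr, None

    root = Node("A")
    monitor(root, {"A": 0.10, "B": 0.22})   # drift: alternate subtree on "B" starts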
9. Classifiers for Data Streams
- Fast and Light Classifiers
  - Naïve Bayesian: one pass to count occurrences (see the sketch after this list)
  - Sliding windows: tumbles and slides
  - Adaptive Nearest Neighbor Classification Algorithm (ANNCAD)
- Ensembles of Classifiers (decision trees or others)
  - Bagging ensembles
  - Boosting ensembles
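To illustrate the "one pass to count occurrences" point, a sketch (not the slides' code; the add-one smoothing and the binary-value denominator are illustrative assumptions) of a streaming naïve Bayesian classifier whose update cost is a few O(1) counter increments per tuple:

    from collections import defaultdict

    class StreamingNaiveBayes:
        def __init__(self):
            self.class_counts = defaultdict(int)
            self.feature_counts = defaultdict(int)   # keyed by (class, attr, value)
            self.n = 0

        def update(self, x, label):
            # One pass over the stream: each tuple only increments counters.
            self.n += 1
            self.class_counts[label] += 1
            for i, v in enumerate(x):
                self.feature_counts[(label, i, v)] += 1

        def predict(self, x):
            def score(c):
                # Unnormalized P(c) * prod_i P(x_i | c), add-one smoothing;
                # the +2 denominator assumes binary attribute values.
                p = self.class_counts[c] / self.n
                for i, v in enumerate(x):
                    p *= (self.feature_counts[(c, i, v)] + 1) / (self.class_counts[c] + 2)
                return p
            return max(self.class_counts, key=score)

    nb = StreamingNaiveBayes()
    nb.update([1, 0], "spam"); nb.update([0, 1], "ham")
    print(nb.predict([1, 0]))    # -> "spam"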
10. Basic Ideas
- The stream is partitioned into sequential chunks
- A classifier is trained from each chunk
- The accuracy of voting ensembles is normally better than that of a single classifier (see the sketch after this list)
- Method 1: Bagging
  - Weighted voting: weights are assigned to classifiers based on their recent performance on the current test examples
  - Only the top K classifiers are used
- Method 2: Boosting
  - Majority voting
  - Classifiers are retired by age
  - Boosting is used in training
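A minimal sketch of the chunk-trained, weighted-voting ensemble (illustrative: the tree depth, K, and the use of accuracy on the newest chunk as the weight are my assumptions; [KDD04] derives weights from estimated errors):

    from sklearn.tree import DecisionTreeClassifier

    def train_on_chunks(chunks, k=5):
        # chunks: iterable of (X, y) blocks taken from the stream in order.
        ensemble = []                                  # list of (classifier, weight)
        for X, y in chunks:
            clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
            # Re-weight every member by its accuracy on the newest chunk,
            # a proxy for performance on the current test examples.
            ensemble = [(c, c.score(X, y)) for c, _ in ensemble] + [(clf, clf.score(X, y))]
            ensemble = sorted(ensemble, key=lambda cw: cw[1], reverse=True)[:k]
        return ensemble

    def predict(ensemble, x):
        votes = {}
        for clf, w in ensemble:
            label = clf.predict([x])[0]
            votes[label] = votes.get(label, 0.0) + w   # weighted voting
        return max(votes, key=votes.get)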
11. Bagging Ensemble Method
12. Mining Streams with Concept Changes
- Changes are detected by a drop in accuracy or by other methods
- Build new classifiers on new windows
- Search among the old classifiers for those that have now become accurate again
13. Boosting Ensembles for Adaptive Mining of Data Streams
- Andrea Fang Chu, Carlo Zaniolo
- PAKDD 2004
14. Mining Data Streams: Desiderata
- Fast learning (preferably in one pass over the data)
- Light requirements (low time complexity, low memory usage)
- Adaptation (the model always reflects the time-changing concept)
15. Adaptive Boosting Ensembles
- The training stream is split into blocks (i.e., windows)
- Each individual classifier is learned from one block
- A boosting ensemble of (7-19) members is maintained over time
- Decisions are taken by simple majority voting
- As the (N+1)th classifier is built, boost the weights of the tuples misclassified by the first N classifiers (see the sketch after this list)
- Change detection is exploited to achieve adaptation
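A sketch of the weight-boosting step (my own; integer class labels and the 2x up-weighting factor are illustrative assumptions, and the paper's re-weighting scheme is more refined):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost_on_block(ensemble, X, y, max_members=10):
        # y: class labels as non-negative integers (needed by bincount).
        if ensemble:
            votes = np.array([clf.predict(X) for clf in ensemble])
            majority = np.apply_along_axis(
                lambda col: np.bincount(col).argmax(), 0, votes)
            # Boost the weight of tuples the first N classifiers got wrong.
            weights = np.where(majority != y, 2.0, 1.0)
        else:
            weights = np.ones(len(y))
        clf = DecisionTreeClassifier(max_depth=2)      # shallow weak learner
        clf.fit(X, y, sample_weight=weights)
        ensemble.append(clf)
        return ensemble[-max_members:]                 # retire members by age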
16. Fast and Light
- Experiments show that boosting ensembles of weak learners provide accurate prediction
- Weak learners
  - An aggressively pruned decision tree, e.g., a shallow tree (this means fast!)
  - Trained on a small set of examples (this means light memory requirements!)
17. Adaptation
- Detect changes that cause significant drops in ensemble performance
  - Gradual changes: concept drift
  - Abrupt changes: concept shift
18. Adaptability
- The error rate is viewed as a random variable
- When it deviates significantly from its recent average (i.e., accuracy drops), the whole ensemble is dropped
- And a new one is quickly re-learned (a sketch follows)
- The cost/performance of boosting ensembles is better than that of bagging ensembles [KDD04]
- BUT ???
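One way to make the "significant drop" test concrete (a sketch; the k = 3 standard-deviation rule is an illustrative assumption, while the paper treats the error rate as a random variable more formally):

    import statistics

    def change_detected(recent_errors, new_error, k=3.0):
        # recent_errors: per-block error rates of the current ensemble.
        # Flag a concept change when the new block's error exceeds the
        # recent average by more than k standard deviations.
        mu = statistics.mean(recent_errors)
        sigma = statistics.pstdev(recent_errors) or 1e-9
        return new_error > mu + k * sigma

    errors = [0.10, 0.12, 0.11, 0.09, 0.10]
    print(change_detected(errors, 0.13))   # False: normal fluctuation
    print(change_detected(errors, 0.35))   # True: drop ensemble, relearn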
19. References
- Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han. Mining Concept-Drifting Data Streams using Ensemble Classifiers. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003.
- Pedro Domingos, Geoff Hulten. Mining High-Speed Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.
- Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining Time-Changing Data Streams. ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2001.
- Wei Fan, Yi-an Huang, Haixun Wang, Philip S. Yu. Active Mining of Data Streams. SIAM International Conference on Data Mining (SDM), 2004.
- Fang Chu, Yizhou Wang, Carlo Zaniolo. An Adaptive Learning Approach for Noisy Data Streams. 4th IEEE International Conference on Data Mining (ICDM), 2004.
- Fang Chu, Carlo Zaniolo. Fast and Light Boosting for Adaptive Mining of Data Streams. PAKDD 2004, 282-292.
- Yan-Nei Law, Carlo Zaniolo. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. ECML/PKDD 2005, Porto, Portugal, October 3-7, 2005.