On Demand Classification of Data Streams presentation

1
On Demand Classification of Data Streams

Proc. 2004 Int. Conf. on Knowledge Discovery and
Data Mining (KDD'04), Seattle, WA, Aug. 2004
Speaker: Pei-Min Chou
Date: 2005/04/01
2
Outline

3
Introduction

Advances in data storage produce data sets that often grow without limit
these are referred to as data streams
One-pass mining model:
does not recognize changes in the stream, and it is too expensive to keep track of the entire history
Static classification model:
accuracy is likely to drop when there is a sudden burst
Our model:
simultaneous training and testing streams
used for dynamic classification of data sets

4
Supervised Micro-cluster Modify Micro-cluster

5
Snapshot

6
Maintenance of Supervised Micro-clusters

Uses nearest neighbor and k-means algorithms
The initial micro-clusters are created by an offline process
Offline component: answers various user queries based on the stored summary statistics
When a new data point Xik arrives, it is either added to an existing micro-cluster or a new micro-cluster is created
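
The add-or-create step above can be sketched as follows. This is only an illustration under stated assumptions: the `MicroCluster` fields, the `absorb` function, and the `boundary_factor` and `min_radius` parameters are hypothetical names and values that do not come from the slides. The sketch also ignores any limit on the number of micro-clusters.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Summary statistics for one class-labelled micro-cluster (hypothetical layout)."""
    label: str
    n: int = 0
    ls: list = field(default_factory=list)  # per-dimension linear sum
    ss: list = field(default_factory=list)  # per-dimension sum of squares

    def add(self, x):
        if not self.ls:
            self.ls = [0.0] * len(x)
            self.ss = [0.0] * len(x)
        self.n += 1
        for d, v in enumerate(x):
            self.ls[d] += v
            self.ss[d] += v * v

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms_radius(self):
        # Root-mean-square deviation of the points from the centroid.
        var = sum(self.ss[d] / self.n - (self.ls[d] / self.n) ** 2
                  for d in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))


def absorb(clusters, x, label, boundary_factor=2.0, min_radius=1.0):
    """Add x to the nearest same-class micro-cluster, or start a new one."""
    same_class = [c for c in clusters if c.label == label]
    if same_class:
        nearest = min(same_class, key=lambda c: math.dist(c.centroid(), x))
        # Assumed rule: absorb only if x falls within a multiple of the radius.
        limit = max(boundary_factor * nearest.rms_radius(), min_radius)
        if math.dist(nearest.centroid(), x) <= limit:
            nearest.add(x)
            return nearest
    fresh = MicroCluster(label)
    fresh.add(x)
    clusters.append(fresh)
    return fresh


clusters = []
absorb(clusters, [0.0, 0.0], "a")    # first point: creates a micro-cluster
absorb(clusters, [0.5, 0.0], "a")    # close enough: absorbed
absorb(clusters, [10.0, 10.0], "a")  # too far: new micro-cluster
absorb(clusters, [0.1, 0.0], "b")    # different class: new micro-cluster
```

Only counts and sums are stored, so adding a point costs constant time per dimension and the raw points never need to be kept.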

7
Classification on Demand

8
Find the correct time-horizon

Macro-clusters are created over a user-specified time horizon h
Let S(tc) be the snapshot of micro-clusters at time tc
Let S(tc - h) be the snapshot of micro-clusters at time tc - h
The new set of micro-clusters N(tc, h) is created by subtracting S(tc - h) from S(tc)
Subtractive property:
Let C1 and C2 be two sets of points such that C1 ⊇ C2
Then the summary statistics of C1 - C2 are those of C1 minus those of C2
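
A minimal sketch of the snapshot subtraction, assuming each snapshot is a dictionary that maps a stable micro-cluster id to additive statistics (count `n`, linear sum `ls`, sum of squares `ss`) plus a class label. The dictionary layout and the function names `subtract_stats` and `horizon_clusters` are assumptions made for illustration.

```python
def subtract_stats(later, earlier):
    """Statistics of the points that arrived between two snapshots of one micro-cluster."""
    return {
        "label": later["label"],
        "n": later["n"] - earlier["n"],
        "ls": [a - b for a, b in zip(later["ls"], earlier["ls"])],
        "ss": [a - b for a, b in zip(later["ss"], earlier["ss"])],
    }


def horizon_clusters(snap_now, snap_then):
    """Build N(tc, h) by subtracting the older snapshot from the current one."""
    result = {}
    for cid, stats in snap_now.items():
        old = snap_then.get(cid)
        diff = subtract_stats(stats, old) if old is not None else dict(stats)
        if diff["n"] > 0:  # drop micro-clusters with no points inside the horizon
            result[cid] = diff
    return result


snap_then = {1: {"label": "a", "n": 2, "ls": [2.0, 4.0], "ss": [2.0, 8.0]}}
snap_now = {
    1: {"label": "a", "n": 5, "ls": [8.0, 10.0], "ss": [14.0, 20.0]},
    2: {"label": "b", "n": 3, "ls": [3.0, 3.0], "ss": [3.0, 3.0]},
}
recent = horizon_clusters(snap_now, snap_then)
```

The subtraction works because every stored statistic is a plain sum over points, so removing a subset of points is the same as subtracting that subset's sums.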

9
Training Data Stream

A small portion of the training stream, the horizon fitting stream segment, is used for the process of horizon fitting
kfit: the number of points used for horizon fitting; its value can be as small as 1% of the data
The remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters

10
The value of kfit

The horizon is determined by classification accuracy
The process is executed periodically to adjust for changes in the stream
kfit should be small enough that its points reflect the immediate locality of tc
Qfit: pre-specified number of time units
it is a part of the training stream
its class labels are known a priori
Nearest neighbor procedure (for each X ∈ Qfit):
find the closest micro-cluster in N(tc, h) to X
compare its class label with the true label of X
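
The nearest neighbor check for a single horizon could look like the sketch below. It assumes the micro-clusters of N(tc, h) have already been reduced to `(centroid, label)` pairs and that Qfit is a list of `(point, true_label)` pairs. Both representations and the function names are hypothetical.

```python
import math


def nearest_label(clusters, x):
    """Class label of the micro-cluster whose centroid is closest to x."""
    _, label = min(clusters, key=lambda c: math.dist(c[0], x))
    return label


def horizon_accuracy(clusters, q_fit):
    """Fraction of labelled fitting points classified correctly for one horizon."""
    if not clusters or not q_fit:
        return 0.0
    hits = sum(1 for x, true_label in q_fit
               if nearest_label(clusters, x) == true_label)
    return hits / len(q_fit)


clusters = [([0.0, 0.0], "a"), ([10.0, 10.0], "b")]
q_fit = [([1.0, 1.0], "a"), ([9.0, 9.0], "b"), ([2.0, 2.0], "b"), ([8.0, 9.0], "b")]
accuracy = horizon_accuracy(clusters, q_fit)  # 3 of 4 points match their true label
```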

11
Should a large or small horizon be chosen?

The accuracy of every time horizon tracked by the geometric time frame is determined
The p time horizons that provide the greatest dynamic classification accuracy are selected, denoted by H
At first sight: the smallest horizon seems best
Stable stream: a large horizon
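
Picking the p most accurate horizons can be sketched as a simple ranking. The function name `best_horizons`, the example horizon values and accuracies, and the rule of breaking ties in favour of the smaller horizon are all assumptions made for illustration.

```python
def best_horizons(accuracy_by_horizon, p):
    """Return the p horizons with the highest fitting accuracy."""
    ranked = sorted(accuracy_by_horizon.items(),
                    key=lambda kv: (-kv[1], kv[0]))  # accuracy descending, then horizon ascending
    return [h for h, _ in ranked[:p]]


# Hypothetical accuracies measured on the horizon fitting segment.
accuracy_by_horizon = {1: 0.70, 2: 0.82, 4: 0.82, 8: 0.60}
selected = best_horizons(accuracy_by_horizon, p=2)
```

Re-running the selection each time the fitting process is repeated lets the chosen horizons follow the stream: short horizons win after a sudden change, long ones when the stream is stable.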

12
Test

The test stream is handled by a separate process which is executed continuously throughout the algorithm
When a test instance Xt arrives, the nearest neighbor classification process is applied using each horizon h ∈ H
This results in the determination of p class labels
The majority class among these p class labels is reported as the relevant class
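
A minimal sketch of classifying one test point, assuming the p per-horizon labels are combined by majority vote and that each horizon's micro-clusters are available as `(centroid, label)` pairs. The `classify` function and the data layout are hypothetical.

```python
import math
from collections import Counter


def classify(x_t, clusters_by_horizon, horizons):
    """Nearest neighbor label per selected horizon, then a majority vote."""
    votes = []
    for h in horizons:
        _, label = min(clusters_by_horizon[h],
                       key=lambda c: math.dist(c[0], x_t))
        votes.append(label)
    return Counter(votes).most_common(1)[0][0]


clusters_by_horizon = {
    1: [([0.0, 0.0], "a"), ([5.0, 5.0], "b")],
    2: [([1.0, 1.0], "a"), ([6.0, 6.0], "b")],
    4: [([4.0, 4.0], "b"), ([-5.0, -5.0], "a")],
}
predicted = classify([2.0, 2.0], clusters_by_horizon, horizons=[1, 2, 4])
```

In this example horizons 1 and 2 vote for class "a" and horizon 4 votes for "b", so the reported class is "a".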

13
Empirical Results

14
Experiment
15
Experiment
16
Experiment
