On Demand Classification of Data Streams

1
On Demand Classification of Data Streams
  • Charu C. Aggarwal
  • Jiawei Han
  • Philip S. Yu

Proc. 2004 Int. Conf. on Knowledge Discovery and
Data Mining (KDD'04), Seattle, WA, Aug. 2004
Speaker: Pei-Min Chou
Date: 2005/04/01
2
Outline
  • Introduction
  • Supervised Micro-cluster
  • Snapshot
  • Maintenance of Supervised Micro-clusters
  • Training Data Stream
  • Classification on Demand
  • Empirical Results

3
Introduction
  • Advances in data storage produce data that often grow without limit,
    referred to as data streams
  • One-pass mining model
  • does not recognize changes in the stream, and it is too
    expensive to keep track of the entire history
  • The accuracy of a static classification model is likely to drop when
    there is a sudden burst
  • Our model
  • simultaneous training and testing streams are
    used for dynamic classification of data sets

4
Supervised Micro-cluster: a Modified Micro-cluster
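  • Built only from training data; each micro-cluster holds points of the
    same class
  • Data streams
  • Multi-dimensional points $\overline{X_1} \ldots \overline{X_k} \ldots$ with
    time stamps $T_1 \ldots T_k \ldots$
  • Each point contains d dimensions, i.e., $\overline{X_i} = (x_i^1 \ldots x_i^d)$
  • A micro-cluster for n points is defined as a $(2 \cdot d + 4)$ tuple
    $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n, class\_id)$
  • $\overline{CF2^x}$ - the sum of the squares of the data values (d entries)
  • $\overline{CF1^x}$ - the sum of the data values (d entries)
  • $CF2^t$ - the sum of the squares of the time stamps
  • $CF1^t$ - the sum of the time stamps
  • $n$ - the number of data points
  • $class\_id$ - the class label of that micro-cluster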
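
The tuple above can be sketched as a small data structure. This is only an illustration under assumptions: the class name `SupervisedMicroCluster`, its field names, and the `centroid` helper are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SupervisedMicroCluster:
    """Hypothetical container for the (2d + 4) summary tuple."""
    cf2x: List[float]  # per-dimension sum of squares of the data values
    cf1x: List[float]  # per-dimension sum of the data values
    cf2t: float        # sum of squares of the time stamps
    cf1t: float        # sum of the time stamps
    n: int             # number of data points
    class_id: int      # class label shared by every point in the cluster

    @classmethod
    def from_point(cls, x, t, class_id):
        # A cluster that summarizes a single point.
        return cls([v * v for v in x], list(x), t * t, t, 1, class_id)

    def add(self, x, t):
        # Every component is a sum, so a point is absorbed additively.
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def centroid(self):
        # The mean of the points is recoverable from the summary alone.
        return [s / self.n for s in self.cf1x]
```

Because only sums are stored, the memory used per micro-cluster does not depend on how many points it has absorbed.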

5
Snapshot
  • Keeping track of history must not be too expensive
  • store the behavior of the micro-clusters at
    different moments in time
  • a snapshot taken at time t belongs to frame i
    if $(t \bmod 2^i) = 0$ but $(t \bmod 2^{i+1}) \neq 0$
  • when a frame reaches max capacity, the oldest snapshot in this
    frame is removed
  • geometric time frame
  • frame numbers i vary from 0 to a value no larger than $\log_2(T)$,
  • T is the maximum length of the stream
  • maximum number of snapshots
  • $(max\ capacity) \cdot \log_2(T)$
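
A minimal sketch of the geometric time frame, assuming integer clock times starting at 1. The names `frame_of` and `SnapshotStore` are hypothetical, and a snapshot can be any object here.

```python
from collections import deque


def frame_of(t: int) -> int:
    """Return the frame i with t divisible by 2**i but not by 2**(i + 1)."""
    if t <= 0:
        raise ValueError("clock time must be a positive integer")
    i = 0
    while t % (2 ** (i + 1)) == 0:
        i += 1
    return i


class SnapshotStore:
    """Keeps at most max_capacity snapshots per frame."""

    def __init__(self, max_capacity: int):
        self.max_capacity = max_capacity
        self.frames = {}  # frame number -> deque of (time, snapshot)

    def store(self, t: int, snapshot) -> None:
        frame = self.frames.setdefault(
            frame_of(t), deque(maxlen=self.max_capacity)
        )
        # A full deque with maxlen drops its oldest entry on append.
        frame.append((t, snapshot))

    def stored_times(self):
        return sorted(t for frame in self.frames.values() for t, _ in frame)
```

With `max_capacity=2` and snapshots taken at times 1 through 8, frame 0 retains only times 5 and 7, while the sparser higher frames still hold times 2, 6, 4 and 8. Recent history is therefore stored densely and older history coarsely.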

6
Maintenance of Supervised Micro-clusters
  • Uses nearest neighbor and k-means algorithms
  • Creating the initial micro-clusters is an offline process
  • offline: answers various user queries based on
    the stored summary statistics
  • When a new data point $\overline{X_{i_k}}$ arrives, it is either
    added to an existing micro-cluster, or a new micro-cluster
    is created
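
The add-or-create step can be sketched as follows. The slides do not say how the choice is made, so the fixed distance threshold `max_dist` is an assumption of this sketch, as are the function names and the dictionary layout; only micro-clusters of the point's own class are considered, since each micro-cluster holds a single class.

```python
import math


def centroid(mc):
    return [s / mc["n"] for s in mc["sum"]]


def absorb_or_create(clusters, x, class_id, max_dist):
    """Add x to the nearest same-class micro-cluster, or start a new one.

    Assumption: a point is absorbed only if it lies within max_dist of
    the centroid; this threshold rule is illustrative, not from the slides.
    """
    best, best_d = None, math.inf
    for mc in clusters:
        if mc["class_id"] != class_id:
            continue  # micro-clusters never mix classes
        d = math.dist(centroid(mc), x)
        if d < best_d:
            best, best_d = mc, d
    if best is not None and best_d <= max_dist:
        best["sum"] = [s + v for s, v in zip(best["sum"], x)]
        best["n"] += 1
        return best
    new = {"sum": list(x), "n": 1, "class_id": class_id}
    clusters.append(new)
    return new
```

This sketch lets the number of micro-clusters grow without bound; a real implementation would also need a rule for deleting or merging old micro-clusters to stay within a memory budget.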

7
Classification on Demand
  • Construct
  • Find the correct time-horizon
  • The value of kfit
  • Should a large or a small horizon be chosen?
  • Test

8
Find the correct time-horizon
  • Macro-clusters are created over a user-specified
    time horizon h
  • Let $S(t_c)$ be the snapshot of micro-clusters at time
    $t_c$
  • $S(t_c - h)$ is the snapshot of micro-clusters
    at time $t_c - h$
  • The new set of micro-clusters $N(t_c, h)$ is created
    by subtracting $S(t_c - h)$ from $S(t_c)$
  • Subtractive property
  • Let $C_1$ and $C_2$ be two sets of points such that $C_1 \supseteq C_2$
  • Then $CFT(C_1 - C_2) = CFT(C_1) - CFT(C_2)$
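
The subtraction of two snapshots can be sketched as below. It assumes that every micro-cluster carries a stable identifier, so that the same micro-cluster can be matched across snapshots; that identifier, the dictionary layout, and the function name are hypothetical. Only the linear sum and the count are shown, and the other components subtract in the same way.

```python
def subtract_snapshots(s_now, s_old):
    """Return the micro-cluster statistics accumulated between two snapshots.

    Each snapshot maps a micro-cluster id to
    {"cf1x": [...], "n": int, "class_id": ...}.
    """
    result = {}
    for cid, cur in s_now.items():
        old = s_old.get(cid)
        if old is None:
            # Created inside the horizon: keep its statistics unchanged.
            result[cid] = dict(cur, cf1x=list(cur["cf1x"]))
            continue
        n = cur["n"] - old["n"]
        if n == 0:
            continue  # no points arrived in this cluster during the horizon
        result[cid] = {
            "cf1x": [a - b for a, b in zip(cur["cf1x"], old["cf1x"])],
            "n": n,
            "class_id": cur["class_id"],
        }
    return result
```

The subtraction is valid because the points summarized at the earlier time are a subset of those summarized at the later time, which is the condition the subtractive property requires.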

9
Training Data Stream
  • A small portion of the training stream is used for the
    process of horizon fitting (the horizon-fitting stream segment)
  • kfit: the number of points in the data used for horizon fitting;
    the value can be as small as 1% of the data
  • The remaining portion of the training stream is used
    for the creation and maintenance of the
    class-specific micro-clusters

10
The value of kfit
  • The horizon is determined by classification accuracy
  • The process is executed periodically to account for changes
  • kfit should be small enough so that the points in
    it reflect the immediate locality of $t_c$
  • Qfit: the set of these points, lying within a pre-specified number
    of time units of $t_c$
  • a part of the training stream
  • the class labels are known a priori
  • Nearest neighbor procedure (for each $X \in Q_{fit}$)
  • Find the closest micro-cluster in $N(t_c, h)$ to X
  • compare its class label with the true label of X
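
The accuracy check for a single horizon can be sketched as below. The sketch assumes each micro-cluster of N(tc, h) has already been reduced to a (centroid, class label) pair and that Qfit is a list of (point, true label) pairs; the function names and Euclidean distance are assumptions.

```python
import math


def nearest_label(clusters, x):
    """Class label of the micro-cluster whose centroid is closest to x."""
    _, label = min(clusters, key=lambda c: math.dist(c[0], x))
    return label


def horizon_accuracy(clusters, q_fit):
    """Fraction of labelled fitting points that one horizon classifies correctly."""
    correct = sum(
        1 for x, true_label in q_fit
        if nearest_label(clusters, x) == true_label
    )
    return correct / len(q_fit)
```

For example, with centroids at (0, 0) labelled "a" and (10, 10) labelled "b", a fitting set in which three of four points lie nearest to the centroid of their own class yields an accuracy of 0.75.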

11
Should a large or a small horizon be chosen?
  • The accuracy of each time horizon tracked by
    the geometric time frame is
    determined
  • The p time horizons which provide the greatest
    dynamic classification accuracy are selected
  • denoted by $H = \{h_1 \ldots h_p\}$
  • At first sight: the smallest horizon
  • Stable stream: a large horizon
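
Selecting the p best horizons amounts to a ranking by measured accuracy. In this sketch the function name and the dictionary that maps each horizon to its accuracy are hypothetical, and ties are resolved by input order, which is an arbitrary choice.

```python
def select_horizons(accuracy_by_horizon, p):
    """Return the p horizons with the highest measured accuracy."""
    ranked = sorted(
        accuracy_by_horizon,
        key=lambda h: accuracy_by_horizon[h],
        reverse=True,
    )
    return ranked[:p]
```

Since the fitting process is repeated periodically, the selected set can shift toward small horizons when the stream changes quickly and toward large ones when it is stable.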

12
Test
  • Classifying the test stream is a separate process which is
    executed continuously throughout the algorithm
  • For each test instance $X_t$, the nearest neighbor classification
    process is applied using each $h_i \in H$
  • each application results in the determination of a class label
  • the majority class among these p class labels is reported as
    the relevant class
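
The test step can be sketched as below, assuming the p labels are combined by a simple majority vote. The function name and the input layout (a dictionary from horizon to a list of (centroid, class label) pairs) are hypothetical.

```python
import math
from collections import Counter


def classify(x, clusters_by_horizon):
    """Label x by a vote over the nearest micro-cluster of each horizon."""
    votes = []
    for clusters in clusters_by_horizon.values():
        # One nearest-neighbor decision per selected horizon.
        _, label = min(clusters, key=lambda c: math.dist(c[0], x))
        votes.append(label)
    # Assumption: the most frequent label wins; ties go to the first seen.
    return Counter(votes).most_common(1)[0][0]
```

Choosing an odd p avoids ties when there are only two classes.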

13
Empirical Results
  • Test machine: Pentium III, 512 MB memory, Windows XP
  • Both real and synthetic data sets
  • Advantages
  • much higher classification accuracy
  • Good scalability in terms of dimensionality and
    the number of class labels
  • Stable processing rate
  • Space-efficient

14
Experiment
15
Experiment
16
Experiment