On Demand Classification of Data Streams


1
On Demand Classification of Data Streams
  • Charu C. Aggarwal
  • Jiawei Han
  • Philip S. Yu

Proc. 2004 Int. Conf. on Knowledge Discovery and
Data Mining (KDD'04), Seattle, WA, Aug. 2004
Speaker: Pei-Min Chou
Date: 2005/04/01
2
Outline
  • Introduction
  • Supervised Micro-cluster
  • Snapshot
  • Maintenance of Supervised Micro-clusters
  • Training Data Stream
  • Classification on Demand
  • Empirical Results

3
Introduction
  • Advances in data storage technology produce data
    that often grow without limit; such data are
    referred to as data streams
  • One-pass mining model
  • does not recognize changes in the stream, and it
    is too expensive to keep track of the entire
    history
  • the accuracy of a static classification model is
    likely to drop when there is a sudden burst of
    records belonging to a particular class
  • Our model
  • simultaneous training and testing streams are
    used for dynamic classification of data sets

4
Supervised Micro-cluster (Modified Micro-cluster)
  • Built only from training data; every point in a
    micro-cluster has the same class
  • Data streams
  • Multi-dimensional points $X_1, \ldots, X_k, \ldots$
    with time stamps $T_1, \ldots, T_k, \ldots$
  • Each point contains d dimensions, i.e.,
    $X_i = (x_i^1, \ldots, x_i^d)$
  • A micro-cluster for n points is defined as a
    (2d + 4) tuple
    $(CF2^x, CF1^x, CF2^t, CF1^t, n, class\_id)$
  • $CF2^x$ - the sum of the squares of the data
    values (d entries)
  • $CF1^x$ - the sum of the data values (d entries)
  • $CF2^t$ - the sum of the squares of the time stamps
  • $CF1^t$ - the sum of the time stamps
  • $n$ - the number of data points
  • $class\_id$ - the class label of that
    micro-cluster
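
The tuple above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation; the class name `SupervisedMicroCluster` and the methods `from_point`, `absorb`, and `centroid` are hypothetical names chosen here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SupervisedMicroCluster:
    """Hypothetical container for the (2d + 4) tuple of one class."""
    cf2x: List[float]  # per-dimension sum of squared data values (d entries)
    cf1x: List[float]  # per-dimension sum of data values (d entries)
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points absorbed so far
    class_id: int      # class label shared by every point

    @classmethod
    def from_point(cls, x, t, class_id):
        """Start a micro-cluster from a single labeled point."""
        return cls([v * v for v in x], list(x), t * t, t, 1, class_id)

    def absorb(self, x, t):
        """Add one point by updating the running sums."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def centroid(self):
        """Mean of the absorbed points, recovered from the sums."""
        return [s / self.n for s in self.cf1x]


mc = SupervisedMicroCluster.from_point([1.0, 2.0], t=1, class_id=0)
mc.absorb([3.0, 4.0], t=2)
print(mc.centroid())  # [2.0, 3.0]
```

Because only sums are stored, the memory per micro-cluster stays at 2d + 4 numbers no matter how many points it absorbs.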

5
Snapshot
  • keeping track of history must not be too
    expensive
  • store the behavior of the micro-clusters at
    different moments in time
  • a snapshot at clock time t is stored in frame i
    if $t \bmod 2^i = 0$ but $t \bmod 2^{i+1} \neq 0$
  • when a frame reaches max capacity, the oldest
    snapshot in this frame is removed
  • geometric time frame
  • frame numbers vary from 0 to a value no larger
    than $\log_2(T)$,
  • T is the maximum length of the stream
  • maximum number of snapshots
  • (max capacity) $\cdot \log_2(T)$
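
The storage rule of the geometric time frame can be simulated in a few lines. The sketch below is illustrative only: it records clock times instead of real micro-cluster snapshots, and the names `frame_number` and `store_snapshots` are hypothetical.

```python
from collections import deque


def frame_number(t):
    """Largest i such that 2**i divides t (t is a positive clock time)."""
    i = 0
    while t % (2 ** (i + 1)) == 0:
        i += 1
    return i


def store_snapshots(T, max_capacity):
    """Simulate clock times 1..T and return the retained times per frame."""
    frames = {}
    for t in range(1, T + 1):
        i = frame_number(t)
        # deque(maxlen=...) drops the oldest entry once the frame is full
        frames.setdefault(i, deque(maxlen=max_capacity)).append(t)
    return {i: list(q) for i, q in frames.items()}


frames = store_snapshots(T=16, max_capacity=2)
for i in sorted(frames):
    print(i, frames[i])
```

With T = 16 and a capacity of 2, frame 0 keeps only the two most recent odd clock times, while higher frames keep older but sparser snapshots.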

6
Maintenance of Supervised Micro-clusters
  • Nearest neighbor and k-means algorithms
  • The initial micro-clusters are created by an
    offline process
  • Offline: answers various user queries based on
    the stored summary statistics
  • When a new data point $X_{i_k}$ arrives, it is
    either added to a micro-cluster, or a new
    micro-cluster is created
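
The add-or-create decision can be sketched as follows. The slide does not say how "close enough" is decided, so the fixed `max_radius` threshold below is purely an assumption made for illustration, as are the names `assign` and `centroid`. For brevity the sketch tracks only the sum of data values and the point count.

```python
import math


def centroid(cluster):
    """Mean of the points in a cluster, from its running sums."""
    return [s / cluster["n"] for s in cluster["cf1x"]]


def assign(point, label, clusters, max_radius):
    """Add point to the nearest same-class cluster, or create a new one.

    max_radius is a hypothetical fixed threshold, not taken from the source.
    """
    same_class = [c for c in clusters if c["class_id"] == label]
    if same_class:
        nearest = min(same_class, key=lambda c: math.dist(point, centroid(c)))
        if math.dist(point, centroid(nearest)) <= max_radius:
            nearest["cf1x"] = [s + v for s, v in zip(nearest["cf1x"], point)]
            nearest["n"] += 1
            return nearest
    new = {"cf1x": list(point), "n": 1, "class_id": label}
    clusters.append(new)
    return new


clusters = []
assign([0.0, 0.0], 0, clusters, max_radius=2.0)    # first point: new cluster
assign([1.0, 0.0], 0, clusters, max_radius=2.0)    # close, same class: added
assign([10.0, 10.0], 0, clusters, max_radius=2.0)  # far away: new cluster
assign([0.5, 0.0], 1, clusters, max_radius=2.0)    # other class: new cluster
print(len(clusters))  # 3
```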

7
Classification on Demand
  • Construct
  • Find the correct time-horizon
  • The value of kfit
  • Choosing a large or small horizon
  • Test

8
Find the correct time-horizon
  • Macro-clusters are created over a user-specified
    time horizon h
  • Let $S(t_c)$ be the snapshot of micro-clusters at
    time $t_c$
  • Let $S(t_c - h)$ be the snapshot of micro-clusters
    at time $t_c - h$
  • The new set of micro-clusters $N(t_c, h)$ is
    created by subtracting $S(t_c - h)$ from $S(t_c)$
  • Subtractive property
  • Let $C_1$ and $C_2$ be two sets of points such
    that $C_1 \supseteq C_2$
  • Then $CFT(C_1 - C_2) = CFT(C_1) - CFT(C_2)$
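
The subtraction of two snapshots can be sketched field by field. This is an illustration under assumptions: snapshots are modeled as dictionaries keyed by a micro-cluster id, and matching clusters across snapshots by that id is a simplification chosen here. The names `subtract` and `horizon_clusters` are hypothetical.

```python
def subtract(cf_now, cf_old):
    """Subtract the older statistics from the newer ones, field by field."""
    return {
        "cf2x": [a - b for a, b in zip(cf_now["cf2x"], cf_old["cf2x"])],
        "cf1x": [a - b for a, b in zip(cf_now["cf1x"], cf_old["cf1x"])],
        "cf2t": cf_now["cf2t"] - cf_old["cf2t"],
        "cf1t": cf_now["cf1t"] - cf_old["cf1t"],
        "n": cf_now["n"] - cf_old["n"],
        "class_id": cf_now["class_id"],
    }


def horizon_clusters(snap_now, snap_old):
    """Micro-clusters for the horizon: newer snapshot minus older snapshot."""
    result = {}
    for mc_id, cf in snap_now.items():
        old = snap_old.get(mc_id)
        diff = subtract(cf, old) if old is not None else cf
        if diff["n"] > 0:  # keep only clusters that received points in h
            result[mc_id] = diff
    return result


snap_old = {1: {"cf2x": [1.0], "cf1x": [1.0], "cf2t": 1, "cf1t": 1,
                "n": 1, "class_id": 0}}
snap_now = {1: {"cf2x": [10.0], "cf1x": [4.0], "cf2t": 5, "cf1t": 3,
                "n": 2, "class_id": 0},
            2: {"cf2x": [4.0], "cf1x": [2.0], "cf2t": 4, "cf1t": 2,
                "n": 1, "class_id": 1}}
recent = horizon_clusters(snap_now, snap_old)
print(recent[1]["cf1x"], recent[1]["n"])  # [3.0] 1
```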

9
Training Data Stream
  • A small portion of the stream, the
    horizon-fitting stream segment, is used for the
    process of horizon fitting
  • kfit: the number of points in this segment; the
    value can be as small as 1% of the data
  • The remaining portion of the training stream is
    used for the creation and maintenance of the
    class-specific micro-clusters

10
The value of kfit
  • The horizon determines the classification accuracy
  • The process is executed periodically to adapt to
    changes in the stream
  • $k_{fit}$ should be small enough so that the
    points in it reflect the immediate locality of
    $t_c$
  • $Q_{fit}$: the horizon-fitting segment, covering a
    pre-specified number of time units
  • it is a part of the training stream
  • its class labels are known a priori
  • Nearest neighbor procedure (for each
    $X \in Q_{fit}$)
  • Find the closest micro-cluster in $N(t_c, h)$ to X
  • Compare its class label with the true label of X
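
The accuracy test for a single horizon can be sketched as below. It assumes each micro-cluster is represented by its centroid (the sum of data values divided by the point count) and that closeness means Euclidean distance; the names `nearest_label` and `horizon_accuracy` and the sample points are hypothetical.

```python
import math


def nearest_label(x, clusters):
    """Class label of the micro-cluster whose centroid is closest to x."""
    best = min(clusters, key=lambda c: math.dist(x, c["centroid"]))
    return best["class_id"]


def horizon_accuracy(q_fit, clusters):
    """Fraction of labeled horizon-fitting points classified correctly."""
    hits = sum(nearest_label(x, clusters) == y for x, y in q_fit)
    return hits / len(q_fit)


# Micro-clusters obtained for one horizon h (illustrative values only)
clusters_h = [
    {"centroid": [0.0, 0.0], "class_id": 0},
    {"centroid": [5.0, 5.0], "class_id": 1},
]
# Horizon-fitting points as (point, true label) pairs
q_fit = [([0.5, 0.2], 0), ([4.8, 5.1], 1), ([1.0, 1.0], 1), ([6.0, 4.0], 1)]
acc = horizon_accuracy(q_fit, clusters_h)
print(acc)  # 0.75
```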

11
Choosing a large or small horizon
  • The accuracy of every time horizon tracked by the
    geometric time frame is determined
  • The p time horizons which provide the greatest
    dynamic classification accuracy are selected
  • They are denoted by $H = \{h_1, \ldots, h_p\}$
  • At first sight: the smallest horizon seems best
  • Stable stream: a large horizon works better
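
Selecting the p best horizons is a simple ranking step. In the sketch below the accuracy values are invented for illustration and the function name `select_horizons` is hypothetical.

```python
def select_horizons(accuracy_by_horizon, p):
    """Return the p horizons with the highest measured accuracy."""
    ranked = sorted(accuracy_by_horizon,
                    key=accuracy_by_horizon.get, reverse=True)
    return ranked[:p]


# Horizon length -> accuracy on the horizon-fitting points (made-up numbers)
accuracy_by_horizon = {1: 0.62, 2: 0.71, 4: 0.80, 8: 0.77, 16: 0.65}
H = select_horizons(accuracy_by_horizon, p=3)
print(H)  # [4, 8, 2]
```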

12
Test
  • Classifying the test stream is a separate process
    which is executed continuously throughout the
    algorithm
  • When a test instance $X_t$ arrives, the nearest
    neighbor classification process is applied using
    each $h_i \in H$
  • Different horizons may result in the
    determination of different class labels
  • The majority class among these p class labels is
    reported as the relevant class
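
The test step can be sketched as a vote across the selected horizons. The sketch assumes centroids and Euclidean distance as in a plain nearest neighbor search; the function name `classify` and all sample values are hypothetical, and ties fall to the label seen first.

```python
import math
from collections import Counter


def classify(x_t, clusters_by_horizon):
    """Nearest-neighbor label per horizon, then the majority label."""
    votes = []
    for clusters in clusters_by_horizon.values():
        best = min(clusters, key=lambda c: math.dist(x_t, c["centroid"]))
        votes.append(best["class_id"])
    return Counter(votes).most_common(1)[0][0]


# One list of micro-clusters per selected horizon (illustrative values)
clusters_by_horizon = {
    1: [{"centroid": [0.0, 0.0], "class_id": 0},
        {"centroid": [4.0, 4.0], "class_id": 1}],
    2: [{"centroid": [0.0, 0.0], "class_id": 0},
        {"centroid": [1.5, 1.5], "class_id": 1}],
    4: [{"centroid": [3.0, 3.0], "class_id": 0},
        {"centroid": [1.0, 1.0], "class_id": 1}],
}
label = classify([1.2, 1.2], clusters_by_horizon)
print(label)  # 1
```

Here horizon 1 votes for class 0 while horizons 2 and 4 vote for class 1, so class 1 is reported.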

13
Empirical Results
  • Pentium III, 512 MB, Windows XP
  • Both real and synthetic data sets
  • Advantage
  • much higher classification accuracy
  • Good scalability in terms of dimensionality and
    the number of class labels
  • stable processing rate
  • Space-efficient

14
Experiment
15
Experiment
16
Experiment