An Adaptive Nearest Neighbor Classification Algorithm for Data Streams


1
An Adaptive Nearest Neighbor Classification
Algorithm for Data Streams
  • Yan-Nei Law, Carlo Zaniolo
  • University of California, Los Angeles
  • PKDD, Porto, 2005

2
Outline
  • Related Work
  • ANNCAD
  • Properties of ANNCAD
  • Conclusion

3
Classifying Data Streams
  • Problem Statement: We seek an algorithm for
    classifying data streams with numerical
    attributes; it will work for totally ordered
    domains too.
  • Desiderata:
  • Fast update speed for newly arriving records.
  • Requires only a single pass over the data.
  • Incremental algorithms are needed.
  • Copes with concept changes.
  • Classical mining algorithms were not designed for
    data streams and need to be replaced or modified.

4
Classifying Data Streams: Related Work
  • Hoeffding trees
  • VFDT and CVFDT build a decision tree
    incrementally.
  • Require a large number of examples to obtain a
    classifier with fair performance.
  • Unsatisfactory performance when the training set is
    small.
  • Ensembles
  • Combine base models by a voting technique.
  • Suitable for coping with concept drift.
  • Fail to provide a simple model and understanding
    of the problem.

5
State of the Art: Nearest Neighbor Classifiers
  • Pros and cons
  • + Strong intuitive appeal and simple to
    implement
  • - Fail to provide simple models/rules
  • - Expensive computations
  • ANN: Approximate Nearest Neighbor with error
    guarantee 1+ε
  • Idea: pre-process the data by building a data
    structure (e.g. ring-cover tree) to speed up the
    searches.
  • Designed for stored data only.
  • Time to update the pre-processed structure depends
    on the size of the data set, which may be unbounded.

6
Our Algorithm: ANNCAD (Adaptive NN
Classification Algorithm for Data Streams)
  • Model building
  • Pre-assign classes to obtain an approximate
    result and provide simple models/rules.
  • Decompose the feature space to make
    classification decisions.
  • Akin to wavelets.
  • Classification
  • Find NN for classification adaptively.
  • Progressively expand the search
    area around a test point (star).

7
Quantize Feature Space and Compute
Multi-resolution Coefficients
  • Quantize the feature space and record information
    into data arrays.
  • [Figure: a set of 100 two-class training points.]
  • [Figure: multi-resolution representation of a
    two-class data set.]
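
The quantization and multi-resolution steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a 2-d feature space scaled to [0, 1), a grid size g that is a power of two, one count array per class, and coarser levels formed by averaging each 2 x 2 block (the averaging rule matches the numbers on the Incremental Update slide). The names `quantize`, `coarsen` and `pyramid` are hypothetical.

```python
def quantize(points, g):
    """Count the points of one class in each cell of a g x g grid over [0, 1)^2."""
    grid = [[0] * g for _ in range(g)]
    for x, y in points:
        col = min(int(x * g), g - 1)  # clamp so that x == 1.0 stays in range
        row = min(int(y * g), g - 1)
        grid[row][col] += 1
    return grid


def coarsen(grid):
    """Average each 2 x 2 block to obtain the next coarser level."""
    half = len(grid) // 2
    return [[(grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
              + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4
             for c in range(half)]
            for r in range(half)]


def pyramid(grid):
    """Return all levels, from the finest grid down to a single cell."""
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels


# Three points of one class on a 4 x 4 grid.
levels = pyramid(quantize([(0.1, 0.1), (0.1, 0.2), (0.9, 0.9)], 4))
print(levels[1])  # 2 x 2 block averages
print(levels[2])  # overall average
```

One such pyramid per class gives the per-block class evidence that the later slides label and search.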
8
Building a Classifier
Label each block with its majority class
Label a block only if C1st - C2nd > 80%
Hierarchical structure of ANNCAD Classifier
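
A rough sketch of the labelling rule: a block takes its majority class, but only when the lead of the largest class over the second largest exceeds a threshold. How the slide normalizes that difference is not recoverable from the transcript; the sketch assumes it is measured as a fraction of the block's total, and `label_block` is a hypothetical name.

```python
def label_block(counts, threshold=0.0):
    """Return the majority class of a block, or None if the block is
    empty or the majority is not clear enough.

    counts: dict mapping class -> count (or averaged count) in the block.
    threshold: required lead of the top class over the runner-up,
               as a fraction of the block total (an assumption).
    """
    total = sum(counts.values())
    if total == 0:
        return None
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    first = ranked[0][1]
    second = ranked[1][1] if len(ranked) > 1 else 0
    if (first - second) / total > threshold:
        return ranked[0][0]
    return None


print(label_block({"c1": 19, "c2": 1}, threshold=0.8))  # clear majority
print(label_block({"c1": 6, "c2": 4}, threshold=0.8))   # too close to call
```

Leaving ambiguous blocks unlabelled defers the decision to a coarser level of the hierarchy.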
9
Decision Algorithm on the ANNCAD Hierarchy
The combined classifier over multiple levels
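
One simplified way to read the decision procedure: look up the block containing the test point at the finest level and, if it carries no label, move to the enclosing block one level up. The sketch below assumes 2-d grids of labels stored finest first, each level halving the resolution; it omits any neighbour handling the full algorithm may perform, and `classify` is a hypothetical name.

```python
def classify(levels, row, col):
    """Return the label of the finest labelled block containing cell (row, col).

    levels[0] is the finest grid of labels (None = unlabelled); each
    following level has half the resolution along every axis.
    """
    for grid in levels:
        label = grid[row][col]
        if label is not None:
            return label
        row, col = row // 2, col // 2  # move to the enclosing coarser block
    return None


fine = [[None] * 4 for _ in range(4)]
fine[0][0] = "A"
levels = [fine, [["A", None], [None, "B"]], [["A"]]]

print(classify(levels, 0, 0))  # labelled at the finest level
print(classify(levels, 3, 3))  # resolved one level up
print(classify(levels, 0, 3))  # resolved at the top level
```

Coarser blocks cover a wider area, so climbing the hierarchy corresponds to expanding the search around the test point.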
10
Incremental Update
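Before the update (4 x 4 cell counts, 2 x 2 block averages, overall average):
   0   8   8   0
  10   9   0   0
  10   2   1   0
   0   0   0   0

  6.75  2
  3     0.25

  3

After one new point arrives in the bottom-right cell:
   0   8   8   0
  10   9   0   0
  10   2   1   0
   0   0   0   1

  6.75  2
  3     0.5

  3.0625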
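
The arithmetic in this example can be reproduced with a short sketch. It assumes each coarser level stores the average of its 2 x 2 children, so a new point adds 1 to one finest cell, 1/4 to its parent, 1/16 to the grandparent, and so on. The function names are hypothetical.

```python
def coarsen(grid):
    """Average each 2 x 2 block to obtain the next coarser level."""
    half = len(grid) // 2
    return [[(grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
              + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4
             for c in range(half)]
            for r in range(half)]


def build_levels(finest):
    """Build all levels from the finest grid down to a single cell."""
    levels = [[[float(v) for v in row] for row in finest]]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels


def add_point(levels, row, col):
    """Add one point at finest cell (row, col); return the number of cells touched."""
    delta, touched = 1.0, 0
    for grid in levels:
        grid[row][col] += delta
        touched += 1
        row, col = row // 2, col // 2  # parent block at the next level
        delta /= 4                     # a parent averages four children
    return touched


counts = [[0, 8, 8, 0],
          [10, 9, 0, 0],
          [10, 2, 1, 0],
          [0, 0, 0, 0]]
levels = build_levels(counts)
print(levels[1], levels[2])       # [[6.75, 2.0], [3.0, 0.25]] [[3.0]]
touched = add_point(levels, 3, 3)
print(levels[1], levels[2])       # [[6.75, 2.0], [3.0, 0.5]] [[3.0625]]
print(touched)                    # 3
```

Only one cell per level changes, so for a 4 x 4 grid the update touches three cells regardless of how many points were seen before.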

11
Concept Drift Adaptation by Exponential
Forgetting
  • Data array Φ, forgetting factor 0 ≤ λ ≤ 1
  • Φnew ← λ Φold
  • No effect if there are no concept changes
  • Adapts quickly (exponentially) if the concept changes
  • No extra memory needed (a sliding window would
    require it).
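
Exponential forgetting can be sketched as scaling every stored count by the forgetting factor before the next record is added. The slide does not say how often the decay is applied; the sketch assumes once per arriving record and uses a naive full pass over a single-class array. The names `forget_and_add` and `lam` are hypothetical.

```python
def forget_and_add(grid, row, col, lam=0.98):
    """Decay all counts by lam, then record one new point at (row, col)."""
    for r in range(len(grid)):
        for c in range(len(grid[r])):
            grid[r][c] *= lam  # old evidence fades geometrically
    grid[row][col] += 1.0


# One row, two cells: old evidence sits in cell 0, new records land in cell 1.
grid = [[4.0, 0.0]]
forget_and_add(grid, 0, 1, lam=0.5)
print(grid)  # [[2.0, 1.0]]
forget_and_add(grid, 0, 1, lam=0.5)
print(grid)  # [[1.0, 1.5]]
```

With a factor close to 1 (such as the 0.98 used in the concept-shift experiment) old counts fade more slowly, but the weight of a record still shrinks geometrically with its age, and no window of past records has to be stored.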

12
Grid Position and Resolution
  • Problem: The neighborhood decision strongly depends
    on grid position.
  • Solution: Build several classifiers by shifting the
    grid position by 1/n. Then combine the results by
    voting.
  • Thm. Let x be a test point, with n^d classifiers, and
    let b(x) be the blocks containing x. Then ∀ z ∉ b(x),
    ∃ y ∈ b(x) such that
    dist(x, y) < (1 + 1/(n-1)) dist(x, z).
  • In practice, only 2-3 classifiers are needed to achieve a
    good result.
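
A minimal sketch of the shifted-grid ensemble, under simplifying assumptions: a single resolution level, majority labels per cell, and n classifiers whose grids are shifted by k/n of a cell along all axes at once (the theorem's n^d classifiers would shift each axis independently). All function names are hypothetical.

```python
import math
from collections import Counter


def cell_of(point, g, offset):
    """Cell index of a point on a grid with g cells per axis, shifted by `offset` cell widths."""
    return tuple(math.floor(x * g + offset) for x in point)


def train(points, labels, g, offset):
    """Label each occupied cell of one shifted grid with its majority class."""
    votes = {}
    for p, y in zip(points, labels):
        votes.setdefault(cell_of(p, g, offset), Counter())[y] += 1
    return {cell: c.most_common(1)[0][0] for cell, c in votes.items()}


def build_ensemble(points, labels, g, n):
    """Train n classifiers whose grids are shifted by 0, 1/n, ..., (n-1)/n of a cell."""
    return [(k / n, train(points, labels, g, k / n)) for k in range(n)]


def predict(models, point, g):
    """Combine the classifiers by voting; cells without a label abstain."""
    ballot = Counter()
    for offset, model in models:
        label = model.get(cell_of(point, g, offset))
        if label is not None:
            ballot[label] += 1
    return ballot.most_common(1)[0][0] if ballot else None


points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
labels = ["a", "a", "b"]
models = build_ensemble(points, labels, g=2, n=2)
print(predict(models, (0.15, 0.15), g=2))
print(predict(models, (0.6, 0.6), g=2))
```

A test point that falls in an empty cell of one grid can still be covered by a shifted grid, which is what reduces the dependence on grid position.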

13
Properties of ANNCAD
  • Compact support: the locality property allows fast
    updates.
  • Dealing with noise: can set a threshold for the
    classification decision.
  • Multi-resolution: controls the fineness of the
    result, or optimizes the use of system resources.
  • Low complexity (g^d = total number of cells)
  • Building classifier: O(min(N, g^d))
  • Testing: O(log2(g) + 2^d)
  • Updating: log2(g) + 1
14
Experiments
  • Synthetic Data
  • 3-d unit cube
  • Class distribution
  • Class 0 inside sphere with radius 0.5; class 1
    outside
  • 3000 training examples
  • 1000 test examples
  • Exact ANN
  • Expand the search area by doubling the radius
    until some training point is reached.
  • Classify the test point with the majority class.
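
The synthetic setup and the radius-doubling baseline can be sketched as follows. The slide does not give the sphere's centre or the initial radius; the sketch assumes the centre of the cube and a hypothetical starting radius `r0`, and it uses a brute-force scan instead of any index structure.

```python
import math
import random
from collections import Counter

CENTER = (0.5, 0.5, 0.5)  # assumed: sphere centred in the unit cube


def make_data(n, rng):
    """Draw n uniform points in the 3-d unit cube; class 0 inside the sphere, class 1 outside."""
    data = []
    for _ in range(n):
        p = (rng.random(), rng.random(), rng.random())
        data.append((p, 0 if math.dist(p, CENTER) < 0.5 else 1))
    return data


def classify_by_doubling(train, x, r0=0.01):
    """Double the search radius until it contains a training point, then take the majority class."""
    if not train:
        raise ValueError("training set is empty")
    r = r0
    while True:
        hits = [label for p, label in train if math.dist(p, x) <= r]
        if hits:
            return Counter(hits).most_common(1)[0][0]
        r *= 2


rng = random.Random(0)
train_set = make_data(3000, rng)
test_set = make_data(1000, rng)
correct = sum(classify_by_doubling(train_set, p) == y for p, y in test_set)
print(correct / len(test_set))  # accuracy of the baseline on this draw
```

The printed accuracy depends on the random draw and is not meant to reproduce the figures reported in the talk.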

[Figure: (a) different initial resolutions; (b) different ensembles.]
15
Experiments (Cont.)
  • Real Data 1: Letter Recognition
  • Objective: identify a pixel display as one of
    the 26 letters.
  • 16 numerical attributes describe each pixel
    display.
  • 15,000 training examples
  • 5,000 test examples
  • Add 5% noise by randomly assigning classes.
  • Grid size: 16 units
  • Classifiers: 2

16
ANNCAD vs. VFDT
  • Real Data 2: Forest Cover Type
  • Objective: predict forest cover type.
  • 10 numerical attributes
  • 12,000 training examples
  • 9,000 test examples
  • Grid size: 32 units
  • Classifiers: 2

17
Concept Shift: ANNCAD vs. CVFDT
  • Real Data 3: Adult
  • Objective: determine whether a person's salary is > 50K.
  • Concept shift simulation: group by race
  • λ = 0.98
  • Grid size: 64
  • Classifiers: 2

18
Conclusion and Future Work
  • ANNCAD
  • An incremental classification algorithm that finds
    adaptive nearest neighbors.
  • Suitable for mining data streams: fast update
    speed.
  • Exponential forgetting for concept shift/drift.
  • Future Work: Detect concept shift/drift by
    changes in the class labels of blocks.

19
  • THANK YOU!