Title: An Adaptive Nearest Neighbor Classification Algorithm for Data Streams
1. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams
- Yan-Nei Law, Carlo Zaniolo
- University of California, Los Angeles
- PKDD, Porto, 2005
2. Outline
- Related Work
- ANNCAD
- Properties of ANNCAD
- Conclusion
3. Classifying Data Streams
- Problem statement: we seek an algorithm for classifying data streams with numerical attributes; it will also work for totally ordered domains.
- Desiderata
  - Fast update speed for newly arriving records.
  - Only a single pass over the data is required.
  - Incremental algorithms are needed.
  - Coping with concept changes.
- Classical mining algorithms were not designed for data streams and need to be replaced or modified.
4. Classifying Data Streams: Related Work
- Hoeffding trees
  - VFDT and CVFDT build decision trees incrementally.
  - Require a large number of examples to obtain a classifier with fair performance.
  - Unsatisfactory performance when the training set is small.
- Ensembles
  - Combine base models by voting.
  - Suitable for coping with concept drift.
  - Fail to provide a simple model and understanding of the problem.
5. State of the Art: Nearest Neighbor Classifiers
- Pros and cons
  - (+) Strong intuitive appeal and simple to implement.
  - (-) Fail to provide simple models/rules.
  - (-) Expensive computations.
- ANN: Approximate Nearest Neighbor with error guarantee 1+ε
  - Idea: pre-process the data by devising a data structure (e.g. ring-cover tree) to speed up the search.
  - Designed for stored data only.
  - The time to update the pre-processed structure depends on the size of the data set, which may be infinite.
6. Our Algorithm: ANNCAD (Adaptive NN Classification Algorithm for Data Streams)
- Model building
  - Pre-assign classes to obtain an approximate result and provide simple models/rules.
  - Decompose the feature space to make classification decisions.
  - Akin to wavelets.
- Classification
  - Find the NN for classification adaptively.
  - Progressively expand the search area around a test point (marked by a star).
7. Quantize Feature Space and Compute Multi-resolution Coefficients
Quantize the feature space and record the information in data arrays.
[Figure: a set of 100 two-class training points]
[Figure: multi-resolution representation of a two-class data set]
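The quantization and multi-resolution steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a 2-d feature space scaled to the unit square, a grid size g that is a power of 2, and block averaging as the coarsening rule (which matches the numbers on the Incremental Update slide). The names `quantize`, `coarsen` and `pyramid` are hypothetical.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def quantize(points: Sequence[Tuple[float, float]], g: int) -> Grid:
    """Count the points of one class in each cell of a g x g grid over the unit square."""
    grid = [[0.0] * g for _ in range(g)]
    for x, y in points:
        col = min(int(x * g), g - 1)  # clamp x == 1.0 into the last cell
        row = min(int(y * g), g - 1)
        grid[row][col] += 1
    return grid


def coarsen(grid: Grid) -> Grid:
    """Average each 2 x 2 block of cells into one cell of the next coarser level."""
    half = len(grid) // 2
    return [
        [
            (grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
             + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4
            for c in range(half)
        ]
        for r in range(half)
    ]


def pyramid(grid: Grid) -> List[Grid]:
    """Return every resolution level, finest first, down to a single cell."""
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels
```

Applied to the 4 x 4 array shown on the Incremental Update slide, `pyramid` yields the 2 x 2 level 6.75, 2, 3, 0.25 and the single top-level value 3.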
8. Building a Classifier
Label each block with its majority class.
Label a block only if C1st - C2nd > 80%.
[Figure: hierarchical structure of the ANNCAD classifier]
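A minimal sketch of the labeling rule, under stated assumptions: the slide does not say how the two largest class values are normalized, so this version compares their difference as a fraction of the block total. The function name `label_block` and the `MIXED` tag for blocks that fail the threshold are hypothetical.

```python
from typing import Dict, Optional

MIXED = "M"  # hypothetical tag for a non-empty block with no clear majority


def label_block(counts: Dict[str, float], threshold: float = 0.0) -> Optional[str]:
    """Return the majority class of a block, MIXED, or None for an empty block.

    Assumption: the margin between the largest and second-largest class
    is measured as a fraction of the block total.
    """
    total = sum(counts.values())
    if total == 0:
        return None  # empty block: nothing to label
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    top_class, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    if (top - second) / total > threshold:
        return top_class
    return MIXED
```

With `threshold=0.0` this reduces to plain majority labeling (ties become `MIXED`); raising the threshold leaves noisy blocks unlabeled.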
9. Decision Algorithm on the ANNCAD Hierarchy
[Figure: the combined classifier over multiple levels]
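One simplified way to read a decision off such a hierarchy is to start at the finest level and move to the enclosing coarser block whenever the current block is empty. The sketch below is an assumption-laden simplification rather than the decision algorithm from the slides: it works in 2-d, expects one label grid per level (finest first, each level halving the resolution), and gives up when it meets a block tagged as mixed. The names `classify` and `MIXED` are hypothetical.

```python
from typing import List, Optional

LabelGrid = List[List[Optional[str]]]
MIXED = "M"  # hypothetical tag for a block without a clear majority


def classify(levels: List[LabelGrid], row: int, col: int) -> Optional[str]:
    """Classify the finest-level cell (row, col) by climbing the hierarchy.

    levels[0] is the finest grid; levels[k] has half the resolution of levels[k - 1].
    Returns a class label, or None when no decision can be made.
    """
    for k, grid in enumerate(levels):
        label = grid[row >> k][col >> k]  # block containing the cell at level k
        if label is None:
            continue  # empty block: widen the search to the parent block
        return None if label == MIXED else label
    return None
```

Each step up the hierarchy enlarges the neighborhood of the test point, which is the "progressively expand the search area" idea from the overview slide.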
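10. Incremental Update
Before the update:
  Finest level (4 x 4 cells):
     0  8  8  0
    10  9  0  0
    10  2  1  0
     0  0  0  0
  Next level (2 x 2 blocks, each the average of its 4 cells):
    6.75  2
    3     0.25
  Coarsest level (average of all cells): 3
After one new record arrives in the bottom-right cell:
  Finest level (4 x 4 cells):
     0  8  8  0
    10  9  0  0
    10  2  1  0
     0  0  0  1
  Next level (2 x 2 blocks):
    6.75  2
    3     0.5
  Coarsest level: 3.0625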
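The arithmetic of this example can be reproduced with a short sketch. It assumes a 2-d grid whose levels store block averages, so one new record adds 1 to its finest cell and 1/4^k to the enclosing block at level k. The function name `add_record` is hypothetical.

```python
from typing import List

Grid = List[List[float]]


def add_record(levels: List[Grid], row: int, col: int, weight: float = 1.0) -> None:
    """Add one record to finest cell (row, col) and patch every coarser level in place.

    levels[0] is the finest grid; levels[k] holds averages over 2^k x 2^k cells,
    so the record contributes weight / 4**k at level k.
    """
    for k, grid in enumerate(levels):
        grid[row >> k][col >> k] += weight / (4 ** k)
```

Only one cell per level is touched, so an update on a g x g grid modifies log2(g) + 1 values: three for the 4 x 4 example, where the bottom-right block goes from 0.25 to 0.5 and the top value from 3 to 3.0625.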
11. Concept Drift Adaptation by Exponential Forgetting
- Data array Φ, forgetting factor 0 ≤ λ ≤ 1
- Φ_new ← λ · Φ_old
- No effect if the concept does not change.
- Adapts quickly (exponentially) if the concept changes.
- No extra memory needed (a sliding window would require it).
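A minimal sketch of exponential forgetting on a data array. The assumptions here are that the decay is applied to every cell in place before new records are added, and that the factor (named `lam` below) lies in [0, 1]. The function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def forget(grid: Grid, lam: float) -> None:
    """Scale every cell of the data array by the forgetting factor, in place."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("forgetting factor must lie in [0, 1]")
    for row in grid:
        for c in range(len(row)):
            row[c] *= lam


def remaining_weight(lam: float, steps: int) -> float:
    """Weight that a record still carries after `steps` decay steps."""
    return lam ** steps
```

Because all cells shrink by the same factor, the majority class of a block does not change while the concept is stable. With a factor of 0.98, a record keeps roughly 13% of its weight after 100 decay steps, and no buffer of past records is needed.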
12. Grid Position and Resolution
- Problem: the neighborhood decision strongly depends on the grid position.
- Solution: build several classifiers by shifting the grid position by 1/n, then combine the results by voting.
- Thm. Let x be a test point and b(x) the set of blocks containing x in the n^d classifiers. Then for every point z outside the blocks of b(x) there exists a point y inside them such that dist(x, y) < (1 + 1/(n-1)) · dist(x, z).
- In practice, 2-3 classifiers are enough to achieve a good result.
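The shift-and-vote idea can be sketched as follows. This is an illustration under assumptions, not the authors' code: it shows a single coordinate scaled to [0, 1), measures the shift as a fraction of one cell width, and breaks voting ties in favor of the label seen first. The names `shifted_cell` and `vote` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def shifted_cell(x: float, g: int, shift: float) -> int:
    """Index of the cell containing x in a 1-d grid of width 1/g.

    The grid origin is moved left by `shift` cell widths (0 <= shift < 1),
    so indices run from 0 to g inclusive.
    """
    return int(x * g + shift)


def vote(labels: Sequence[Optional[str]]) -> Optional[str]:
    """Majority vote over the classifiers that returned a label (None = abstain)."""
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]  # ties go to the label seen first
```

With g = 2, the point 0.49 falls in cell 0 of the unshifted grid but in cell 1 of the grid shifted by half a cell, so the two classifiers look at different neighborhoods of the same point before their answers are combined.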
13. Properties of ANNCAD
- Compact support: the locality property allows fast updates.
- Dealing with noise: a threshold can be set for the classification decision.
- Multi-resolution: controls the fineness of the result, or optimizes the use of system resources.
- Low complexity (g^d = total number of cells)
  - Building the classifier: O(min(N, g^d)).
  - Testing: O(log2(g) + 2^d).
  - Updating: log2(g) + 1.
14. Experiments
- Synthetic data
  - 3-d unit cube
  - Class distribution: class 0 inside a sphere with radius 0.5, class 1 outside
  - 3000 training examples
  - 1000 test examples
- Exact ANN
  - Expand the search area by doubling the radius until it reaches some training point.
  - Classify the test point with the majority class.
[Figure panels: (a) different initial resolutions; (b) different ensembles]
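The synthetic setup and the radius-doubling baseline can be sketched as below. Several details are assumptions because the slide does not state them: points are drawn uniformly, the sphere is centered in the cube, the initial radius is 0.01, and the search is a brute-force scan. The names `make_data` and `classify_by_doubling` are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float, float]
Example = Tuple[Point, int]

CENTER: Point = (0.5, 0.5, 0.5)  # assumption: sphere centered in the unit cube


def make_data(n: int, seed: int = 0) -> List[Example]:
    """Draw n uniform points in the unit cube; class 0 inside the sphere, class 1 outside."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        p = (rng.random(), rng.random(), rng.random())
        label = 0 if math.dist(p, CENTER) <= 0.5 else 1
        data.append((p, label))
    return data


def classify_by_doubling(train: List[Example], x: Point, r0: float = 0.01) -> int:
    """Double the search radius until it covers a training point, then take the majority class."""
    if not train:
        raise ValueError("training set is empty")
    r = r0
    while True:
        hits = [label for p, label in train if math.dist(p, x) <= r]
        if hits:
            return Counter(hits).most_common(1)[0][0]
        r *= 2
```

Under the centering assumption the sphere fills π/6 of the cube, about 52%, so the two classes are roughly balanced. Each query scans the whole training set once per radius, which is workable for 3000 points but is the cost the grid hierarchy is meant to avoid.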
15. Experiments (Cont)
- Real Data 1 -- Letter Recognition
  - Objective: identify a pixel display as one of the 26 letters.
  - 16 numerical attributes describe each pixel display.
  - 15,000 training examples
  - 5,000 test examples
  - Add 5% noise by randomly assigning classes.
  - Grid size: 16 units
  - Classifiers: 2
16. ANNCAD vs VFDT
- Real Data 2 -- Forest Cover Type
  - Objective: predict the forest cover type.
  - 10 numerical attributes
  - 12,000 training examples
  - 9,000 test examples
  - Grid size: 32 units
  - Classifiers: 2
17. Concept Shift: ANNCAD vs CVFDT
- Real Data 3 -- Adult
  - Objective: determine whether a person's salary > 50K.
  - Concept shift simulation: group the records by race.
  - λ = 0.98
  - Grid size: 64
  - Classifiers: 2
18. Conclusion and Future Work
- ANNCAD
  - An incremental classification algorithm that finds the NN adaptively.
  - Suitable for mining data streams: fast update speed.
  - Exponential forgetting for concept shift/drift.
- Future work: detect concept shift/drift by changes in the class labels of blocks.