An Adaptive Nearest Neighbor Classification Algorithm for Data Streams presentation

1
An Adaptive Nearest Neighbor Classification
Algorithm for Data Streams

Yan-Nei Law and Carlo Zaniolo
University of California, Los Angeles
PKDD, Porto, 2005

2
Outline

Related Work
ANNCAD
Properties of ANNCAD
Conclusion

3
Classifying Data Streams

Problem Statement: We seek an algorithm for
classifying data streams with numerical
attributes; it will also work for totally
ordered domains.
Desiderata:
Fast update speed for newly arriving records.
Only a single pass over the data is required.
Incremental algorithms are needed.
Coping with concept changes.
Classical mining algorithms were not designed for
data streams and need to be replaced or modified.

4
Classifying Data Streams: Related Work

Hoeffding trees
VFDT and CVFDT build decision trees
incrementally.
They require a large number of examples to obtain
a classifier with fair performance.
Performance is unsatisfactory when the training
set is small.
Ensembles
Combine base models by a voting technique.
Suitable for coping with concept drift.
Fail to provide a simple model and an
understanding of the problem.

5
State of the Art: Nearest Neighbor Classifiers

Pros and cons
+ Strong intuitive appeal and simple to
implement
- Fail to provide simple models/rules
- Expensive computations
ANN: Approximate Nearest Neighbor with error
guarantee 1+ε
Idea: pre-process the data by devising a data
structure (e.g. ring-cover tree) to speed up
searches.
Designed for stored data only.
The time to update the pre-processing step depends
on the size of the data set, which may be infinite.

6
Our Algorithm: ANNCAD (Adaptive NN
Classification Algorithm for Data Streams)

Model building
Pre-assign classes to obtain an approximate
result and provide simple models/rules.
Decompose the feature space to make
classification decisions.
Akin to wavelets.
Classification
Find the NN for classification adaptively.
Progressively expand the search area around
a test point (shown as a star).

7
Quantize Feature Space and Compute
Multi-resolution Coefficients

Quantize the feature space and record the
information in data arrays.
Figure: a set of 100 two-class training points.
Figure: multi-resolution representation of a
two-class data set.
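The quantization step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a two-dimensional unit square, one count array per class, and 2 x 2 averaging as the multi-resolution step (an assumption that matches the numbers on the Incremental Update slide). The function names `quantize`, `coarsen`, and `pyramid` are hypothetical.

```python
def quantize(points, g):
    """Count the points of the unit square that fall into each cell of a g x g grid."""
    grid = [[0] * g for _ in range(g)]
    for x, y in points:
        # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
        i = min(int(y * g), g - 1)
        j = min(int(x * g), g - 1)
        grid[i][j] += 1
    return grid


def coarsen(grid):
    """Average each 2 x 2 group of cells into one cell of the next coarser level."""
    half = len(grid) // 2
    return [
        [
            (grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4
            for j in range(half)
        ]
        for i in range(half)
    ]


def pyramid(grid):
    """Return all levels, finest first, down to a single cell (g must be a power of 2)."""
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels
```

With one such pyramid per class, each block at each level carries a summary of how many points of that class fall inside it.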

8
Building a Classifier

Label each block with its majority class.
Label a block only if C1st - C2nd > 80%.
Figure: hierarchical structure of the ANNCAD
classifier.
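The labeling rule can be sketched as follows. The slide does not say how the difference between the largest and second-largest class is normalized, so this sketch assumes it is taken as a fraction of the block's points; `label_block` and its `threshold` parameter are hypothetical names.

```python
def label_block(class_counts, threshold=0.8):
    """Return the majority class of a block, or None if the block stays unlabeled.

    class_counts maps a class label to the number (or weight) of points of that
    class in the block. ASSUMPTION: the margin between the largest and the
    second-largest class is measured relative to the block total.
    """
    total = sum(class_counts.values())
    if total == 0:
        return None  # empty block: nothing to label
    ranked = sorted(class_counts.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return top_label if (top - second) / total > threshold else None
```

Blocks that stay unlabeled at a fine level can still be resolved at a coarser level of the hierarchy.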

9
Decision Algorithm on the ANNCAD Hierarchy

Figure: the combined classifier over multiple
levels.
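The slide only shows a figure, so the following is one plausible reading rather than the authors' procedure: start at the finest level and fall back to the next coarser level whenever the block containing the test point has no label. The function name `classify` and the two-dimensional layout are assumptions made for illustration.

```python
def classify(levels, x, y):
    """Classify the point (x, y) of the unit square.

    levels[0] is the finest grid of block labels and levels[-1] the coarsest;
    each entry is a class label or None for an unlabeled block.
    ASSUMPTION: the search moves from fine to coarse and stops at the first
    labeled block that contains the point.
    """
    for labels in levels:
        g = len(labels)
        i = min(int(y * g), g - 1)
        j = min(int(x * g), g - 1)
        if labels[i][j] is not None:
            return labels[i][j]
    return None  # no level has a label for this region
```

Moving to a coarser level enlarges the neighborhood considered around the test point, which is how the search area expands adaptively.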
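
10
Incremental Update

Before the update:
Finest level (4 x 4 cells):
0 8 8 0
10 9 0 0
10 2 1 0
0 0 0 0
Next level (averages of 2 x 2 cells):
6.75 2
3 0.25
Coarsest level (overall average):
3

After one new point arrives in the bottom-right cell:
Finest level (4 x 4 cells):
0 8 8 0
10 9 0 0
10 2 1 0
0 0 0 1
Next level (averages of 2 x 2 cells):
6.75 2
3 0.5
Coarsest level (overall average):
3.0625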
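The numbers on this slide are consistent with 2 x 2 averaging between levels, so a new point changes exactly one cell per level. The sketch below reproduces the slide's example under that assumption; `add_point` is a hypothetical helper name.

```python
def add_point(levels, i, j, weight=1.0):
    """Add one point to finest-level cell (i, j) and update every coarser level.

    levels[0] is the finest grid. Each coarser cell is the average of a 2 x 2
    group, so the contribution shrinks by a factor of 4 per level. Only one
    cell per level is touched: log2(g) + 1 cells in total.
    """
    for grid in levels:
        grid[i][j] += weight
        i, j, weight = i // 2, j // 2, weight / 4


levels = [
    [[0, 8, 8, 0], [10, 9, 0, 0], [10, 2, 1, 0], [0, 0, 0, 0]],
    [[6.75, 2], [3, 0.25]],
    [[3]],
]
add_point(levels, 3, 3)  # new point in the bottom-right cell
print(levels[1][1][1], levels[2][0][0])  # 0.5 3.0625
```

No other cell needs to be read or recomputed, which is why the update cost does not depend on the number of points seen so far.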

11
Concept Drift Adaptation by Exponential
Forgetting

Data array Φ, forgetting factor 0 ≤ λ ≤ 1
Φ_new ← λ · Φ_old
No effect if the concept does not change
Adapts quickly (exponentially) if the concept changes
No extra memory needed (no sliding window required)
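Exponential forgetting can be sketched as scaling the stored array by the forgetting factor before new counts are added. The slide does not say when the scaling is applied, so doing it once per batch of new records is an assumption here, and `decay_and_add` is a hypothetical name.

```python
def decay_and_add(grid, new_counts, factor):
    """Scale the old data array by `factor` (between 0 and 1), then add new counts.

    ASSUMPTION: decay is applied once per incoming batch. The array is updated
    in place, so no window of past records has to be stored.
    """
    for i, row in enumerate(grid):
        for j in range(len(row)):
            row[j] = factor * row[j] + new_counts[i][j]
    return grid


# Old concept: all mass in cell (0, 0). New concept: all mass in cell (1, 1).
grid = [[8.0, 0.0], [0.0, 0.0]]
for _ in range(3):
    decay_and_add(grid, [[0, 0], [0, 1]], factor=0.5)
print(grid)  # [[1.0, 0.0], [0.0, 1.75]]
```

The weight of the old concept halves with every batch in this example, so it fades exponentially while the new concept accumulates.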

12
Grid Position and Resolution

Problem: The neighborhood decision strongly depends
on the grid position.
Solution: Build several classifiers by shifting the
grid position by 1/n, then combine the results by
voting.
Thm. Let x be a test point and b(x) the blocks
containing x in the n^d classifiers. Then
∀ z ∉ b(x), ∃ y ∈ b(x) such that
dist(x,y) < (1 + 1/(n-1)) dist(x,z).
In practice, 2-3 classifiers are enough to achieve
a good result.
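A minimal sketch of the shifted-grid ensemble, assuming each classifier simply labels a block with the majority class of the training points in it and that shifts are expressed in cell widths. The names `block_index`, `train`, and `vote` are hypothetical, and a single resolution is used to keep the example short.

```python
from collections import Counter


def block_index(point, g, shift):
    """Block containing `point` on a grid of g cells per axis, offset by
    `shift` cell widths (0 <= shift < 1). Coordinates lie in the unit cube."""
    return tuple(int(c * g + shift) for c in point)


def train(points, labels, g, shift):
    """Label every non-empty block of one shifted grid with its majority class."""
    counts = {}
    for p, lab in zip(points, labels):
        counts.setdefault(block_index(p, g, shift), Counter())[lab] += 1
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}


def vote(classifiers, point, g):
    """Combine the shifted classifiers by majority vote; None if no block is labeled."""
    ballots = Counter()
    for shift, model in classifiers:
        lab = model.get(block_index(point, g, shift))
        if lab is not None:
            ballots[lab] += 1
    return ballots.most_common(1)[0][0] if ballots else None


points = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.9)]
labels = [0, 0, 1, 1]
g, n = 2, 2
classifiers = [(k / n, train(points, labels, g, k / n)) for k in range(n)]
print(vote(classifiers, (0.85, 0.85), g))  # 1
```

A test point that sits near a block boundary in one grid tends to sit nearer the middle of a block in a shifted grid, which is the motivation for combining them.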

13
Properties of ANNCAD

Compact support: the locality property allows fast
updates.
Dealing with noise: a threshold can be set for the
classification decision.
Multi-resolution: control the fineness of the
result, or optimize the system resources.
Low complexity (g^d = total number of cells):
Building the classifier: O(min(N, g^d)).
Testing: O(log2(g) + 2^d).
Updating: O(log2(g) + 1).

14
Experiments

Synthetic Data
3-d unit cube
Class distribution:
class 0 inside a sphere with radius 0.5, class 1
outside
3000 training examples
1000 test examples
Exact ANN
Expand the search area by doubling the radius
until some training point is reached.
Classify the test point with the majority class.
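The synthetic setup and the radius-doubling baseline can be sketched as below. The slide does not give the sphere's center or the starting radius, so the cube center (0.5, 0.5, 0.5) and `r0=0.01` are assumptions, and the names `make_data` and `classify_by_doubling` are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
import random
from collections import Counter

CENTER = (0.5, 0.5, 0.5)  # ASSUMPTION: sphere centered in the unit cube


def make_data(n, rng):
    """Draw n uniform points in the 3-d unit cube; class 0 inside the sphere, 1 outside."""
    data = []
    for _ in range(n):
        p = (rng.random(), rng.random(), rng.random())
        data.append((p, 0 if math.dist(p, CENTER) < 0.5 else 1))
    return data


def classify_by_doubling(train, x, r0=0.01):
    """Double the search radius until it contains a training point, then
    return the majority class of the training points inside it."""
    if not train:
        raise ValueError("training set is empty")
    r = r0
    while True:
        hits = [lab for p, lab in train if math.dist(p, x) <= r]
        if hits:
            return Counter(hits).most_common(1)[0][0]
        r *= 2


rng = random.Random(0)
train_set = make_data(3000, rng)
test_set = make_data(1000, rng)
correct = sum(classify_by_doubling(train_set, p) == lab for p, lab in test_set)
print(f"baseline accuracy: {correct / len(test_set):.3f}")
```

This brute-force scan touches every training point at each radius, which is the cost the grid-based approach is meant to avoid.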

Figure panels: (a) different initial resolutions;
(b) different ensembles.

15
Experiments (Cont)

Real Data 1: Letter Recognition
Objective: identify a pixel display as one of
the 26 letters.
16 numerical attributes describe each pixel
display.
15,000 training examples
5,000 test examples
Add 5% noise by randomly assigning classes.
Grid size: 16 units
Classifiers: 2

16
ANNCAD vs. VFDT

Real Data 2: Forest Cover Type
Objective: predict the forest cover type.
10 numerical attributes.
12,000 training examples
9,000 test examples
Grid size: 32 units
Classifiers: 2

17
Concept Shift: ANNCAD vs. CVFDT

Real Data 3: Adult
Objective: determine whether a person's salary
is > 50K.
Concept shift simulation: group by race
Forgetting factor λ = 0.98
Grid size: 64
Classifiers: 2

18
Conclusion and Future Work

ANNCAD
An incremental classification algorithm that finds
adaptive NN.
Suitable for mining data streams: fast update
speed.
Exponential forgetting for concept shift/drift.
Future Work: Detect concept shift/drift by
changes in the class labels of blocks.

THANK YOU!
