Title: Nearest Neighbor Editing and Condensing Techniques
1. Nearest Neighbor Editing and Condensing Techniques
Organization
- Nearest Neighbor Revisited
- Condensing Techniques
- Proximity Graphs and Decision Boundaries
- Editing Techniques
Last updated Oct. 7, 2005
2. Nearest Neighbour Rule
Non-parametric pattern classification. Consider a
two-class problem where each sample consists of
two measurements (x, y).
k = 1:
For a given query point q, assign the class of
the nearest neighbour.
k = 3:
Compute the k nearest neighbours and assign the
class by majority vote.
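The rule above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an optimized implementation; the function name `knn_classify` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=1):
    """Classify query by majority vote among its k nearest training samples.

    train is a list of ((x, y), label) pairs; query is an (x, y) tuple.
    """
    # Sort the training samples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda s: math.dist(s[0], query))
    # Majority vote among the labels of the k closest samples.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class "B" has one sample close to class "A".
train = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"), ((0.0, 1.0), "A"),
         ((3.0, 3.0), "B"), ((4.0, 3.0), "B"), ((1.2, 1.2), "B")]
q = (1.0, 1.0)
print(knn_classify(train, q, k=1))  # B: the single nearest sample is (1.2, 1.2)
print(knn_classify(train, q, k=3))  # A: two of the three nearest samples are A
```

With k = 1 the isolated "B" sample decides the outcome; with k = 3 it is outvoted by its two "A" neighbours.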
3. Example: Digit Recognition
- Yann LeCun, MNIST Digit Recognition
- Handwritten digits
- 28x28 pixel images, d = 784
- 60,000 training samples
- 10,000 test samples
- Nearest neighbour is competitive
Method                                      Test Error Rate (%)
Linear classifier (1-layer NN)              12.0
K-nearest-neighbors, Euclidean              5.0
K-nearest-neighbors, Euclidean, deskewed    2.4
K-NN, Tangent Distance, 16x16               1.1
K-NN, shape context matching                0.67
1000 RBF + linear classifier                3.6
SVM deg 4 polynomial                        1.1
2-layer NN, 300 hidden units                4.7
2-layer NN, 300 HU, deskewing               1.6
LeNet-5, distortions                        0.8
Boosted LeNet-4, distortions                0.7
4. Nearest Neighbour Issues
- Expensive
  - To determine the nearest neighbour of a query point q, must compute the distance to all N training examples
  - Pre-sort training examples into fast data structures (kd-trees)
  - Compute only an approximate distance (LSH)
  - Remove redundant data (condensing)
- Storage Requirements
  - Must store all training data P
  - Remove redundant data (condensing)
  - Pre-sorting often increases the storage requirements
- High Dimensional Data
  - Curse of Dimensionality
  - Required amount of training data increases exponentially with dimension
  - Computational cost also increases dramatically
  - Partitioning techniques degrade to linear search in high dimension
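One way to see why partitioning techniques lose their advantage in high dimension is to measure how distances concentrate. The sketch below is only an illustration: the uniform data, the sample size of 500, the seed, and the function name `relative_contrast` are all assumptions made for the example. It compares the nearest and farthest distance from a random query point as the dimension d grows.

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """Return (farthest - nearest) / nearest distance from a random query
    to n_points random points, all drawn uniformly from the unit cube."""
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so that repeated runs agree
contrast = {d: relative_contrast(500, d, rng) for d in (2, 20, 200)}
for d, c in contrast.items():
    print(f"d={d:4d}  relative contrast={c:.2f}")
```

As d grows the contrast shrinks: the nearest neighbour is barely closer than the farthest point, so a spatial partition can rule out very few candidates.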
5. Exact Nearest Neighbour
- Asymptotic error (infinite sample size) is at most twice the Bayes classification error
- Requires a lot of training data
- Expensive for high dimensional data (d > 20?)
- O(Nd) complexity for both storage and query time
  - N is the number of training examples, d is the dimension of each sample
  - This can be reduced through dataset editing/condensing
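The asymptotic bound in the first bullet is usually written as below (the classical Cover and Hart result). The symbols are notation chosen for this note, not taken from the slides: R* is the Bayes error, R_NN is the asymptotic 1-NN error, and M is the number of classes.

```latex
R^{*} \;\le\; R_{\mathrm{NN}} \;\le\; R^{*}\left(2 - \frac{M}{M-1}\,R^{*}\right) \;\le\; 2R^{*}
```

For two classes (M = 2) the middle term reduces to 2R*(1 - R*), so the factor of two is only approached when the Bayes error is small.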
6. Decision Regions
A Voronoi diagram divides the space into cells. Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample.
Every query point will be assigned the classification of the sample within that cell.
The decision boundary separates the class regions based on the 1-NN decision rule. Knowledge of this boundary is sufficient to classify new points. The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.
7. Condensing
- Aim is to reduce the number of training samples
- Retain only the samples that are needed to define the decision boundary
- This is reminiscent of a Support Vector Machine
- Decision boundary consistent: a subset whose nearest neighbour decision boundary is identical to the boundary of the entire training set
- Consistent set: a subset of the training data that correctly classifies all of the original training data
- Minimum consistent set: the smallest consistent set
Figure panels: Original data; Condensed data; Minimum Consistent Set
8. Condensing
- Condensed Nearest Neighbour (CNN) (Hart 1968)
- Incremental
- Order dependent
- Neither minimal nor decision boundary consistent
- O(n^3) for brute-force method
- Can follow up with reduced NN (Gates 1972)
  - Remove a sample if doing so does not cause any incorrect classifications
Algorithm:
1. Initialize subset with a single training example
2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset
3. Return to 2 until no transfers occurred or the subset is full
Produces consistent set
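The three steps, plus the reduced NN follow-up, can be sketched as below. This is a brute-force illustration under stated assumptions: the function names `condense` and `reduce_nn`, the optional `max_size` cap (standing in for "the subset is full"), and the toy data are all hypothetical, and a misclassified sample is moved into the subset as soon as it is found.

```python
import math

def nearest_label(subset, point):
    """Label of the sample in subset that is closest to point (1-NN rule)."""
    return min(subset, key=lambda s: math.dist(s[0], point))[1]

def condense(train, max_size=None):
    """CNN-style condensing; train is a list of (point, label) pairs."""
    subset = [train[0]]                    # step 1: start from a single sample
    remaining = list(train[1:])
    transferred = True
    while transferred and remaining:       # step 3: repeat until a clean pass
        transferred = False
        still_out = []
        for sample in remaining:           # step 2: classify with the subset
            full = max_size is not None and len(subset) >= max_size
            if not full and nearest_label(subset, sample[0]) != sample[1]:
                subset.append(sample)      # misclassified: move into the subset
                transferred = True
            else:
                still_out.append(sample)
        remaining = still_out
    return subset

def reduce_nn(subset, train):
    """Reduced-NN style pass: drop a sample from the subset if the whole
    training set is still classified correctly without it."""
    kept = list(subset)
    for sample in list(kept):
        trial = [s for s in kept if s is not sample]
        if trial and all(nearest_label(trial, p) == y for p, y in train):
            kept = trial
    return kept

# Hypothetical data: two well separated clusters of four points each.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
         ((4, 4), "B"), ((4, 5), "B"), ((5, 4), "B"), ((5, 5), "B")]
subset = condense(train)
print(subset)  # one prototype per cluster is enough here
print(all(nearest_label(subset, p) == y for p, y in train))  # consistency check
```

Because the first sample and the visiting order determine which samples are transferred, shuffling `train` can change the result, which is the order dependence noted above.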
15. Proximity Graphs
- Condensing aims to retain points along the decision boundary
- How to identify such points?
  - Neighbouring points of different classes
- Proximity graphs provide various definitions of neighbour
  - NNG: Nearest Neighbour Graph
  - MST: Minimum Spanning Tree
  - RNG: Relative Neighbourhood Graph
  - GG: Gabriel Graph
  - DT: Delaunay Triangulation (neighbours of a 1NN-classifier)
16. Proximity Graphs: Delaunay
- The Delaunay Triangulation is the dual of the Voronoi diagram
- Three points are each other's neighbours if the sphere passing through all three (their circumscribed sphere) contains no other points
- Voronoi condensing: retain those points that have at least one neighbour (as defined by the Delaunay Triangulation) of the opposite class
- The decision boundary is identical
- Conservative subset
- Retains extra points
- Expensive to compute in high dimensions
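A brute-force 2D sketch of Voronoi condensing follows. It is meant only to make the definition concrete: it tests every triple of points for an empty circumcircle, which is far too slow for real data, and it assumes points in general position (no four points on one circle). The function names and the toy data are hypothetical.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Centre and squared radius of the circle through a, b, c,
    or None if the three points are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    uy = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (ux, uy), (ux - a[0]) ** 2 + (uy - a[1]) ** 2

def delaunay_edges(points):
    """Edges of all triangles whose circumcircle contains no other point."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        (cx, cy), r2 = circle
        if all((p[0] - cx) ** 2 + (p[1] - cy) ** 2 >= r2 - 1e-9
               for m, p in enumerate(points) if m not in (i, j, k)):
            edges |= {(i, j), (i, k), (j, k)}
    return edges

def voronoi_condense(points, labels):
    """Indices of points with at least one Delaunay neighbour of another class."""
    keep = set()
    for i, j in delaunay_edges(points):
        if labels[i] != labels[j]:
            keep.update((i, j))
    return sorted(keep)

# Hypothetical data: a zigzag strip, class A on the left, class B on the right.
points = [(0, 0), (1, 0.5), (2, 0), (3, 0.5), (4, 0), (5, 0.5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(voronoi_condense(points, labels))  # the two outermost points are dropped
```

Only the two end points, whose Delaunay neighbours all share their class, are discarded; the four points near the class boundary are kept.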
17. Proximity Graphs: Gabriel
- The Gabriel graph is a subset of the Delaunay Triangulation (some decision boundary might be missed)
- Points are neighbours only if their (diametral) sphere of influence is empty
- Does not preserve the identical decision boundary, but most changes occur outside the convex hull of the data points
- Can be computed more efficiently
Green lines denote Tomek links
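The diametral-sphere test needs only squared distances: a third point r lies inside the sphere with diameter pq exactly when d(p,r)^2 + d(q,r)^2 < d(p,q)^2. The sketch below uses that test by brute force. The function names and the toy data are assumptions for the example, and points lying exactly on the sphere are treated as outside.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points of any dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gabriel_edges(points):
    """Pairs (i, j) whose diametral sphere contains no third point."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        if all(sq_dist(points[i], r) + sq_dist(points[j], r) >= d_ij
               for m, r in enumerate(points) if m not in (i, j)):
            edges.append((i, j))
    return edges

def gabriel_condense(points, labels):
    """Indices of points with at least one Gabriel neighbour of another class."""
    keep = set()
    for i, j in gabriel_edges(points):
        if labels[i] != labels[j]:
            keep.update((i, j))
    return sorted(keep)

# Hypothetical data: a zigzag strip, class A on the left, class B on the right.
points = [(0, 0), (1, 0.5), (2, 0), (3, 0.5), (4, 0), (5, 0.5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(gabriel_edges(points))             # a simple path along the zigzag
print(gabriel_condense(points, labels))  # only the pair straddling the boundary
```

On this toy strip only the two points on either side of the class boundary survive, fewer than a Delaunay-based rule would keep, because the Gabriel graph has fewer edges.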
19. Not a Gabriel Edge
20. Proximity Graphs: RNG
- The Relative Neighbourhood Graph (RNG) is a subset of the Gabriel graph
- Two points are neighbours if the lune defined by the intersection of their radial spheres is empty
- Further reduces the number of neighbours
- Decision boundary changes are often drastic, and not guaranteed to be training set consistent
Figure panels: Gabriel edited; RNG edited (not consistent)
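The lune test can be written as: p and q are neighbours if no third point r is closer to both of them than they are to each other, that is, if max(d(p,r), d(q,r)) >= d(p,q) for every r. The sketch below compares it with the Gabriel test on three points. The function names and the data are hypothetical, and ties are treated as outside the lune.

```python
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two points of any dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gabriel_edges(points):
    """Pairs (i, j) whose diametral sphere contains no third point."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if all(sq_dist(points[i], r) + sq_dist(points[j], r)
                   >= sq_dist(points[i], points[j])
                   for m, r in enumerate(points) if m not in (i, j))]

def rng_edges(points):
    """Pairs (i, j) whose lune contains no third point."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if all(max(sq_dist(points[i], r), sq_dist(points[j], r))
                   >= sq_dist(points[i], points[j])
                   for m, r in enumerate(points) if m not in (i, j))]

# Hypothetical data: a triangle whose base (0, 1) is its longest side.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.5)]
print(gabriel_edges(points))  # all three sides
print(rng_edges(points))      # the longest side (0, 1) is dropped
```

The apex of the triangle lies inside the lune of the base but outside its diametral sphere, so the base is a Gabriel edge and not an RNG edge.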
21. Dataset Reduction: Editing
- Training data may contain noise, overlapping classes
  - starting to make assumptions about the underlying distributions
- Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries
- Results in homogeneous clusters of points
22. Wilson Editing
- Wilson 1972
- Remove points that do not agree with the majority of their k nearest neighbours
Figure panels (earlier example and overlapping classes): Original data; Wilson editing with k = 7
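A minimal sketch of the editing rule, assuming that every point is judged against the original, unedited set and that a point does not vote for itself. The function name `wilson_edit`, the choice k = 3, and the toy data are assumptions for the example.

```python
import math
from collections import Counter

def wilson_edit(train, k=3):
    """Keep only the samples whose label agrees with the majority label
    of their k nearest neighbours among the other samples."""
    kept = []
    for i, (point, label) in enumerate(train):
        others = [s for j, s in enumerate(train) if j != i]
        neighbours = sorted(others, key=lambda s: math.dist(s[0], point))[:k]
        majority = Counter(l for _, l in neighbours).most_common(1)[0][0]
        if majority == label:
            kept.append((point, label))
    return kept

# Hypothetical data: two clusters, plus one "B" sample inside the "A" cluster.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
         ((0.5, 0.5), "B"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B")]
edited = wilson_edit(train, k=3)
print(len(train), "->", len(edited))  # the stray "B" sample is removed
```

The stray sample is outvoted by its three "A" neighbours and is dropped, while every other sample still agrees with its own neighbourhood and is kept.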
23. Multi-edit
1. Diffusion: divide data into N >= 3 random subsets
2. Classification: classify S_i using 1-NN with S_((i+1) mod N) as the training set (i = 1..N)
3. Editing: discard all samples incorrectly classified in (2)
4. Confusion: pool all remaining samples into a new set
5. Termination: if the last I iterations produced no editing then end; otherwise go to (1)
- Multi-edit (Devijver & Kittler 1979)
- Repeatedly apply Wilson editing to random partitions
- Classify with the 1-NN rule
- Approximates the error rate of the Bayes decision rule
Figure: Multi-edit, 8 iterations (last 3 the same)
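The five steps can be sketched as follows. The names are hypothetical: `n_subsets` plays the role of N, `patience` the role of I, and `seed` is added only to make the random partition repeatable. The loop also stops if fewer than `n_subsets` samples remain, a safeguard that is not part of the steps above.

```python
import math
import random

def nearest_label(subset, point):
    """Label of the sample in subset that is closest to point (1-NN rule)."""
    return min(subset, key=lambda s: math.dist(s[0], point))[1]

def multi_edit(train, n_subsets=3, patience=3, seed=0):
    """Multi-edit sketch; train is a list of (point, label) pairs."""
    rng = random.Random(seed)
    data = list(train)
    quiet = 0  # number of consecutive iterations without any editing
    while quiet < patience and len(data) >= n_subsets:
        rng.shuffle(data)                                    # 1. diffusion
        subsets = [data[i::n_subsets] for i in range(n_subsets)]
        kept = []
        for i, subset in enumerate(subsets):
            reference = subsets[(i + 1) % n_subsets]         # 2. classification
            kept += [(p, y) for p, y in subset
                     if nearest_label(reference, p) == y]    # 3. editing
        quiet = quiet + 1 if len(kept) == len(data) else 0   # 5. termination
        data = kept                                          # 4. confusion
    return data

# Hypothetical data: two 3x3 grids of points, one per class.
train = ([((x, y), "A") for x in range(3) for y in range(3)] +
         [((x + 6, y + 6), "B") for x in range(3) for y in range(3)])
edited = multi_edit(train)
print(len(train), "->", len(edited))
```

Each iteration draws a fresh random partition, so a sample that survives one round can still be discarded in a later one.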
24. Combined Editing/Condensing
- First edit the data to remove noise and smooth the boundary
- Then condense to obtain a smaller subset
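A small end-to-end sketch of the two stages, assuming Wilson editing for the first stage and CNN-style condensing for the second; other editing and condensing methods could be substituted. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def nearest_label(subset, point):
    """Label of the sample in subset that is closest to point (1-NN rule)."""
    return min(subset, key=lambda s: math.dist(s[0], point))[1]

def wilson_edit(train, k=3):
    """Drop samples that disagree with the majority of their k neighbours."""
    kept = []
    for i, (point, label) in enumerate(train):
        others = [s for j, s in enumerate(train) if j != i]
        neighbours = sorted(others, key=lambda s: math.dist(s[0], point))[:k]
        if Counter(l for _, l in neighbours).most_common(1)[0][0] == label:
            kept.append((point, label))
    return kept

def condense(train):
    """CNN-style condensing: grow a subset until it classifies train correctly."""
    subset, remaining, transferred = [train[0]], list(train[1:]), True
    while transferred:
        transferred, still_out = False, []
        for sample in remaining:
            if nearest_label(subset, sample[0]) != sample[1]:
                subset.append(sample)   # misclassified: move into the subset
                transferred = True
            else:
                still_out.append(sample)
        remaining = still_out
    return subset

# Hypothetical data: two clusters, plus one "B" sample inside the "A" cluster.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
         ((0.5, 0.5), "B"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B")]
print(len(condense(train)))               # condensing the raw data keeps 6 samples
print(len(condense(wilson_edit(train))))  # editing first leaves 2 prototypes
```

Without the editing stage, the stray sample forces the condensed set to keep every point around it; with it, one prototype per cluster is enough.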
25. Where are we with respect to NN?
- Simple method, pretty powerful rule
- Very popular in text mining (seems to work well for this task?)
- Can be made to run fast
- Requires a lot of training data
- Edit to reduce noise, class overlap, overfitting
- Condense to remove data that are not needed, to enhance speed
26. Problems when using k-NN in Practice
- What distance measure to use?
  - Often Euclidean distance is used
  - Locally adaptive metrics
  - More complicated with non-numeric data, or when different dimensions have different scales
- Choice of k?
  - Cross-validation
  - 1-NN often performs well in practice
  - k-NN needed for overlapping classes
  - Reduce k-NN problem to 1-NN through dataset editing
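Choosing k by cross-validation can be sketched with a leave-one-out estimate: each sample is classified by all the others, and the k with the lowest error is kept. The function names, the candidate values of k, and the toy data (two clusters with one stray sample, so the classes overlap) are assumptions for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Majority vote among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_error(train, k):
    """Leave-one-out error rate: classify each sample using all the others."""
    errors = sum(knn_classify(train[:i] + train[i + 1:], p, k) != y
                 for i, (p, y) in enumerate(train))
    return errors / len(train)

# Hypothetical data: two clusters, plus one "B" sample inside the "A" cluster.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
         ((0.5, 0.5), "B"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B")]
candidates = (1, 3, 5)
for k in candidates:
    print(f"k={k}  leave-one-out error={loo_error(train, k):.2f}")
best_k = min(candidates, key=lambda k: loo_error(train, k))
print("best k:", best_k)
```

Here 1-NN is misled by the stray sample, which is the nearest neighbour of all four "A" points, while k = 3 outvotes it; this is the overlapping-classes case in which k > 1 helps.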