Title: Some recent advances in near-neighbor learning
1. Some recent advances in near-neighbor learning
- Maya R. Gupta, University of Washington
- Eric Garcia, University of Washington
- Michael Friedlander, University of British Columbia
- William Mortenson, University of Washington
- Richard Olshen, Stanford
- Andrey Stroilov, Google
- Robert Gray, Stanford
2-3. Classification: the problem, and solutions
4-6. Model the class conditionals as Gaussians (LDA, QDA, GMM)
[Figure: two-class scatter plot (X's and O's) in a 2-D feature space (Feature 1 vs. Feature 2), with an unlabeled test point marked "?".]
7. Fit a model of the decision boundary (neural nets, SVMs, decision trees)
[Figure: the same two-class scatter plot, with a decision boundary separating the classes.]
8. 1-Nearest Neighbor Classifier
[Figure: the same two-class scatter plot; the test point "?" takes the label of its nearest neighbor.]
9. Weighted Near-Neighbor Classifiers
[Figure: kernel weight plotted against the feature axis.]
10-17. Why use nonparametric neighborhood methods?
- intuitive: base decisions on similar examples
- no assumed model, explicit use of the data
- most are consistent: will converge to E[Y|X]
- little or no training needed
- easy to add training data
- soft decisions: P(Y|X)
- classification or estimation
- adaptable speed vs. accuracy trade-off
- 10-fold CV on 35 benchmark datasets (Lam et al., IEEE Trans. PAMI 2006): SVM (polynomial kernel) beat kNN on 18/35 datasets
18. Bias, or failing on average
[Figure: P(Y=1|X) and P(Y=2|X) plotted as functions of the feature.]
19. Bias, or failing on average, in 2D
Class 1 (crosses): each feature drawn iid from Normal(0, I).
Class 2 (circles): each feature drawn iid from Normal(0, 4I).
[Figure: scatter plot of the two classes with the Bayes decision boundary.]
20. Bias from the training sample distribution
- Symmetric kernels can have big bias problems.
- A training point's value depends on what other training points we have.
[Figure: a test point X surrounded by training samples labeled C and N, with the two labels unevenly distributed around it.]
21. Weight the data to balance around the test point
[Figure: 1-D example in which the two neighbors receive weights 2/3 and 1/3, chosen to satisfy the linear interpolation equations written out below.]
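Written out, the linear interpolation equations referred to here (in the notation of the later slides, with test point g and nearest neighbors x_1, ..., x_k) require the weights to express g as a convex combination of its neighbors:
\[
\sum_{j=1}^{k} w_j x_j = g, \qquad \sum_{j=1}^{k} w_j = 1, \qquad w_j \ge 0 .
\]
In d feature dimensions this gives d + 1 equality constraints on the k weights.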
22-23. Linear interpolation weights as kernel weights
Problem! In 2-D we can use 3 neighbors, but more training points are informative.
[Figure: a test point X surrounded by training samples labeled C and N; exact interpolation uses only a few of them.]
24-25. k neighbors where k > d + 1
d + 1 equations, k weights to find, k > d + 1: more variables than constraints, an underdetermined system of equations with many solutions.
Bonus! We can add more criteria for the weight solution.
26. Example of multiple solutions (one feature dimension)
Find w1, w2, w3 to solve the interpolation equations.
Solutions: (2/3, 1/3, 0), (2/3, 0, 1/3), (2/3, 1/6, 1/6), etc.
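As a concrete instance (the slide's own numbers are not shown, so the positions below are an assumption chosen to reproduce the listed solutions): put the neighbors at x_1 = 0 and x_2 = x_3 = 3 on the feature axis, and the test point at g = 1. The interpolation equations
\[
w_1 x_1 + w_2 x_2 + w_3 x_3 = g, \qquad w_1 + w_2 + w_3 = 1, \qquad w_j \ge 0
\]
become 3(w_2 + w_3) = 1 and w_1 + w_2 + w_3 = 1, so w_1 = 2/3 and w_2 + w_3 = 1/3, which every one of the listed solutions satisfies.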
27. Maximize solution entropy to reduce variance
Define a unique solution by using another criterion: Jaynes' Principle of Maximum Entropy.
- Weights training samples as equally as possible.
- Pushes towards the uniform distribution, keeping variance low.
28. Linear Interpolation with Maximum Entropy (Gupta, Gray, Olshen, IEEE Trans. PAMI, 2006)
29. Linear Interpolation with Maximum Entropy (LIME)
The LIME objective trades off trying to solve the linear interpolation equations against maximizing the entropy of the weights, with a trade-off parameter controlling the balance.
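One way to write this down (a sketch consistent with the slide's three annotations, not necessarily the paper's exact notation; lambda > 0 is the trade-off parameter and H is the Shannon entropy of the weight vector):
\[
w^{*} = \arg\min_{w \ge 0,\; \sum_j w_j = 1} \; \Big\| \sum_{j=1}^{k} w_j x_j - g \Big\|^2 \;-\; \lambda \, H(w),
\qquad H(w) = -\sum_{j=1}^{k} w_j \log w_j .
\]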
30. How do we solve for the weights?
- separable convex objective function
- any convex optimizer will do
- we use the MOSEK solver and the AMPL mathematical programming language
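To illustrate how small this optimization is, here is a minimal Python sketch that solves the objective written above with a general-purpose solver (scipy's SLSQP in place of MOSEK/AMPL); the function name and defaults are illustrative, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def lime_weights(X, g, lam=1.0, eps=1e-12):
    """Sketch: LIME-style weights for the k x d neighbor matrix X and test point g.

    Minimizes ||X^T w - g||^2 - lam * H(w) over the probability simplex,
    where H(w) is the Shannon entropy of the weight vector w.
    """
    k = X.shape[0]

    def objective(w):
        fit = np.sum((X.T @ w - g) ** 2)        # linear-interpolation residual
        entropy = -np.sum(w * np.log(w + eps))  # entropy of the weights
        return fit - lam * entropy

    w0 = np.full(k, 1.0 / k)                    # start from the uniform weighting
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * k
    result = minimize(objective, w0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x
```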
31. Hypersphere simulation (Kohonen)
32. 20-D feature space (2000 training samples, 1000 test samples)
[Plot: classification error vs. number of neighbors in the neighborhood, for k-NN, a tricube kernel, and LIME, with the Bayes risk (theoretical best) as a reference line.]
33-34. Performance as feature space dimension grows
100 training points, 10,000 test points. With k = 75 neighbors, is the method still local?
35. Error rates on benchmark datasets (LOOCV)

          Iris   Glass   Iono   Wine   Pima
kNN       2.7    27.6    13.4   1.7    24.2
Tricube   4.0    27.1    11.7   2.2    24.6
LIME      2.0    25.7     6.8   1.1    22.1
36. Classification example: finding problems in gas pipelines
[Photos: a gas pipeline and the Norsk Electro Optikk Optopig inspection tool.]
37. Inside natural gas pipelines (data courtesy of Norsk Electro-Optikk)
Classify a test image based on labelled training data.
[Images: example classes such as normal, magnetic flux linkage mark, corrosion blisters, and weld cavity, plus an unlabeled test image marked "???".]
38. Pipeline image classification
96 x 128 images
- 22 features that match what humans see
- gray-level co-occurrence matrix features
- local statistics
- global statistics
39. Pipeline classification results (O'Brien et al. 2003, Gupta et al. 2006)
Images described as feature vectors; 12 classes (normal, weld, etc.); misclassification costs vary; 228 labeled training samples, cross-validation.
40. Other real application results
- Quality assessment of predicted protein structures (Cazzanti, Gupta, Malmstrom, Baker, IEEE Workshop on Machine Learning for Signal Processing, 2005): LIME does slightly better than GMM, much better than GLM.
- Classification of military signals from multi-channel audio (Smith and Atlas, 2006): LIME does 10% better than GMM.
(image from www.plig.net)
41. Some nice theoretical results
- Law of large numbers for LIME-weighted random variables: additive noise averages out as the number of training samples increases.
- Consistency: more data leads to the right answer.
42. Linear Interpolation with Maximum Entropy (Gupta, Gray, Olshen, IEEE Trans. PAMI, 2006)
43. LIME creates a data-adaptive exponential kernel (Friedlander and Gupta, IEEE Trans. Information Theory, 2006)
[Plot: weight vs. feature dimension, showing the effective kernel.]
44. End of Act I (weights)
45-46. How do you choose near-neighbors for local learning?
Choose a neighborhood that encloses the test point.
[Figure: training points (X's) in a 2-D feature space (Feature 1 vs. Feature 2) with a test point marked "?".]
47-48. Enclosing k-NN neighborhood
The neighborhood includes the k nearest neighbors, where the kth neighbor is the first such that the test point g is enclosed in the convex hull of the k nearest neighbors.
(If the test sample can't be enclosed in the convex hull of all the training points, choose k to be the last neighbor to decrease the distance to enclosure.)
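A minimal sketch of this rule, assuming Euclidean distance and using a linear-programming feasibility check for "g is in the convex hull of the neighbors"; the fall-back branch simply returns all points rather than implementing the distance-to-enclosure criterion from the parenthetical.

```python
import numpy as np
from scipy.optimize import linprog

def enclosing_knn(X, g):
    """Sketch: indices of the enclosing k-NN neighborhood of test point g.

    Grows k until g lies in the convex hull of its k nearest neighbors
    (rows of X). If g cannot be enclosed at all, returns all points.
    """
    order = np.argsort(np.linalg.norm(X - g, axis=1))    # neighbors sorted by distance
    d = X.shape[1]
    for k in range(d + 1, len(order) + 1):               # need >= d+1 points in general position
        nbrs = X[order[:k]]
        # g is in conv(nbrs) iff  w >= 0, sum(w) = 1, nbrs^T w = g  is feasible.
        A_eq = np.vstack([nbrs.T, np.ones((1, k))])
        b_eq = np.append(g, 1.0)
        res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * k, method="highs")
        if res.status == 0:                              # feasible: g is enclosed
            return order[:k]
    return order                                         # fallback: g cannot be enclosed
```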
49. Expected size of enclosing k-NN: 1-D case
Assume training points drawn uniformly in R^1. Each neighbor o_k falls on either side of the test point x with Pr(o_k > x) = Pr(o_k < x) = 1/2, so the number of additional neighbors needed until both sides are covered is Geometric(p = 1/2): E[k] = 2, and the expected neighborhood size is E[n] = 1 + 2 = 3.
50. Expected size of enclosing k-NN
Given neighbors drawn uniformly around the test point in a d-dimensional feature space, E[#neighbors] = 2d + 1 as the number of training samples n goes to infinity.
51-52. Related work: Sibson's natural neighbors
Proposed by Sibson (1981) for interpolation. Good for more general local learning?
The neighborhood is any training sample whose Voronoi cell touches the Voronoi cell of the test sample.
Proposed: natural neighbors inclusive, i.e. all training samples as close as the furthest natural neighbor.
53. Color management: color is hard
Problem: how do you get a printer to print colors correctly?
[Images: the same picture on-screen and as printed by the printer.]
54-56. Learning LUTs for color management
Pipeline: a 24-bit RGB color patch (8 bits per channel; a device-dependent color description) is sent to the color printer; the printed color patch is seen by the human eye, and its CIELab values are measured (a device-independent color description).
Goal: print a given CIELab value. Problem: what RGB value to input?
57. Color management, Step 1
Print an image of RGB patches and measure the output CIELab values.
58. Color management, Step 2
Given measured data of (RGB, CIELab) pairs, estimate a regular 3D grid of CIELab points and the corresponding RGB values.
[Diagram: various measured CIELab values mapped to a regular 3D grid in CIELab space, with grid points such as (100,-50,-50), (75,-50,-25), (50,-50,0).]
59. The 3D grid maps desired CIELab colors to input RGB

DESIRED CIELab     ESTIMATED RGB INPUT
(100,-50,-50)      (56,156,182)
(100,-50,-25)      (78,174,98)
(100,-50,0)        (84,188,81)
(75,-50,-50)       (35,104,99)
(75,-50,-25)       (67,113,63)
(75,-50,0)         (88,142,24)
(50,-50,-50)       (14,82,85)
(50,-50,-25)       (53,96,58)
(50,-50,0)         (81,103,23)
60. Color management, Step 3
Desired CIELab colors not on the grid are interpolated to determine the best input RGB. For example, a desired off-grid value X = (83,-50,-18) is interpolated from the surrounding entries of the grid on the previous slide.
61. Color management, Step 3 (continued)
[Figure: each grid point corresponds to a desired Lab color and the estimated RGB color to send to the printer.]
62-63. Color management summary
- Step 1: input RGB patches and measure CIELab values.
- Step 2: estimate RGB inputs corresponding to a 3D CIELab grid.
- Step 3: given a desired CIELab color, interpolate the 3D grid for the best RGB input.
Bala showed that local linear regression worked well with k = 15 neighbors (Digital Imaging Handbook, 2003). We compared local linear and local ridge regression for k = 15, enclosing k-NN, and natural neighbors (and natural neighbors inclusive); a local ridge regression sketch follows below.
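A minimal sketch of local ridge regression in this setting, assuming plain Euclidean nearest neighbors and an affine fit; the function name, k, and the ridge parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_ridge_predict(X, Y, x, k=15, lam=1e-3):
    """Sketch: local ridge regression prediction at x.

    Fits an affine map from the k nearest neighbors of x (rows of X)
    to their targets Y (e.g. CIELab -> RGB), with a ridge penalty on the
    slope, then applies the map to x.
    """
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]  # k nearest neighbors
    A = np.hstack([X[idx], np.ones((k, 1))])             # affine design matrix
    reg = lam * np.eye(A.shape[1])
    reg[-1, -1] = 0.0                                    # do not penalize the offset term
    B = np.linalg.solve(A.T @ A + reg, A.T @ Y[idx])     # ridge normal equations
    return np.append(x, 1.0) @ B
```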
64-65. Linear regression; more robust: ridge regression
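For reference, the standard local least-squares fit and its ridge-regularized version (the forms these slides presumably show; lambda is the ridge parameter) are:
\[
(\hat\beta, \hat\beta_0) = \arg\min_{\beta,\,\beta_0} \sum_{j=1}^{k} \big(y_j - \beta^\top x_j - \beta_0\big)^2
\qquad \text{(local linear)},
\]
\[
(\hat\beta, \hat\beta_0) = \arg\min_{\beta,\,\beta_0} \sum_{j=1}^{k} \big(y_j - \beta^\top x_j - \beta_0\big)^2 + \lambda \|\beta\|^2
\qquad \text{(local ridge)},
\]
where the ridge penalty shrinks the fitted slope and stabilizes the estimate when the neighborhood is small or nearly collinear.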
66. Example color management results (Epson Stylus 2200, 729 test patches, local ridge regression)
67. Enclosing neighborhoods
- Adaptive: no need to train for the neighborhood size.
- Max-variance optimal: can prove a bound on the estimation variance given an enclosing neighborhood.
- Results: good results for color management regression and a related color-enhancement regression problem.
68-71. Some final words
- Nearest-neighbors is competitive.
- Distance-decay weights are not as compelling as they sound; LIME weighted nearest-neighbors works consistently well.
- Enclosing neighborhoods are easy and useful in low dimensions.
- Current research: provably optimal weights, adaptive neighborhoods for classification, similarity-based classification.
72. Key paper on LIME: Gupta, Gray, Olshen, IEEE Trans. PAMI, 2006.
Paper on enclosing neighborhoods: Gupta, IEEE Intl. Conf. on Image Proc., 2005.
For related research, see idl.ee.washington.edu or email gupta_at_ee.washington.edu
73-74. Likelihood of seeing m neighbors of class 1 out of k
Near-neighbor model: a test point's near neighbors are drawn from the same class probability distribution as the test point.
[Figure: a test point "?" with near neighbors, mostly of class X and one of class O.]
75. Near-neighbor learning: maximum likelihood estimates (sketched below)
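Under the near-neighbor model above, if m of the k neighbors are of class 1, the maximum likelihood estimate of the class probability is the empirical fraction; with neighbor weights it is the weighted fraction (written here under the assumption that the weights sum to one; the slide's own expressions are not reproduced):
\[
\hat{P}_{\mathrm{ML}}(Y = 1) = \frac{m}{k},
\qquad
\hat{P}_{\mathrm{ML,\,weighted}}(Y = 1) = \sum_{j :\, y_j = 1} w_j .
\]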
76-79. Maximum likelihood estimates are not robust (unbiased, but high-variance)
Example 1: you flip a coin one time and get heads. Estimate the probability of tails, P(T)? The ML estimate is P(T) = 0.
Example 2: use 1-NN to estimate P(T). The ML estimate is P(Class = Tails) = 0.
[Figure: a test point X whose near neighbors are labeled H and T; the single nearest neighbor is H.]
Given weights on the neighbors, is there a more robust way to estimate the probability of each class and classify? Proposed: Bayesian minimum expected risk for near-neighbor classification (Gupta, Srivastava, Cazzanti, IEEE SSP 2005).
80-83. Minimum Expected Risk Estimation (MER)
Given data T, weight each possible pmf by the likelihood of the data: the probability of pmf p is P(p | T) ∝ P(T | p) (the likelihood).
[Plot: the likelihood of P(tails) given one tail out of 10 flips, as a function of P(tails) over [0, 1].]
Let the cost of guessing that the pmf is p, when the truth is q, be D(p, q), e.g. relative entropy or mean-squared error.
Guess the pmf p that minimizes the expected cost. Guess how you are going to judge!
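Written out, the rule just described takes the standard form (the expectation is over candidate pmfs q weighted by their posterior probability given the data T):
\[
p^{*} = \arg\min_{p} \; \mathbb{E}\big[ D(p, q) \mid T \big]
      = \arg\min_{p} \int D(p, q)\, P(q \mid T)\, dq .
\]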
84. Ten near-neighbor example
One neighbor is tails, 9 are heads. What's P(X is tails)?
[Likelihood plot over [0, 1], marking the ML estimate at 0.1 (unbiased) and the MER estimate (biased, but lower variance).]
85. MER for near-neighbors (Gupta, Cazzanti, Srivastava, IEEE SSP 2005; submitted for journal publication)
Given m tails out of k neighbors, the slide gives closed-form MER estimates of P(tails) for k-NN and for weighted k-NN (a standard form is sketched below).
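For the unweighted case, a standard closed form of this kind arises under a uniform prior over pmfs and a squared-error cost (an assumption here; the paper's exact expressions, and the weighted version, may differ): the expected-cost minimizer is the posterior mean, i.e. the Laplace-corrected estimate
\[
\hat{P}_{\mathrm{MER}}(\text{tails}) = \mathbb{E}\big[\theta \mid m \text{ tails out of } k\big] = \frac{m+1}{k+2},
\]
which for the example above (m = 1, k = 10) gives 1/6 rather than the ML value 0.1.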
86. How much better is MER than ML?
PMF estimate with 100 training / 100 test samples on the 3-D Kohonen simulation.
[Plot: mean-squared error of the class pmf vs. number of neighbors k; marker shapes are the maximum likelihood estimate, lines are the Bayesian MER estimate.]
87. Classifying with MER
88-91. Minimum Expected Cost Classification (the decision rule is sketched below)
Surprise! Why?
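The generic minimum-expected-cost decision rule being applied here, with C(c, c') the cost of deciding class c when the truth is c', and P-hat the class pmf estimated from the neighbors (this is the standard rule; the slides' specific derivation behind the "Surprise!" is not reproduced):
\[
\hat{y} = \arg\min_{c} \sum_{c'} C(c, c')\, \hat{P}(Y = c' \mid x).
\]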
92. MER classification results
Classification with 1000 training / 1000 test and 50,000 validation samples on the 4-D Kohonen simulation.
[Chart: cost of ML-based vs. MER-based k-NN classification, under equal and unequal misclassification costs.]