Title: Some recent advances in near-neighbor learning
1. Some recent advances in near-neighbor learning
- Maya R. Gupta, University of Washington
- Eric Garcia, University of Washington
- Michael Friedlander, University of British Columbia
- William Mortenson, University of Washington
- Richard Olshen, Stanford
- Andrey Stroilov, Google
- Robert Gray, Stanford
2-3. Classification: the problem, and solutions
4-6. Model the class conditionals as Gaussians (LDA, QDA, GMM)
[Figure: two-class scatter plot (X's and O's) in a 2-D feature space (Feature 1 vs. Feature 2), with an unlabeled test point marked "?".]
7. Fit a model of the decision boundary (neural nets, SVMs, decision trees)
[Figure: the same two-class scatter plot, with a decision boundary separating the classes.]
8. 1-Nearest Neighbor Classifier
[Figure: the same two-class scatter plot; the test point "?" takes the label of its nearest neighbor.]
9. Weighted Near-Neighbor Classifiers
[Figure: kernel weight plotted against the feature axis.]
10-17. Why use nonparametric neighborhood methods?
- intuitive: base decisions on similar examples
- no assumed model, explicit use of the data
- most are consistent: will converge to E[Y|X]
- little or no training needed
- easy to add training data
- soft decisions: P(Y|X)
- classification or estimation
- adaptable speed vs. accuracy trade-off
- 10-fold CV on 35 benchmark datasets (Lam et al., IEEE Trans. PAMI 2006): SVM (polynomial kernel) beat kNN on 18/35 datasets
18. Bias, or failing on average
[Figure: P(Y=1|X) and P(Y=2|X) plotted as functions of the feature.]
19. Bias, or failing on average, in 2D
Class 1 (crosses): each feature drawn iid from Normal(0, I).
Class 2 (circles): each feature drawn iid from Normal(0, 4I).
[Figure: scatter plot of the two classes with the Bayes decision boundary.]
20. Bias from the training sample distribution
- Symmetric kernels can have big bias problems.
- A training point's value depends on what other training points we have.
[Figure: a test point X surrounded by training samples labeled C and N, with the two labels unevenly distributed around it.]
21. Weight the data to balance around the test point
[Figure: 1-D example in which the two neighbors receive weights 2/3 and 1/3, chosen to satisfy the linear interpolation equations written out below.]
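Written out, the linear interpolation equations referred to here (in the notation of the later slides, with test point g and nearest neighbors x_1, ..., x_k) require the weights to express g as a convex combination of its neighbors:
\[
\sum_{j=1}^{k} w_j x_j = g, \qquad \sum_{j=1}^{k} w_j = 1, \qquad w_j \ge 0 .
\]
In d feature dimensions this gives d + 1 equality constraints on the k weights.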
22-23. Linear interpolation weights as kernel weights
Problem! In 2-D we can use 3 neighbors, but more training points are informative.
[Figure: a test point X surrounded by training samples labeled C and N; exact interpolation uses only a few of them.]
24-25. k neighbors where k > d + 1
d + 1 equations, k weights to find, k > d + 1: more variables than constraints, an underdetermined system of equations with many solutions.
Bonus! We can add more criteria for the weight solution.
26. Example of multiple solutions (one feature dimension)
Find w1, w2, w3 to solve the interpolation equations.
Solutions: (2/3, 1/3, 0), (2/3, 0, 1/3), (2/3, 1/6, 1/6), etc.
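As a concrete instance (the slide's own numbers are not shown, so the positions below are an assumption chosen to reproduce the listed solutions): put the neighbors at x_1 = 0 and x_2 = x_3 = 3 on the feature axis, and the test point at g = 1. The interpolation equations
\[
w_1 x_1 + w_2 x_2 + w_3 x_3 = g, \qquad w_1 + w_2 + w_3 = 1, \qquad w_j \ge 0
\]
become 3(w_2 + w_3) = 1 and w_1 + w_2 + w_3 = 1, so w_1 = 2/3 and w_2 + w_3 = 1/3, which every one of the listed solutions satisfies.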
27. Maximize solution entropy to reduce variance
Define a unique solution by using another criterion: Jaynes' Principle of Maximum Entropy.
- Weights training samples as equally as possible.
- Pushes towards the uniform distribution, keeping variance low.
28. Linear Interpolation with Maximum Entropy (Gupta, Gray, Olshen, IEEE Trans. PAMI, 2006)
29. Linear Interpolation with Maximum Entropy (LIME)
The LIME objective trades off trying to solve the linear interpolation equations against maximizing the entropy of the weights, with a trade-off parameter controlling the balance.
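One way to write this down (a sketch consistent with the slide's three annotations, not necessarily the paper's exact notation; lambda > 0 is the trade-off parameter and H is the Shannon entropy of the weight vector):
\[
w^{*} = \arg\min_{w \ge 0,\; \sum_j w_j = 1} \; \Big\| \sum_{j=1}^{k} w_j x_j - g \Big\|^2 \;-\; \lambda \, H(w),
\qquad H(w) = -\sum_{j=1}^{k} w_j \log w_j .
\]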
30. How do we solve for the weights?
- separable convex objective function
- any convex optimizer will do
- we use the MOSEK solver and the AMPL mathematical programming language
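To illustrate how small this optimization is, here is a minimal Python sketch that solves the objective written above with a general-purpose solver (scipy's SLSQP in place of MOSEK/AMPL); the function name and defaults are illustrative, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def lime_weights(X, g, lam=1.0, eps=1e-12):
    """Sketch: LIME-style weights for the k x d neighbor matrix X and test point g.

    Minimizes ||X^T w - g||^2 - lam * H(w) over the probability simplex,
    where H(w) is the Shannon entropy of the weight vector w.
    """
    k = X.shape[0]

    def objective(w):
        fit = np.sum((X.T @ w - g) ** 2)        # linear-interpolation residual
        entropy = -np.sum(w * np.log(w + eps))  # entropy of the weights
        return fit - lam * entropy

    w0 = np.full(k, 1.0 / k)                    # start from the uniform weighting
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * k
    result = minimize(objective, w0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x
```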
31. Hypersphere simulation (Kohonen)
32. 20-D feature space (2000 training samples, 1000 test samples)
[Plot: classification error vs. number of neighbors in the neighborhood, for k-NN, a tricube kernel, and LIME, with the Bayes risk (theoretical best) as a reference line.]
33-34. Performance as feature space dimension grows
100 training points, 10,000 test points. With k = 75 neighbors, is the method still local?
35. Error rates on benchmark datasets (LOOCV)

          Iris   Glass   Iono   Wine   Pima
kNN       2.7    27.6    13.4   1.7    24.2
Tricube   4.0    27.1    11.7   2.2    24.6
LIME      2.0    25.7     6.8   1.1    22.1
36. Classification example: finding problems in gas pipelines
[Photos: a gas pipeline and the Norsk Electro Optikk Optopig inspection tool.]
37. Inside natural gas pipelines (data courtesy of Norsk Electro-Optikk)
Classify a test image based on labelled training data.
[Images: example classes such as normal, magnetic flux linkage mark, corrosion blisters, and weld cavity, plus an unlabeled test image marked "???".]
38. Pipeline image classification
96 x 128 images
- 22 features that match what humans see
- gray-level co-occurrence matrix features
- local statistics
- global statistics
39. Pipeline classification results (O'Brien et al. 2003, Gupta et al. 2006)
Images described as feature vectors; 12 classes (normal, weld, etc.); misclassification costs vary; 228 labeled training samples, cross-validation.
40. Other real application results
- Quality assessment of predicted protein structures (Cazzanti, Gupta, Malmstrom, Baker, IEEE Workshop on Machine Learning for Signal Processing, 2005): LIME does slightly better than GMM, much better than GLM.
- Classification of military signals from multi-channel audio (Smith and Atlas, 2006): LIME does 10% better than GMM.
(image from www.plig.net)
41. Some nice theoretical results
- Law of large numbers for LIME-weighted random variables: additive noise averages out as the number of training samples increases.
- Consistency: more data leads to the right answer.
42. Linear Interpolation with Maximum Entropy (Gupta, Gray, Olshen, IEEE Trans. PAMI, 2006)
43. LIME creates a data-adaptive exponential kernel (Friedlander and Gupta, IEEE Trans. Information Theory, 2006)
[Plot: weight vs. feature dimension, showing the effective kernel.]
44. End of Act I (weights)
45-46. How do you choose near-neighbors for local learning?
Choose a neighborhood that encloses the test point.
[Figure: training points (X's) in a 2-D feature space (Feature 1 vs. Feature 2) with a test point marked "?".]
47-48. Enclosing k-NN neighborhood
The neighborhood includes the k nearest neighbors, where the kth neighbor is the first such that the test point g is enclosed in the convex hull of the k nearest neighbors.
(If the test sample can't be enclosed in the convex hull of all the training points, choose k to be the last neighbor to decrease the distance to enclosure.)
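A minimal sketch of this rule, assuming Euclidean distance and using a linear-programming feasibility check for "g is in the convex hull of the neighbors"; the fall-back branch simply returns all points rather than implementing the distance-to-enclosure criterion from the parenthetical.

```python
import numpy as np
from scipy.optimize import linprog

def enclosing_knn(X, g):
    """Sketch: indices of the enclosing k-NN neighborhood of test point g.

    Grows k until g lies in the convex hull of its k nearest neighbors
    (rows of X). If g cannot be enclosed at all, returns all points.
    """
    order = np.argsort(np.linalg.norm(X - g, axis=1))    # neighbors sorted by distance
    d = X.shape[1]
    for k in range(d + 1, len(order) + 1):               # need >= d+1 points in general position
        nbrs = X[order[:k]]
        # g is in conv(nbrs) iff  w >= 0, sum(w) = 1, nbrs^T w = g  is feasible.
        A_eq = np.vstack([nbrs.T, np.ones((1, k))])
        b_eq = np.append(g, 1.0)
        res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * k, method="highs")
        if res.status == 0:                              # feasible: g is enclosed
            return order[:k]
    return order                                         # fallback: g cannot be enclosed
```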
49. Expected size of enclosing k-NN: 1-D case
Assume training points drawn uniformly in R^1. Each neighbor o_k falls on either side of the test point x with Pr(o_k > x) = Pr(o_k < x) = 1/2, so the number of additional neighbors needed until both sides are covered is Geometric(p = 1/2): E[k] = 2, and the expected neighborhood size is E[n] = 1 + 2 = 3.
50. Expected size of enclosing k-NN
Given neighbors drawn uniformly around the test point in a d-dimensional feature space, E[#neighbors] = 2d + 1 as the number of training samples n goes to infinity.
51-52. Related work: Sibson's natural neighbors
Proposed by Sibson (1981) for interpolation. Good for more general local learning?
The neighborhood is any training sample whose Voronoi cell touches the Voronoi cell of the test sample.
Proposed: natural neighbors inclusive, i.e. all training samples as close as the furthest natural neighbor.
53. Color management: color is hard
Problem: how do you get a printer to print colors correctly?
[Images: the same picture on-screen and as printed by the printer.]
54-56. Learning LUTs for color management
Pipeline: a 24-bit RGB color patch (8 bits per channel; a device-dependent color description) is sent to the color printer; the printed color patch is seen by the human eye, and its CIELab values are measured (a device-independent color description).
Goal: print a given CIELab value. Problem: what RGB value to input?
57. Color management, Step 1
Print an image of RGB patches and measure the output CIELab values.
58. Color management, Step 2
Given measured data of (RGB, CIELab) pairs, estimate a regular 3D grid of CIELab points and the corresponding RGB values.
[Diagram: various measured CIELab values mapped to a regular 3D grid in CIELab space, with grid points such as (100,-50,-50), (75,-50,-25), (50,-50,0).]
59. The 3D grid maps desired CIELab colors to input RGB

DESIRED CIELab     ESTIMATED RGB INPUT
(100,-50,-50)      (56,156,182)
(100,-50,-25)      (78,174,98)
(100,-50,0)        (84,188,81)
(75,-50,-50)       (35,104,99)
(75,-50,-25)       (67,113,63)
(75,-50,0)         (88,142,24)
(50,-50,-50)       (14,82,85)
(50,-50,-25)       (53,96,58)
(50,-50,0)         (81,103,23)
60. Color management, Step 3
Desired CIELab colors not on the grid are interpolated to determine the best input RGB. For example, a desired off-grid value X = (83,-50,-18) is interpolated from the surrounding entries of the grid on the previous slide.
61. Color management, Step 3 (continued)
[Figure: each grid point corresponds to a desired Lab color and the estimated RGB color to send to the printer.]
62-63. Color management summary
- Step 1: input RGB patches and measure CIELab values.
- Step 2: estimate RGB inputs corresponding to a 3D CIELab grid.
- Step 3: given a desired CIELab color, interpolate the 3D grid for the best RGB input.
Bala showed that local linear regression worked well with k = 15 neighbors (Digital Imaging Handbook, 2003). We compared local linear and local ridge regression for k = 15, enclosing k-NN, and natural neighbors (and natural neighbors inclusive); a local ridge regression sketch follows below.
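A minimal sketch of local ridge regression in this setting, assuming plain Euclidean nearest neighbors and an affine fit; the function name, k, and the ridge parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_ridge_predict(X, Y, x, k=15, lam=1e-3):
    """Sketch: local ridge regression prediction at x.

    Fits an affine map from the k nearest neighbors of x (rows of X)
    to their targets Y (e.g. CIELab -> RGB), with a ridge penalty on the
    slope, then applies the map to x.
    """
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]  # k nearest neighbors
    A = np.hstack([X[idx], np.ones((k, 1))])             # affine design matrix
    reg = lam * np.eye(A.shape[1])
    reg[-1, -1] = 0.0                                    # do not penalize the offset term
    B = np.linalg.solve(A.T @ A + reg, A.T @ Y[idx])     # ridge normal equations
    return np.append(x, 1.0) @ B
```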
64-65. Linear regression; more robust: ridge regression
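For reference, the standard local least-squares fit and its ridge-regularized version (the forms these slides presumably show; lambda is the ridge parameter) are:
\[
(\hat\beta, \hat\beta_0) = \arg\min_{\beta,\,\beta_0} \sum_{j=1}^{k} \big(y_j - \beta^\top x_j - \beta_0\big)^2
\qquad \text{(local linear)},
\]
\[
(\hat\beta, \hat\beta_0) = \arg\min_{\beta,\,\beta_0} \sum_{j=1}^{k} \big(y_j - \beta^\top x_j - \beta_0\big)^2 + \lambda \|\beta\|^2
\qquad \text{(local ridge)},
\]
where the ridge penalty shrinks the fitted slope and stabilizes the estimate when the neighborhood is small or nearly collinear.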
66. Example color management results (Epson Stylus 2200, 729 test patches, local ridge regression)
67. Enclosing neighborhoods
- Adaptive: no need to train for the neighborhood size.
- Max-variance optimal: can prove a bound on the estimation variance given an enclosing neighborhood.
- Results: good results for color management regression and a related color-enhancement regression problem.
68-71. Some final words
- Nearest-neighbors is competitive.
- Distance-decay weights are not as compelling as they sound; LIME weighted nearest-neighbors works consistently well.
- Enclosing neighborhoods are easy and useful in low dimensions.
- Current research: provably optimal weights, adaptive neighborhoods for classification, similarity-based classification.
72. Key paper on LIME: Gupta, Gray, Olshen, IEEE Trans. PAMI, 2006.
Paper on enclosing neighborhoods: Gupta, IEEE Intl. Conf. on Image Proc., 2005.
For related research, see idl.ee.washington.edu or email gupta_at_ee.washington.edu
73-74. Likelihood of seeing m neighbors of class 1 out of k
Near-neighbor model: a test point's near neighbors are drawn from the same class probability distribution as the test point.
[Figure: a test point "?" with near neighbors, mostly of class X and one of class O.]
75. Near-neighbor learning: maximum likelihood estimates (sketched below)
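Under the near-neighbor model above, if m of the k neighbors are of class 1, the maximum likelihood estimate of the class probability is the empirical fraction; with neighbor weights it is the weighted fraction (written here under the assumption that the weights sum to one; the slide's own expressions are not reproduced):
\[
\hat{P}_{\mathrm{ML}}(Y = 1) = \frac{m}{k},
\qquad
\hat{P}_{\mathrm{ML,\,weighted}}(Y = 1) = \sum_{j :\, y_j = 1} w_j .
\]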
76-79. Maximum likelihood estimates are not robust (unbiased, but high-variance)
Example 1: you flip a coin one time and get heads. Estimate the probability of tails, P(T)? The ML estimate is P(T) = 0.
Example 2: use 1-NN to estimate P(T). The ML estimate is P(Class = Tails) = 0.
[Figure: a test point X whose near neighbors are labeled H and T; the single nearest neighbor is H.]
Given weights on the neighbors, is there a more robust way to estimate the probability of each class and classify? Proposed: Bayesian minimum expected risk for near-neighbor classification (Gupta, Srivastava, Cazzanti, IEEE SSP 2005).
80-83. Minimum Expected Risk Estimation (MER)
Given data T, weight each possible pmf by the likelihood of the data: the probability of pmf p is P(p | T) ∝ P(T | p) (the likelihood).
[Plot: the likelihood of P(tails) given one tail out of 10 flips, as a function of P(tails) over [0, 1].]
Let the cost of guessing that the pmf is p, when the truth is q, be D(p, q), e.g. relative entropy or mean-squared error.
Guess the pmf p that minimizes the expected cost. Guess how you are going to judge!
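Written out, the rule just described takes the standard form (the expectation is over candidate pmfs q weighted by their posterior probability given the data T):
\[
p^{*} = \arg\min_{p} \; \mathbb{E}\big[ D(p, q) \mid T \big]
      = \arg\min_{p} \int D(p, q)\, P(q \mid T)\, dq .
\]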
84. Ten near-neighbor example
One neighbor is tails, 9 are heads. What's P(X is tails)?
[Likelihood plot over [0, 1], marking the ML estimate at 0.1 (unbiased) and the MER estimate (biased, but lower variance).]
85. MER for near-neighbors (Gupta, Cazzanti, Srivastava, IEEE SSP 2005; submitted for journal publication)
Given m tails out of k neighbors, the slide gives closed-form MER estimates of P(tails) for k-NN and for weighted k-NN (a standard form is sketched below).
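For the unweighted case, a standard closed form of this kind arises under a uniform prior over pmfs and a squared-error cost (an assumption here; the paper's exact expressions, and the weighted version, may differ): the expected-cost minimizer is the posterior mean, i.e. the Laplace-corrected estimate
\[
\hat{P}_{\mathrm{MER}}(\text{tails}) = \mathbb{E}\big[\theta \mid m \text{ tails out of } k\big] = \frac{m+1}{k+2},
\]
which for the example above (m = 1, k = 10) gives 1/6 rather than the ML value 0.1.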
86. How much better is MER than ML?
PMF estimate with 100 training / 100 test samples on the 3-D Kohonen simulation.
[Plot: mean-squared error of the class pmf vs. number of neighbors k; marker shapes are the maximum likelihood estimate, lines are the Bayesian MER estimate.]
87. Classifying with MER
88-91. Minimum Expected Cost Classification (the decision rule is sketched below)
Surprise! Why?
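The generic minimum-expected-cost decision rule being applied here, with C(c, c') the cost of deciding class c when the truth is c', and P-hat the class pmf estimated from the neighbors (this is the standard rule; the slides' specific derivation behind the "Surprise!" is not reproduced):
\[
\hat{y} = \arg\min_{c} \sum_{c'} C(c, c')\, \hat{P}(Y = c' \mid x).
\]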
92. MER classification results
Classification with 1000 training / 1000 test and 50,000 validation samples on the 4-D Kohonen simulation.
[Chart: cost of ML-based vs. MER-based k-NN classification, under equal and unequal misclassification costs.]