Title: A Survey on Distance Metric Learning (Part 1)
1. A Survey on Distance Metric Learning (Part 1)
- Gerry Tesauro
- IBM T.J. Watson Research Center
2. Acknowledgement
- Lecture material shamelessly adapted/stolen from the following sources:
- Kilian Weinberger: Survey on Distance Metric Learning slides; IBM summer intern talk slides (Aug. 2006)
- Sam Roweis: slides (NIPS 2006 workshop on Learning to Compare Examples)
- Yann LeCun: talk slides (NIPS 2006 workshop on Learning to Compare Examples)
3. Outline
Part 1
- Motivation and Basic Concepts
- ML tasks where it's useful to learn a distance metric
- Overview of Dimensionality Reduction
- Mahalanobis Metric Learning for Clustering with Side Info (Xing et al.)
- Pseudo-metric online learning (Shalev-Shwartz et al.)
- Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)
Part 2
- Metric Learning for Kernel Regression (Weinberger & Tesauro)
- Metric learning for RL basis function construction (Keller et al.)
- Similarity learning for image processing (LeCun et al.)
4. Motivation
- Many ML algorithms and tasks require a distance metric (equivalently, a dissimilarity metric)
- Clustering (e.g. k-means)
- Classification / regression
- Kernel methods
- Nearest neighbor methods
- Document/text retrieval
- Find the most similar fingerprints in a DB to a given sample
- Find the most similar web pages to a document/keywords
- Nonlinear dimensionality reduction methods
- Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, etc.
5. Motivation (2)
- Many problems may lack a well-defined, relevant distance metric
- Incommensurate features => Euclidean distance not meaningful
- Side information => Euclidean distance not relevant
- Learning distance metrics may thus be desirable
- A sensible similarity/distance metric may be highly task-dependent or semantics-dependent
- What do these data points mean?
- What are we using the data for?
6. Which images are most similar?
7. It depends ...
[image panels: centered / left / right]
8. It depends ...
[image panels: male / female]
9. ... what you are looking for
[image panels: student / professor]
10. ... what you are looking for
[image panels: nature background / plain background]
11. Key DML Concept: Mahalanobis distance metric
- The simplest mapping is a linear transformation x -> Lx
12. Mahalanobis distance metric
- The simplest mapping is a linear transformation x -> Lx, giving the squared distance D^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j) with M = L^T L
- Algorithms can learn either matrix, L or M (see the sketch below)
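A minimal numpy sketch of this relationship (illustrative only; the matrix L here is random, standing in for a learned transformation):

# Squared Mahalanobis distance D^2 = (xi - xj)^T M (xi - xj), with M = L^T L
# induced by a linear map x -> Lx.
import numpy as np

def mahalanobis_sq(xi, xj, M):
    d = xi - xj
    return float(d @ M @ d)

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 3))          # maps R^3 -> R^2 (stand-in for a learned L)
M = L.T @ L                          # induced PSD Mahalanobis matrix
xi, xj = rng.normal(size=3), rng.normal(size=3)

# The Mahalanobis distance equals the Euclidean distance after mapping x -> Lx.
assert np.isclose(mahalanobis_sq(xi, xj, M), np.sum((L @ xi - L @ xj) ** 2))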
13. >5 Minutes Introduction to Dimensionality Reduction
14. How can the dimensionality be reduced?
- eliminate redundant features
- eliminate irrelevant features
- extract low dimensional structure
15. Notation
Input: x_1, ..., x_n in R^d (high-dimensional)
Output: y_1, ..., y_n in R^r with r << d (low-dimensional embedding)
Embedding principle:
Nearby points remain nearby, distant points remain distant. Estimate r.
16. Two classes of DR algorithms
- Linear
- Non-linear
17. Linear dimensionality reduction
18. Principal Component Analysis
(Jolliffe 1986)
Project data into the subspace of maximum variance.
19. Optimization
Maximize the projected variance: max over w with ||w|| = 1 of w^T C w
20. Optimization
Eigenvalue solution: C w = lambda w (take the leading eigenvectors of C)
21. Facts about PCA
- Solution: eigenvectors of the covariance matrix C
- Minimizes the sum-of-squares reconstruction error
- Dimensionality r can be estimated from the eigenvalues of C
- PCA requires meaningful scaling of the input features
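A minimal PCA sketch in numpy, consistent with slides 18-21 (illustrative; the helper name pca and the random data are assumptions, not from the slides):

import numpy as np

def pca(X, r):
    # X: (n, d) data matrix; returns the r-dimensional projection and basis W
    Xc = X - X.mean(axis=0)                       # center the data
    C = (Xc.T @ Xc) / len(Xc)                     # covariance matrix
    evals, evecs = np.linalg.eigh(C)              # eigen-decomposition (ascending)
    W = evecs[:, np.argsort(evals)[::-1][:r]]     # top-r directions of max variance
    return Xc @ W, W

X = np.random.default_rng(0).normal(size=(100, 5))
Y, W = pca(X, r=2)     # r would be chosen/estimated from the eigenvalue spectrum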
22. Multidimensional Scaling (MDS)
23. Multidimensional Scaling (MDS)
24. Multidimensional Scaling (MDS)
Inner-product matrix
25. Multidimensional Scaling (MDS)
- Equivalent to PCA
- Uses the eigenvectors of the inner-product matrix
- Requires only pairwise distances
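A minimal classical-MDS sketch (assumed details; double centering converts squared pairwise distances into the inner-product matrix, whose top eigenvectors give the embedding):

import numpy as np

def classical_mds(D_sq, r):
    # D_sq: (n, n) squared pairwise distances; returns an (n, r) embedding
    n = len(D_sq)
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = -0.5 * J @ D_sq @ J                    # inner-product (Gram) matrix
    evals, evecs = np.linalg.eigh(G)
    idx = np.argsort(evals)[::-1][:r]          # top-r eigenpairs
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))

X = np.random.default_rng(0).normal(size=(50, 4))
D_sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Y = classical_mds(D_sq, r=2)                   # matches PCA up to rotation/sign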
26. Non-linear dimensionality reduction
27. Non-linear dimensionality reduction
28. From subspace to submanifold
We assume the data is sampled from some manifold with fewer underlying degrees of freedom. How can we find a faithful embedding?
29. Approximate manifold with neighborhood graph
30. Approximate manifold with neighborhood graph
31. Isomap
(Tenenbaum et al., 2000)
Geodesic distance
- Compute shortest paths between all inputs on the neighborhood graph
- Create the geodesic distance matrix
- Perform MDS with the geodesic distances (see the sketch below)
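A minimal Isomap sketch along those lines (assumed details: a k-NN graph, scipy's shortest_path for the geodesic distances, and classical MDS inlined; it presumes the neighborhood graph is connected):

import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, r=2, k=10):
    n = len(X)
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    W = np.full((n, n), np.inf)                # inf = no edge
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]         # k nearest neighbors of point i
        W[i, nn] = D[i, nn]
        W[nn, i] = D[i, nn]
    G = shortest_path(W, method="D", directed=False)   # geodesic distance matrix
    # classical MDS on the geodesic distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:r]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))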
32. Locally Linear Embedding (LLE)
(Roweis and Saul, 2000)
- Reconstruct each input as a weighted combination of its nearest neighbors
- Find a low-dimensional embedding that preserves those local weights
33. Maximum Variance Unfolding (MVU)
(Weinberger and Saul, 2004)
- Maximize pairwise distances
- Preserve local distances and angles
- Unfolding by semidefinite programming
34. Maximum Variance Unfolding (MVU)
(Weinberger and Saul, 2004)
35. Optimization problem
Unfold the data by maximizing pairwise distances, while preserving local distances.
36. Optimization problem
Center the output (translation invariance).
37. Optimization problem
Problem: the optimization over the output coordinates is non-convex (multiple local minima).
38. Optimization problem
Solution: change of notation to the inner-product matrix K of the outputs, giving a semidefinite program with a single global minimum (see the sketch below).
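A minimal MVU sketch under that change of notation (an illustration using the cvxpy SDP modeling library rather than the authors' implementation; the k-NN rule and solver defaults are assumptions):

import numpy as np
import cvxpy as cp

def mvu(X, r=2, k=4):
    n = len(X)
    D_sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = cp.Variable((n, n), PSD=True)              # inner-product matrix K_ij = y_i . y_j
    cons = [cp.sum(K) == 0]                        # center the output
    for i in range(n):
        for j in np.argsort(D_sq[i])[1:k + 1]:     # preserve local (k-NN) distances
            cons.append(K[i, i] - 2 * K[i, j] + K[j, j] == D_sq[i, j])
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()   # maximize pairwise spread
    evals, evecs = np.linalg.eigh(K.value)         # read the embedding off K
    idx = np.argsort(evals)[::-1][:r]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))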
39. Unfolding the Swiss roll
40. Mahalanobis Metric Learning for Clustering with Side Information (Xing et al. 2003)
- Exemplars x_i, i = 1, ..., N, plus two types of side info:
- Similar set S = {(x_i, x_j)} s.t. x_i and x_j are similar (e.g. same class)
- Dissimilar set D = {(x_i, x_j)} s.t. x_i and x_j are dissimilar
- Learn an optimal Mahalanobis matrix M
- D^2_ij = (x_i - x_j)^T M (x_i - x_j)   (global distance function)
- Goal: keep all pairs of similar points close, while separating all dissimilar pairs
- Formulate as a constrained convex programming problem
- Minimize the distance between the data pairs in S
- Subject to the data pairs in D being well separated (the formulation is written out below)
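Written out, the standard Xing et al. formulation is (a reconstruction; the exact rendering on the original slide is not preserved here):

\[
\min_{M \succeq 0} \;\; \sum_{(x_i, x_j) \in S} (x_i - x_j)^\top M (x_i - x_j)
\quad \text{s.t.} \quad
\sum_{(x_i, x_j) \in D} \sqrt{(x_i - x_j)^\top M (x_i - x_j)} \;\ge\; 1 .
\]

The square root on the dissimilar-pair constraint keeps the solution from collapsing to a rank-one M (projecting all data onto a single line).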
41. MMC-SI (cont'd)
- Objective of learning:
- M is positive semi-definite
- Ensures non-negativity and the triangle inequality of the metric
- The number of parameters is quadratic in the number of features
- Difficult to scale to a large number of features
- Significant danger of overfitting on small datasets
42. Mahalanobis Metric for Clustering (MMC-SI)
(Xing et al., NIPS 2002)
43. MMC-SI
Move similarly labeled inputs together.
44. MMC-SI
Move differently labeled inputs apart.
45. Convex optimization problem
46. Convex optimization problem
The target Mahalanobis matrix
47. Convex optimization problem
Pushing differently labeled inputs apart
48. Convex optimization problem
Pulling similar points together
49. Convex optimization problem
Ensuring positive semi-definiteness
50. Convex optimization problem
51. Two convex sets
- The set of all matrices that satisfy constraint 1
- The cone of PSD matrices
52. Convex optimization problem
- Convex objective
- Convex constraints
53. Gradient Alternating Projection
54. Gradient Alternating Projection
Take a step along the gradient.
55. Gradient Alternating Projection
Take a step along the gradient.
Project onto the constraint-satisfying subspace.
56. Gradient Alternating Projection
Take a step along the gradient.
Project onto the constraint-satisfying subspace.
Project onto the PSD cone.
57. Gradient Alternating Projection
Take a step along the gradient.
Project onto the constraint-satisfying subspace.
Project onto the PSD cone.
The algorithm is guaranteed to converge to the optimal solution (a sketch follows below).
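In numpy, the loop sketched on these slides might look like this (an illustrative reconstruction, not Xing et al.'s code; the rescaling used as the constraint projection and the fixed step size are simplifications):

import numpy as np

def project_psd(M):
    # project a symmetric matrix onto the PSD cone by clipping negative eigenvalues
    evals, evecs = np.linalg.eigh((M + M.T) / 2)
    return (evecs * np.maximum(evals, 0)) @ evecs.T

def mmc_si(X, S, D, steps=200, lr=0.1):
    # X: (n, d) data; S, D: lists of (i, j) similar / dissimilar index pairs
    d = X.shape[1]
    M = np.eye(d)
    for _ in range(steps):
        # gradient step: increase the sum of Mahalanobis distances over D
        grad = np.zeros((d, d))
        for i, j in D:
            diff = (X[i] - X[j])[:, None]
            dist = np.sqrt(max(float(diff.T @ M @ diff), 1e-12))
            grad += (diff @ diff.T) / (2 * dist)
        M = M + lr * grad
        # enforce the similar-pair constraint: sum of squared distances over S <= 1
        s = sum(float((X[i] - X[j]) @ M @ (X[i] - X[j])) for i, j in S)
        if s > 1:
            M = M / s
        # project back onto the PSD cone
        M = project_psd(M)
    return M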
58. Mahalanobis Metric Learning Example I
(a) Distribution of the original dataset; (b) data rescaled by the learned global metric
- Keep all the data points within the same class close
- Separate all the data points from different classes
59. Mahalanobis Metric Learning Example II
(b) Rescaling by the learned full M; (c) rescaling by the learned diagonal M
- A diagonal distance metric M can simplify computation, but could lead to disastrous results
60. Summary of Xing et al. 2002
- Learns Mahalanobis metric
- Well suited for clustering
- Can be kernelized
- Optimization problem is convex
- Algorithm is guaranteed to converge
- Assumes the data to be unimodal
61. POLA (Pseudo-metric online learning algorithm)
(Shalev-Shwartz et al., ICML 2004)
62. POLA (Pseudo-metric online learning algorithm)
This time the inputs are accessed two at a time.
63. POLA (Pseudo-metric online learning algorithm)
Differently labeled inputs are separated.
64. POLA (Pseudo-metric online learning algorithm)
65. POLA (Pseudo-metric online learning algorithm)
Similarly labeled inputs are moved closer.
66. Margin
67. Convex optimization
Both constraints are convex!
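Concretely (a reconstruction of the POLA constraints, not copied from the slide): the algorithm learns a matrix M and a threshold b, and each input pair (x_i, x_j) with pair label y = +1 (same class) or y = -1 (different class) should satisfy a unit margin:

\[
y \left( b - (x_i - x_j)^\top M (x_i - x_j) \right) \;\ge\; 1,
\qquad M \succeq 0 .
\]

For a fixed pair, the margin constraint is a half-space in (M, b), and the PSD cone is convex, so both constraint sets are convex.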
68. Alternating Projection
Initialize inside the PSD cone.
Project onto the constraint-satisfying hyperplane and back.
69. Alternating Projection
Initialize inside the PSD cone.
Project onto the constraint-satisfying hyperplane and back.
Repeat with new constraints.
70. Alternating Projection
Initialize inside the PSD cone.
Project onto the constraint-satisfying hyperplane and back.
Repeat with new constraints.
If a solution exists, the algorithm converges inside the intersection (a sketch follows below).
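In numpy, one POLA-style online step might look like this (an illustrative sketch; the exact projection step size and the threshold handling follow one reading of the method, not the slides themselves):

import numpy as np

def project_psd(M):
    # project a symmetric matrix onto the PSD cone
    evals, evecs = np.linalg.eigh((M + M.T) / 2)
    return (evecs * np.maximum(evals, 0)) @ evecs.T

def pola_update(M, b, xi, xj, y):
    # one online step; y = +1 if (xi, xj) are similarly labeled, -1 otherwise
    v = xi - xj
    A = np.outer(v, v)                       # squared distance = <M, A> (Frobenius)
    margin = y * (b - float(v @ M @ v))
    loss = max(0.0, 1.0 - margin)            # hinge loss on the unit margin
    if loss > 0:
        # project (M, b) onto the hyperplane where the margin constraint is tight
        alpha = loss / (np.sum(A * A) + 1.0)
        M = M - alpha * y * A
        b = b + alpha * y
    M = project_psd(M)                       # project back onto the PSD cone
    return M, max(b, 1.0)                    # keep the threshold at least 1

# usage on a stream of (hypothetical) labeled pairs
rng = np.random.default_rng(0)
M, b = np.eye(3), 1.0
for _ in range(100):
    xi, xj = rng.normal(size=3), rng.normal(size=3)
    y = 1 if rng.random() < 0.5 else -1
    M, b = pola_update(M, b, xi, xj, y)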
71. Theoretical Guarantees
Provided a global solution exists:
- The online version has an upper bound on the accumulated violation of the threshold.
- The batch version converges after a finite number of passes over the data.
72. Summary of POLA
- Learns Mahalanobis metric
- Online algorithm
- Can also be kernelized
- Introduces a margin
- Algorithm converges if solution exists
- Assumes data to be unimodal
73. Neighborhood Component Analysis
Distance metric for visualization and kNN
(Goldberger et al., 2004)