Title: Advanced Machine Learning
1 Advanced Machine Learning & Perception
Instructor: Tony Jebara
2 Topic 13
- Manifolds Continued and Spectral Clustering
- Convex Invariance Learning (CoIL)
- Kernel PCA (KPCA)
- Spectral Clustering & N-Cuts
3 Manifolds Continued
- PCA: linear manifold
- MDS: get inter-point distances, find 2D data with the same distances
- LLE: mimic neighborhoods using low-dimensional vectors
- GTM: fit a grid of Gaussians to data via nonlinear warp
- Linear PCA after nonlinear normalization/invariance of data
- Manifold is linear PCA in Hilbert space (kernels)
- Spectral clustering in Hilbert space
4 Convex Invariance Learning
- PCA is appropriate for finding a linear manifold
- Variation in data is only modeled linearly
- But, many problems are nonlinear
- However, the nonlinear variations may be irrelevant:
  - Images: morph, rotate, translate, zoom
  - Audio: pitch changes, ambient acoustics
  - Video: motion, camera view, angles
  - Genomics: proteins fold, insertions, deletions
  - Databases: fields swapped, formats, scaled
- Imagine a Gremlin is corrupting your data by multiplying each input vector Xt by a type of matrix At to give AtXt
- Idea: remove nonlinear irrelevant variations before PCA
- But, make this part of the PCA optimization, not pre-processing
5 Convex Invariance Learning
- Example of irrelevant variation in our data:
- Permutation in image data: each image Xt is multiplied by a permutation matrix At by the gremlin. Must clean it.
- When we convert images to a vector, we are assuming an arbitrary, meaningless ordering (like the gremlin mixing the order)
- This arbitrary ordering causes wild nonlinearities (manifold)
- We should not trust the ordering; assume the gremlin has permuted it with an arbitrary permutation matrix
6 Permutation Invariance
- Permutation is irrelevant variation in our data.
- The gremlin is permuting fields in our input vectors by a matrix At. Must clean it.
- So, view a datum as a Bag of Vectors instead of a single vector
- I.e. a grayscale image becomes a Set of Vectors or Bag of Pixels: N pixels, each a D=3 (X, Y, I) tuple
- Treat each input as a permutable Bag of Pixels (a small conversion sketch follows)
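Below is a minimal sketch (not from the slides; the function name, threshold, and subsampling are my own choices) of turning a grayscale image into the Bag of Pixels representation described above, assuming the image is a NumPy array with intensities in [0, 1]:

```python
import numpy as np

def image_to_bag(img, n_pixels=200, threshold=0.1, seed=0):
    """Convert a grayscale image (2D array) into a 'Bag of Pixels':
    an unordered set of (x, y, intensity) tuples.  Only pixels above
    a threshold are kept, subsampled to at most n_pixels of them."""
    ys, xs = np.nonzero(img > threshold)           # coordinates of "ink" pixels
    intensities = img[ys, xs]
    bag = np.stack([xs, ys, intensities], axis=1)  # N x 3 matrix of (x, y, I) tuples
    rng = np.random.default_rng(seed)
    if len(bag) > n_pixels:
        bag = bag[rng.choice(len(bag), n_pixels, replace=False)]
    # The row order in `bag` is arbitrary -- exactly the ordering the
    # "gremlin" is allowed to scramble.
    return bag

# Toy usage: an 8x8 "image" with a bright diagonal
img = np.eye(8)
print(image_to_bag(img, n_pixels=5))
```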
7 Optimal Permutation
- Vectorization / rasterization uses the index in the image to sort pixels into a large vector.
- If we knew the optimal correspondence, we could fix this by sorting the pixels in the bag into a large vector more appropriately
- We don't know it; we must learn it (a small matching sketch follows)
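As a hedged illustration of what "optimal correspondence" means, the sketch below aligns one bag of pixels to a reference bag with the Hungarian algorithm (SciPy's linear_sum_assignment). This is the hard-permutation special case, not the convex program used later in the slides; the function name is my own:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_permutation(bag, reference):
    """Find the hard permutation matrix A that best aligns `bag` to
    `reference` in the least-squares sense (both are N x D arrays).
    This is the assignment problem: min_A ||A @ bag - reference||^2."""
    # cost[i, j] = squared distance between reference point i and bag point j
    cost = ((reference[:, None, :] - bag[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    A = np.zeros((len(reference), len(bag)))
    A[rows, cols] = 1.0                      # one 1 per row/column: a permutation
    return A

# Toy check: a shuffled copy of a point set is recovered exactly
rng = np.random.default_rng(0)
ref = rng.normal(size=(6, 3))
shuffled = ref[rng.permutation(6)]
A = optimal_permutation(shuffled, ref)
print(np.allclose(A @ shuffled, ref))        # True
```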
8 PCA on Permuted Data
- In non-permuted vector images, linear changes / eigenvectors are additions and deletions of intensities (bad!). Translating, raising eyebrows, etc. become erasing and redrawing.
- In bag-of-pixels (vectorized only after knowing the optimal permutation), linear changes / eigenvectors are morphings and warpings: joint spatial and intensity change
9 Permutation as a Manifold
- Assume the order is unknown: Set of Vectors or Bag of Pixels
- Get permutational invariance (order doesn't matter)
- Can't represent the invariance by a single X vector point in DxN space since we don't know the ordering
- Get permutation invariance by X spanning all possible reorderings: multiply X by an unknown A matrix (permutation or doubly-stochastic)
10 Invariant Paths as Matrix Ops
- Move a vector along the manifold by multiplying by a matrix
- Restrict A to be a permutation matrix (operator)
- The resulting manifold of configurations is an orbit if A is a group
- Or, for a smooth manifold, make A a doubly-stochastic matrix
- Endow each image in the dataset with its own transformation matrix At. Each is now a bag or manifold.
11 A Dataset of Invariant Manifolds
- E.g. assume the model is PCA: learn a 2D subspace of 3D data
- Permutation lets points move independently along paths
- Find PCA after moving the points to form a tight 2D subspace
- More generally, move along the manifolds to improve the fit of any model (PCA, SVM, probability density, etc.)
12 Optimizing the Permutations
- Optimize modeling cost + linear constraints on matrices
- Estimate transformation parameters and model parameters (PCA, Gaussian, SVM)
- Cost on matrices A emerges from the modeling criterion
- Typically, get a Convex Cost with a Convex Hull of Constraints (unique solution!)
- Since the A matrices are soft permutation matrices (doubly-stochastic), we have linear constraints: nonnegative entries with every row and every column summing to 1 (a projection sketch follows)
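As a hedged illustration of the doubly-stochastic constraint set, Sinkhorn balancing is one standard way to push a positive matrix onto (approximately) that polytope. It is shown only to make the constraints concrete and is not necessarily the solver used in CoIL:

```python
import numpy as np

def sinkhorn(M, n_iters=100):
    """Push a positive matrix towards the doubly-stochastic polytope
    (all rows and columns summing to 1) by alternately normalizing
    rows and columns (Sinkhorn balancing)."""
    A = np.asarray(M, dtype=float).copy()
    for _ in range(n_iters):
        A /= A.sum(axis=1, keepdims=True)    # each row sums to 1
        A /= A.sum(axis=0, keepdims=True)    # each column sums to 1
    return A

A = sinkhorn(np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 4)))
print(A.sum(axis=0), A.sum(axis=1))          # both close to all-ones
```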
13 Example Cost: Gaussian Mean
- Maximum Likelihood Gaussian mean model
- Theorem 1: C(A) is convex in A (Convex Program)
- Can solve via a quadratic program on the A matrices
- Minimizing the trace of a covariance tries to pull the data spherically towards a common mean (a hedged form of this cost follows)
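The slide states this cost only in words; a hedged reconstruction (my notation) of the maximum-likelihood Gaussian-mean cost, with $A_t$ the per-datum soft permutation and $A_t X_t$ the transformed datum, is

$$ C(A) \;=\; \operatorname{tr}\Big( \tfrac{1}{T} \sum_{t=1}^{T} (A_t X_t - \mu)(A_t X_t - \mu)^{\top} \Big) \;=\; \tfrac{1}{T} \sum_{t=1}^{T} \lVert A_t X_t - \mu \rVert^{2}, \qquad \mu = \tfrac{1}{T} \sum_{t=1}^{T} A_t X_t , $$

which is quadratic in the $A_t$, so minimizing it over the convex hull of permutation matrices is a quadratic program.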
14 Example Cost: Gaussian Covariance
- Theorem 2: The regularized log determinant of the covariance is convex; equivalently, minimize it (a hedged form of the cost follows)
- Theorem 3: The cost is non-quadratic but upper-boundable by a quadratic. Iteratively solve with a QP via a variational bound
- Minimizing the determinant flattens the data into a low-volume pancake
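The slide gives this cost only in words; one plausible reading (an assumption on my part, and the exact regularization used in CoIL may differ) is a regularized log-determinant of the covariance of the transformed data:

$$ C(A) \;=\; \log \Big| \, \tfrac{1}{T} \sum_{t=1}^{T} (A_t X_t - \mu)(A_t X_t - \mu)^{\top} + \epsilon I \, \Big| , \qquad \mu = \tfrac{1}{T} \sum_{t=1}^{T} A_t X_t , $$

where $\epsilon I$ is the regularizer. Minimizing the (log) determinant squeezes the data volume into the "pancake" the slide mentions, and Theorem 3 says this non-quadratic cost can be upper-bounded by a quadratic and minimized by iterating a QP.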
15 Example Cost: Fisher Discriminant
- Find a linear Fisher Discriminant model w that maximizes the ratio of between- to within-class scatter (the standard ratio is recalled below)
- For discriminative invariance, the transformation matrices should increase between-class scatter (numerator) and should reduce within-class scatter (denominator)
- Minimizing the above permutes the data to make classification easy
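The Fisher criterion itself is standard: with between-class scatter $S_B$ and within-class scatter $S_W$ computed from the transformed data $A_t X_t$, the discriminant direction $w$ maximizes

$$ J(w) \;=\; \frac{w^{\top} S_B \, w}{\, w^{\top} S_W \, w \,} , $$

and for discriminative invariance the $A_t$ are chosen to grow the numerator and shrink the denominator (the slide's "minimizing the above" presumably refers to the bounded form of this criterion that is actually optimized).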
16 Interpreting C(A)
- Maximum Likelihood Mean:
  - Permute data towards a common mean
- Maximum Likelihood Mean & Covariance:
  - Permute data towards a flat subspace
  - Pushes energy into few eigenvectors
  - Great as pre-processing before PCA
- Fisher Discriminant:
  - Permute data towards two flat subspaces while repelling away from each other's means
17 SMO Optimization of QP
- Quadratic Programming is used for all C(A) since:
  - Gaussian mean: quadratic
  - Gaussian covariance: upper-boundable by a quadratic
  - Fisher discriminant: upper-boundable by a quadratic
- Use Sequential Minimal Optimization (SMO): axis-parallel optimization; pick axes to update and ensure the constraints are not violated
- For a soft permutation matrix, the row/column constraints couple entries, so update 4 entries (a 2x2 block) at a time (a block-update sketch follows)
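A minimal sketch of the "4 entries at a time" idea: moving mass around a 2x2 sub-block of A keeps every row and column sum fixed, so an SMO-style solver can repeatedly pick such a block and take the best feasible step. The exact CoIL objective is not reproduced in the text, so this sketch uses a toy quadratic cost C(A) = 0.5*||A x - y||^2, and the function name is my own:

```python
import numpy as np

def smo_block_step(A, x, y, i, k, j, l):
    """One SMO-style update on the 2x2 block (rows i,k / cols j,l) of a
    doubly-stochastic matrix A for the toy cost C(A) = 0.5*||A @ x - y||^2.
    Adding delta to A[i,j], A[k,l] and subtracting it from A[i,l], A[k,j]
    preserves every row and column sum."""
    D = np.zeros_like(A)
    D[i, j] = D[k, l] = 1.0
    D[i, l] = D[k, j] = -1.0
    g = (A @ x - y) @ (D @ x)                 # directional derivative along D
    h = (D @ x) @ (D @ x)                     # curvature along D (>= 0)
    delta = -g / h if h > 0 else 0.0          # unconstrained optimal step
    lo = max(-A[i, j], -A[k, l], A[i, l] - 1, A[k, j] - 1)   # keep entries in [0, 1]
    hi = min(A[i, l], A[k, j], 1 - A[i, j], 1 - A[k, l])
    return A + np.clip(delta, lo, hi) * D

# Toy usage: start from the uniform doubly-stochastic matrix, take random block steps
rng = np.random.default_rng(0)
n = 4
A = np.full((n, n), 1.0 / n)
x, y = rng.normal(size=n), rng.normal(size=n)
for _ in range(200):
    i, k = rng.choice(n, 2, replace=False)
    j, l = rng.choice(n, 2, replace=False)
    A = smo_block_step(A, x, y, i, k, j, l)
print(A.sum(axis=0), A.sum(axis=1))           # still doubly stochastic
```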
18 XY Digits: Permuted PCA
- 20 images each of "3" and "9"; each is 70 (x, y) dots with no order on the dots
- PCA compresses with the same number of eigenvectors
- The Convex Program first estimates the permutation → better reconstruction
[Figure: reconstructions compared across Original, PCA, and Permuted PCA]
19 Interpolation
- Intermediate images are smooth morphs
- Points are nicely corresponded
- Spatial morphing versus redrawing; no ghosting
20 XYI Faces: Permuted PCA
[Figure: Original, PCA, and Permuted Bag-of-XYI-Pixels PCA reconstructions]
- 2000 XYI pixels, compressed to 20 dimensions
- Improves the squared error of PCA by almost 3 orders of magnitude (x10^3)
21 XYI Multi-Faces: Permuted PCA
[Figure: +/- scaling along each of the top 5 eigenvectors]
- Top 5 eigenvectors: all just linear variations in the bag of XYI pixels
- Vectorization is nonlinear and needs a huge number of eigenvectors
22 XYI Multi-Faces: Permuted PCA
[Figure: +/- scaling along each of the next 5 eigenvectors]
23 Kernel PCA
- Replace all dot-products in PCA with kernel evaluations.
- Recall: we could do PCA on the DxD covariance matrix of the data or on the NxN Gram matrix of the data
- For nonlinearity, do PCA on feature expansions
- Instead of doing an explicit feature expansion, use a kernel, i.e. a d-th order polynomial $k(x_i, x_j) = (x_i \cdot x_j)^d$
- As usual, the kernel must satisfy Mercer's theorem
- Assume, for simplicity, all feature data is zero-mean
- Eigenvalues and eigenvectors satisfy $\bar{C} v = \lambda v$, where, if the data is zero-mean, $\bar{C} = \frac{1}{N} \sum_{j=1}^{N} \phi(x_j) \phi(x_j)^{\top}$
24 Kernel PCA
- Efficiently find and use the eigenvectors of $\bar{C}$
- Can dot either side of the eigenvalue equation with a feature vector: $\lambda \, (\phi(x_k) \cdot v) = (\phi(x_k) \cdot \bar{C} v)$ for all $k$
- The eigenvectors are in the span of the feature vectors: $v = \sum_{i=1}^{N} \alpha_i \phi(x_i)$
- Combining the equations gives $N \lambda K \alpha = K^2 \alpha$
25 Kernel PCA
- From before, we had $N \lambda K \alpha = K^2 \alpha$, so it suffices to solve $N \lambda \alpha = K \alpha$: this is an eigenvalue equation!
- Get the eigenvectors $\alpha$ and eigenvalues of $K$
- The eigenvalues of $K$ are $N$ times $\lambda$
- For each eigenvector $\alpha^k$ there is an eigenvector $v^k = \sum_i \alpha_i^k \phi(x_i)$
- Want the eigenvectors $v$ to be normalized: $1 = \lVert v^k \rVert^2 = (\alpha^k)^{\top} K \alpha^k = \mu_k \lVert \alpha^k \rVert^2$, where $\mu_k = N \lambda_k$ is the eigenvalue of $K$
- Can now use the alphas only for doing PCA projection and reconstruction!
26 Kernel PCA
- To compute the k-th projection coefficient of a new point $\phi(x)$: $(v^k \cdot \phi(x)) = \sum_{i=1}^{N} \alpha_i^k \, k(x_i, x)$
- Reconstruction: the pre-image problem; a linear combination in Hilbert space goes outside the image of the input space
- Can now do nonlinear PCA, and do PCA on non-vectors
- Nonlinear KPCA eigenvectors satisfy the same properties as usual PCA, but in Hilbert space. These eigenvectors:
  - 1) Top q have maximum variance
  - 2) Top q give the reconstruction with minimum mean squared error
  - 3) Are uncorrelated/orthogonal
  - 4) Top q have maximum mutual information with the inputs
- A minimal numerical sketch of the whole recipe follows
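A minimal numerical sketch of the recipe from the last three slides, assuming a polynomial kernel and skipping the centering step covered on the next slide; the function name and toy data are my own:

```python
import numpy as np

def kernel_pca(X, x_new, degree=2, n_components=2):
    """Kernel PCA with a polynomial kernel k(a, b) = (a . b)^degree.
    Returns the projection coefficients of x_new onto the top
    n_components kernel principal components (uncentred version)."""
    K = (X @ X.T) ** degree                          # N x N Gram matrix
    mu, alphas = np.linalg.eigh(K)                   # eigenvalues/eigenvectors of K
    order = np.argsort(mu)[::-1][:n_components]      # keep the largest eigenvalues
    mu, alphas = mu[order], alphas[:, order]
    alphas = alphas / np.sqrt(mu)                    # ensures ||v^k||^2 = alpha^T K alpha = 1
    k_new = (X @ x_new) ** degree                    # k(x_i, x_new) for all i
    return alphas.T @ k_new                          # projections (v^k . phi(x_new))

# Toy usage on random 2D data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
print(kernel_pca(X, rng.normal(size=2)))
```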
27 Centering Kernel PCA
- So far, we had assumed the feature-space data was zero-mean
- We want this: $\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{N} \sum_{j=1}^{N} \phi(x_j)$
- How do we do this without touching feature space? Use kernels
- Can get the alpha eigenvectors from $\tilde{K}$ by adjusting the old $K$ (the standard centering formula is recalled below)
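The standard centering adjustment (due to Schölkopf et al.) that this slide refers to: with $\mathbf{1}_N$ the $N \times N$ matrix whose every entry is $1/N$,

$$ \tilde{K} \;=\; K \;-\; \mathbf{1}_N K \;-\; K \mathbf{1}_N \;+\; \mathbf{1}_N K \mathbf{1}_N , $$

and the $\alpha$ eigenvectors are then taken from $\tilde{K}$ instead of $K$.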
28 2D KPCA
- KPCA on a 2D dataset
- Left-to-right: kernel polynomial order goes from 1 to 3 (1 = linear PCA)
- Top-to-bottom: top eigenvector down to weaker eigenvectors
29 Kernel PCA Results
- Use the coefficients of the KPCA for training a linear SVM classifier to recognize chairs from their images.
- Use various polynomial kernel degrees, where degree 1 = linear, as in regular PCA
30 Kernel PCA Results
- Use the coefficients of the KPCA for training a linear SVM classifier to recognize characters from their images.
- Use various polynomial kernel degrees, where degree 1 = linear, as in regular PCA (the worst case in the experiments)
- Inferior performance to nonlinear SVMs (why??)
31 Spectral Clustering
- Typically, use EM or k-means to cluster N data points
- Can imagine clustering the data points only from an NxN matrix capturing their proximity information
- This is spectral clustering
- Again compute the Gram matrix using, e.g., an RBF kernel
- Example: have N pixels from an image, each x = (xcoord, ycoord, intensity) of a pixel
- The eigenvectors of the K matrix (or a slight variant) seem to capture some segmentation or clustering of the data points!
- Nonparametric form of clustering, since we didn't assume a Gaussian distribution (a minimal pipeline sketch follows)
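A minimal sketch of the pipeline this slide describes: an RBF Gram matrix on the feature vectors, the top eigenvectors of a normalized variant of it, and k-means on the resulting embedding. This follows the standard normalized spectral clustering recipe rather than the specific stabilized algorithm on the later slides; the helper name and parameters are my own:

```python
import numpy as np

def spectral_clustering(X, n_clusters=2, sigma=1.0, n_iters=50, seed=0):
    """Cluster the rows of X (e.g. (x, y, intensity) pixel features) from
    an N x N RBF Gram matrix: embed with the top eigenvectors of the
    normalized affinity D^{-1/2} K D^{-1/2}, then run k-means on the rows."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))                   # RBF affinity (Gram) matrix
    d = K.sum(axis=1)
    L = K / np.sqrt(np.outer(d, d))                      # normalized affinity
    vals, vecs = np.linalg.eigh(L)
    E = vecs[:, -n_clusters:]                            # top eigenvectors as embedding
    E /= np.linalg.norm(E, axis=1, keepdims=True)        # row-normalize
    # plain k-means (Lloyd's algorithm) on the embedded points
    rng = np.random.default_rng(seed)
    centers = E[rng.choice(len(E), n_clusters, replace=False)]
    for _ in range(n_iters):
        labels = ((E[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([E[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(n_clusters)])
    return labels

# Toy usage: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(spectral_clustering(X, n_clusters=2))
```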
32 Stability in Spectral Clustering
- A standard problem when computing with eigenvectors: small changes in the data can cause the eigenvectors to change wildly
- Ensure the eigenvectors we keep are distinct and stable: look at the eigengap (a small check is sketched below)
- Some algorithms ensure the eigenvectors are going to have a safe eigengap. Adjust or process the Gram matrix to ensure the eigenvectors are still stable.
[Figure: eigenvalue spectra where keeping 3 evecs is unsafe vs. safe, showing the eigengap]
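A small illustration of the eigengap check; the threshold value and function name are arbitrary illustrations, not from the slides:

```python
import numpy as np

def safe_num_eigenvectors(K, max_q=5, min_gap=0.1):
    """Return the largest q <= max_q such that the eigengap between the
    q-th and (q+1)-th largest eigenvalues of K exceeds min_gap."""
    vals = np.sort(np.linalg.eigvalsh(K))[::-1]       # descending eigenvalues
    gaps = vals[:max_q] - vals[1:max_q + 1]           # gap after each of the top q
    safe = np.nonzero(gaps > min_gap)[0]
    return int(safe.max()) + 1 if len(safe) > 0 else 0
```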
33 Stabilized Spectral Clustering
- Stabilized spectral clustering algorithm
34 Stabilized Spectral Clustering
- Example results compared to other clustering algorithms (traditional k-means, unstable spectral clustering, connected components).