Advanced Machine Learning & Perception - PowerPoint PPT Presentation

About This Presentation
Title: Advanced Machine Learning & Perception
Description: Advanced Machine Learning & Perception. Instructor: Tony Jebara

Slides: 35
Provided by: jeb93
Transcript and Presenter's Notes

1
Advanced Machine Learning & Perception
Instructor: Tony Jebara
2
Topic 13
  • Manifolds Continued and Spectral Clustering
  • Convex Invariance Learning (CoIL)
  • Kernel PCA (KPCA)
  • Spectral Clustering & N-Cuts

3
Manifolds Continued
  • PCA: linear manifold
  • MDS: get inter-point distances, find 2D data with
    the same distances
  • LLE: mimic neighborhoods using low-dimensional
    vectors
  • GTM: fit a grid of Gaussians to data via a
    nonlinear warp
  • Linear PCA after nonlinear
    normalization/invariance of data
  • Manifold is linear PCA in Hilbert space (kernels)
  • Spectral Clustering in Hilbert space

4
Convex Invariance Learning
  • PCA is appropriate for finding a linear manifold
  • Variation in data is only modeled linearly
  • But many problems are nonlinear
  • However, the nonlinear variations may be
    irrelevant:
  • Images: morph, rotate, translate, zoom
  • Audio: pitch changes, ambient acoustics
  • Video: motion, camera view, angles
  • Genomics: proteins fold, insertions,
    deletions
  • Databases: fields swapped, formats, scaled
  • Imagine a gremlin is corrupting your data by
    multiplying each input vector Xt by a type of
    matrix At to give AtXt
  • Idea: remove nonlinear irrelevant variations
    before PCA
  • But make this part of the PCA optimization, not
    pre-processing

5
Convex Invariance Learning
  • Example of irrelevant variation in our data:
    permutation in image data. Each image Xt is
    multiplied by a permutation matrix At by the
    gremlin. Must clean it.
  • When we convert images to a vector, we are
    assuming an arbitrary, meaningless ordering
    (like the gremlin mixing the order)
  • This arbitrary ordering causes wild
    nonlinearities (manifold)
  • We should not trust the ordering; assume the
    gremlin has permuted it with an arbitrary
    permutation matrix
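A minimal sketch of the gremlin picture, assuming a four-element vector and a hand-picked permutation (both made up for illustration; `permutation_matrix` is a hypothetical helper, not from the slides). It shows that a known permutation matrix can be undone by its transpose; the hard part the slides address is that At is unknown.

```python
import numpy as np

def permutation_matrix(order):
    """Return the matrix P such that (P @ x)[i] == x[order[i]]."""
    return np.eye(len(order))[list(order)]

x = np.array([10.0, 20.0, 30.0, 40.0])   # a tiny "image" flattened to a vector
A = permutation_matrix([2, 0, 3, 1])     # the gremlin's permutation (assumed)
x_corrupted = A @ x                      # what we observe: At Xt

# A permutation matrix is orthogonal, so its transpose undoes it.
x_cleaned = A.T @ x_corrupted
```

The corrupted vector holds the same values in a different order, which is exactly the variation the later slides try to remove inside the model fit rather than in pre-processing.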

6
Permutation Invariance
  • Permutation is irrelevant variation in our data.
  • Gremlin is permuting fields in our input vectors
  • So, view a datum as a Bag of Vectors instead of
    a single vector
  • i.e. grayscale image = Set of Vectors or Bag
    of Pixels
  • N pixels, each is a D=3 XYI tuple
  • Treat each input as a permutable Bag of Pixels
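As a rough illustration of the Bag of Pixels view, the sketch below turns a grayscale image into N rows of (x, y, intensity). The function name and the 2x2 test image are assumptions made for this example.

```python
import numpy as np

def image_to_bag_of_pixels(image):
    """Convert an H x W grayscale image into an (H*W) x 3 array of XYI tuples."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.column_stack([xs.ravel(), ys.ravel(), image.ravel()]).astype(float)

image = np.array([[0.0, 0.5],
                  [1.0, 0.2]])
bag = image_to_bag_of_pixels(image)      # N = 4 pixels, D = 3

# Reordering the rows changes the vectorized form but not the bag itself.
shuffled = bag[[3, 0, 2, 1]]
same_bag = sorted(map(tuple, bag)) == sorted(map(tuple, shuffled))
```

Because each row carries its own coordinates, the image content survives any row reordering, which is why the ordering can be treated as a free variable.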

7
Optimal Permutation
  • Vectorization / rasterization uses the index in
    the image to sort pixels into a large vector.
  • If we knew the optimal correspondence, we could
    sort the pixels in the bag into a large vector
    more appropriately
  • We don't know it, so we must learn it

8
PCA on Permuted Data
  • In non-permuted vector images, linear changes
    (eigenvectors) are additions and deletions of
    intensities (bad!). Translating, raising
    eyebrows, etc. = erasing and redrawing
  • In bag of pixels (vectorized only after knowing
    the optimal permutation), linear changes
    (eigenvectors) are morphings, warpings, joint
    spatial and intensity change

9
Permutation as a Manifold
  • Assume order unknown: Set of Vectors or Bag of
    Pixels
  • Get permutational invariance (order doesn't
    matter)
  • Can't represent invariance by a single X vector
    point in DxN space since we don't know the
    ordering
  • Get permutation invariance by X spanning all
    possible reorderings. Multiply X by an unknown A
    matrix (permutation or doubly-stochastic)

10
Invariant Paths as Matrix Ops
  • Move vector along manifold by multiplying by a
    matrix
  • Restrict A to be a permutation matrix (operator)
  • Resulting manifold of configurations is an orbit
    if A is a group
  • Or, for a smooth manifold, make A a
    doubly-stochastic matrix
  • Endow each image in the dataset with its own
    transformation matrix. Each is now a bag or
    manifold
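One way to see why doubly-stochastic matrices give a smooth manifold: a convex combination of permutation matrices is doubly-stochastic, so it traces a continuous path between two orderings. The sketch below assumes a bag of three 2D points stored as rows; `is_doubly_stochastic` and `soft_path` are illustrative names.

```python
import numpy as np

def is_doubly_stochastic(A, tol=1e-9):
    """True if all entries are >= 0 and every row and column sums to 1."""
    A = np.asarray(A, dtype=float)
    return bool(
        (A >= -tol).all()
        and np.allclose(A.sum(axis=0), 1.0, atol=tol)
        and np.allclose(A.sum(axis=1), 1.0, atol=tol)
    )

P_identity = np.eye(3)
P_cycle = np.eye(3)[[1, 2, 0]]            # a cyclic reordering of the rows

# Bag of N=3 points in D=2; rows are the points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def soft_path(t):
    """Doubly-stochastic matrix at position t in [0, 1] between two orderings."""
    return (1.0 - t) * P_identity + t * P_cycle

A_half = soft_path(0.5)
X_half = A_half @ X                       # each row is a blend of two points
```

At t = 0 and t = 1 the matrix is a hard permutation; in between, each point slides continuously toward the point it will be swapped with.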

11
A Dataset of Invariant Manifolds
  • E.g. assume model is PCA, learn 2D subspace of 3D
    data
  • Permutation lets points move independently along
    paths
  • Find PCA after moving to form tight 2D subspace
  • More generally, move along manifolds to improve
    fit of any model (PCA, SVM, probability density,
    etc.)

12
Optimizing the Permutations
  • Optimize modeling cost with linear constraints on
    matrices
  • Estimate transformation parameters and model
    parameters (PCA, Gaussian, SVM)
  • Cost on matrices A emerges from modeling
    criterion
  • Typically, get a convex cost with a convex hull
    of constraints (unique!)
  • Since A matrices are soft permutation matrices
    (doubly-stochastic), we have linear constraints:
    every entry is nonnegative, every row sums to 1,
    and every column sums to 1

13
Example Cost: Gaussian Mean
  • Maximum Likelihood Gaussian Mean Model
  • Theorem 1: C(A) is convex in A (Convex Program)
  • Can solve via a quadratic program on the A
    matrices
  • Minimizing the trace of a covariance tries to
    pull the data spherically towards a common mean
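The slide's formula for C(A) is not preserved in this transcript. The sketch below assumes the cost is the trace of the covariance of the vectorized, transformed bags, which matches the last bullet; the toy bags and permutations are invented for illustration.

```python
import numpy as np

def gaussian_mean_cost(bags, mats):
    """Trace of the covariance of the vectorized, transformed bags At Xt."""
    Y = np.stack([(A @ X).ravel() for A, X in zip(mats, bags)])
    Y_centered = Y - Y.mean(axis=0)
    return float((Y_centered ** 2).sum() / len(Y))

def permutation_matrix(order):
    return np.eye(len(order))[list(order)]

base = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 5.0], [30.0, 0.0], [40.0, 5.0]])
orders = [[0, 1, 2, 3, 4], [1, 0, 2, 3, 4], [4, 3, 2, 1, 0]]
gremlins = [permutation_matrix(o) for o in orders]

# Each observed bag is a slightly shifted copy of `base` with its rows shuffled.
bags = [P @ (base + 0.01 * t) for t, P in enumerate(gremlins)]

cost_as_observed = gaussian_mean_cost(bags, [np.eye(5)] * 3)
cost_after_fix = gaussian_mean_cost(bags, [P.T for P in gremlins])
```

With the correct permutations the bags line up and the cost collapses to the small shift between copies. In the method described on the slides, the matrices would be found by minimizing this cost over doubly-stochastic A rather than being given.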

14
Example Cost: Gaussian Covariance
  • Theorem 2: Regularized log determinant of the
    covariance is convex. Equivalently, minimize
  • Theorem 3: Cost is non-quadratic but
    upper-boundable by a quadratic. Iteratively
    solve with QP with variational bound
  • Minimizing the determinant flattens data into a
    low-volume pancake
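To make the pancake intuition concrete, here is a small sketch of a regularized log-determinant cost. The regularizer `eps` and the two toy point clouds are assumptions; the exact form used on the slide is not preserved in the transcript.

```python
import numpy as np

def logdet_cost(Y, eps=1e-3):
    """log det of (covariance + eps * I) for the rows of Y."""
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T @ Yc / len(Y)
    sign, logdet = np.linalg.slogdet(cov + eps * np.eye(Y.shape[1]))
    return float(logdet)

# Six points spread equally along all three axes.
round_cloud = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                        [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
# The same points squashed along the third axis.
pancake = round_cloud * np.array([1.0, 1.0, 0.01])

cost_round = logdet_cost(round_cloud)
cost_pancake = logdet_cost(pancake)
```

The determinant is the product of the eigenvalues, so shrinking even one direction lowers the cost sharply. That is why this criterion favors data lying near a low-dimensional subspace.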

15
Example Cost: Fisher Discriminant
  • Find linear Fisher Discriminant model w that
    maximizes the ratio of between-class to
    within-class scatter
  • For discriminative invariance, transformation
    matrices should increase between-class scatter
    (numerator) and reduce within-class scatter
    (denominator)
  • Minimizing above permutes data to make
    classification easy
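For reference, a minimal sketch of the plain Fisher discriminant on fixed (already aligned) data, using the textbook closed form w = Sw^{-1}(m1 - m0). The synthetic two-class data is an assumption; the slides additionally optimize the transformation matrices, which this example does not attempt.

```python
import numpy as np

def scatter_matrices(X0, X1):
    """Within-class scatter Sw and between-class scatter Sb for two classes."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    d = (m1 - m0)[:, None]
    Sb = d @ d.T
    return Sw, Sb, m0, m1

def fisher_ratio(w, Sw, Sb):
    """Between-class over within-class scatter along direction w."""
    return float(w @ Sb @ w) / float(w @ Sw @ w)

rng = np.random.default_rng(0)
scale = np.array([3.0, 0.5])
X0 = rng.normal(size=(20, 2)) * scale
X1 = rng.normal(size=(20, 2)) * scale + np.array([1.0, 2.0])

Sw, Sb, m0, m1 = scatter_matrices(X0, X1)
w_fisher = np.linalg.solve(Sw, m1 - m0)   # maximizer of the ratio
best_ratio = fisher_ratio(w_fisher, Sw, Sb)
```

Any other direction gives a ratio no larger than `best_ratio`. Transformations that tighten each class or push the means apart raise the achievable ratio.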

16
Interpreting C(A)
  • Maximum Likelihood Mean:
    permute data towards a common mean
  • Maximum Likelihood Mean & Covariance:
    permute data towards a flat subspace;
    pushes energy into few eigenvectors;
    great as pre-processing before PCA
  • Fisher Discriminant:
    permute data towards two flat subspaces while
    repelling away from each other's means

17
SMO Optimization of QP
  • Quadratic Programming is used for all C(A) since:
  • Gaussian Mean: quadratic
  • Gaussian Covariance: upper-boundable by a
    quadratic
  • Fisher Discriminant: upper-boundable by a
    quadratic
  • Use Sequential Minimal Optimization:
    axis-parallel optimization, pick axes to update,
    ensure constraints are not violated
  • Soft permutation matrix: 4 constraints
    or 4 entries at a time
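The sketch below illustrates why four entries are updated together: changing the corners of a 2x2 sub-block in a balanced way leaves every row and column sum untouched. Only the feasibility step is shown. Choosing `step` to minimize the quadratic cost is omitted, and `four_entry_update` is a hypothetical name.

```python
import numpy as np

def four_entry_update(A, i, k, j, l, step):
    """Move `step` of mass onto (i,j),(k,l) and off (i,l),(k,j).

    Row and column sums are unchanged; `step` is clipped so no entry
    goes negative.
    """
    lo = -min(A[i, j], A[k, l])
    hi = min(A[i, l], A[k, j])
    step = float(np.clip(step, lo, hi))
    B = A.copy()
    B[i, j] += step
    B[k, l] += step
    B[i, l] -= step
    B[k, j] -= step
    return B

A = np.full((3, 3), 1.0 / 3.0)            # the "softest" permutation matrix
B = four_entry_update(A, i=0, k=1, j=0, l=1, step=0.5)
```

Here the requested step of 0.5 is clipped to 1/3, the largest move that keeps all entries nonnegative, so `B` stays doubly-stochastic.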

18
XY Digits Permuted PCA
20 images of 3 and 9. Each is 70 (x,y) dots. No
order on the dots. PCA compresses with the
same number of eigenvectors. Convex program
first estimates the permutation, giving better
reconstruction.
Figure panels: Original, PCA, Permuted PCA
19
Interpolation
Intermediate images are smooth morphs. Points
nicely corresponded. Spatial morphing versus
redrawing. No ghosting.
20
XYI Faces Permuted PCA
Figure panels: Original, PCA, Permuted Bag-of-XYI-Pixels PCA
2000 XYI pixels. Compress to 20 dims. Improve
squared error of PCA by almost 3 orders of
magnitude (x10^3).
21
XYI Multi-Faces Permuted PCA
+/- Scaling on Eigenvector
Top 5 Eigenvectors. All just linear variations
in bag of XYI pixels. Vectorization is nonlinear and
needs a huge number of eigenvectors.
22
XYI Multi-Faces Permuted PCA
+/- Scaling on Eigenvector
Next 5 Eigenvectors
23
Kernel PCA
  • Replace all dot-products in PCA with kernel
    evaluations.
  • Recall, could do PCA on DxD covariance matrix of
    data or NxN Gram matrix of data
  • For nonlinearity, do PCA on feature expansions
  • Instead of doing explicit feature expansion, use
    kernel
  • I.e. d-th order polynomial
  • As usual, kernel must satisfy Mercer's theorem
  • Assume, for simplicity, all feature data is
    zero-mean
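A small check of the "use kernel instead of explicit feature expansion" point, assuming the homogeneous polynomial kernel k(x, y) = (x . y)^d with d = 2 in two dimensions. The explicit feature map below is the standard one for that case; function names are illustrative.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Homogeneous polynomial kernel (x . y)^d."""
    return float(np.dot(x, y)) ** d

def explicit_features_2d_degree2(x):
    """Feature map whose dot product reproduces (x . y)^2 for 2D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_val = poly_kernel(x, y)
phi_dot = float(explicit_features_2d_degree2(x) @ explicit_features_2d_degree2(y))
```

Both routes give the same number, but the kernel never builds the expanded vector. The expansion grows quickly with the degree and input dimension, so this saving is what makes high-order polynomial PCA practical.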

If data is zero-mean: $\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T$
Evals & Evecs satisfy: $\lambda v = \bar{C} v$
24
Kernel PCA
  • Efficiently find & use eigenvectors of C-bar
  • Can dot either side of above equation with a
    feature vector
  • Eigenvectors are in the span of the feature
    vectors
  • Combine equations
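The equations for these bullets did not survive in the transcript. A sketch of the standard kernel PCA derivation they appear to follow is given here, assuming $\phi$ is the feature map, $K_{ij} = \phi(x_i)^T \phi(x_j)$ is the Gram matrix, and $\alpha$ holds the expansion coefficients (symbol names are assumptions):

```latex
\begin{align}
\lambda v &= \bar{C} v, \qquad
  \bar{C} = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\,\phi(x_i)^T
  && \text{(eigen-equation in feature space)} \\
v &= \sum_{i=1}^{N} \alpha_i\, \phi(x_i)
  && \text{(eigenvector lies in the span of the features)} \\
\lambda\, \phi(x_k)^T v &= \phi(x_k)^T \bar{C} v, \quad k = 1, \dots, N
  && \text{(dot both sides with a feature vector)} \\
N \lambda\, K \alpha &= K^2 \alpha
  \;\Longrightarrow\; N \lambda\, \alpha = K \alpha
  && \text{(combine; an eigenproblem on } K\text{)}
\end{align}
```

Every quantity in the last line is computable from kernel evaluations alone, so the feature vectors never need to be formed.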

25
Kernel PCA
  • From before, we had
  • this is an eig equation!
  • Get eigenvectors α and eigenvalues of K
  • Eigenvalues are N times λ
  • For each eigenvector α^k there is an eigenvector
    v^k
  • Want eigenvectors v to be normalized
  • Can now use alphas only for doing PCA projection
    & reconstruction!
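A minimal sketch of these steps, assuming zero-mean features as the slides do. `kpca_fit` and `kpca_project` are illustrative names. A linear kernel is used so the result can be checked against ordinary PCA; any Mercer kernel could be substituted when building `K`.

```python
import numpy as np

def kpca_fit(K, q):
    """Top-q expansion coefficients (columns of alphas) and eigenvalues lambda."""
    mu, U = np.linalg.eigh(K)              # eigenvalues of K, ascending
    top = np.argsort(mu)[::-1][:q]
    mu, U = mu[top], U[:, top]
    alphas = U / np.sqrt(mu)               # makes each v^k unit norm: alpha^T K alpha = 1
    return alphas, mu / K.shape[0]         # eigenvalues of K are N times lambda

def kpca_project(alphas, k_new):
    """Projection coefficients of a new point; k_new[i] = k(x_i, x_new)."""
    return k_new @ alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X = X - X.mean(axis=0)                     # zero-mean data
K = X @ X.T                                # linear kernel Gram matrix (N x N)
alphas, lambdas = kpca_fit(K, q=2)

# Ordinary PCA on the D x D covariance, for comparison.
evals, V = np.linalg.eigh(X.T @ X / len(X))
evals, V = evals[::-1][:2], V[:, ::-1][:, :2]

x_new = np.array([0.5, -1.0, 2.0])
coeffs_kpca = kpca_project(alphas, X @ x_new)
coeffs_pca = x_new @ V
```

The two sets of coefficients agree up to the arbitrary sign of each eigenvector. This is the sense in which linear-kernel KPCA reduces to PCA on the Gram matrix.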

26
Kernel PCA
  • To compute k-th projection coefficient of a new
    point φ(x)
  • Reconstruction
  • Pre-image problem: linear combo in Hilbert space
    goes outside
  • Can now do nonlinear PCA and do PCA on
    non-vectors
  • Nonlinear KPCA eigenvectors satisfy the same
    properties as usual PCA but in Hilbert space.
    These evecs:
  • 1) Top q have max variance
  • 2) Top q reconstruction has min mean square
    error
  • 3) Are uncorrelated/orthogonal
  • 4) Top have max mutual information with inputs

27
Centering Kernel PCA
  • So far, we had assumed the data was zero-mean
  • We want this
  • How to do this without touching feature space?
    Use kernels
  • Can get alpha eigenvectors from K-tilde by
    adjusting old K
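The adjustment itself is missing from the transcript. The sketch below uses the standard centering identity, K-tilde = K - 1K - K1 + 1K1, where 1 is the N x N matrix with every entry 1/N. It is checked with a linear kernel, where explicit centering is possible.

```python
import numpy as np

def center_gram(K):
    """Gram matrix of the feature vectors after subtracting their mean."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    return K - one @ K - K @ one + one @ K @ one

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3)) + 10.0     # deliberately not zero-mean
K = X @ X.T                            # linear kernel, so we can verify directly
K_tilde = center_gram(K)

X_centered = X - X.mean(axis=0)
K_explicit = X_centered @ X_centered.T
```

`K_tilde` matches the Gram matrix of the explicitly centered data, yet the function only touched `K`. The same call works for kernels whose feature space cannot be written down.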

28
2D KPCA
  • KPCA on 2D dataset
  • Left-to-right: kernel polynomial order goes
    from 1 to 3 (1 = linear = PCA)
  • Top-to-bottom: top evec to weaker evecs

29
Kernel PCA Results
  • Use coefficients of the KPCA for training a
    linear SVM classifier to recognize chairs from
    their images.
  • Use various polynomial kernel degrees, where
    1 = linear as in regular PCA

30
Kernel PCA Results
  • Use coefficients of the KPCA for training a
    linear SVM classifier to recognize characters
    from their images.
  • Use various polynomial kernel degrees, where
    1 = linear as in regular PCA (worst case in
    experiments)
  • Inferior performance to nonlinear SVMs (why??)

31
Spectral Clustering
  • Typically, use EM or k-means to cluster N data
    points
  • Can imagine clustering the data points only from
    an NxN matrix capturing their proximity
    information
  • This is spectral clustering
  • Again compute Gram matrix using, e.g., RBF kernel
  • Example: have N pixels from an image, each
    x = [xcoord, ycoord, intensity] of each pixel
  • From eigenvectors of K matrix (or slight
    variant), these seem to capture some
    segmentation or clustering of data points!
  • Nonparametric form of clustering since we
    didn't assume Gaussian distribution
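A toy illustration of the idea, assuming an RBF kernel with width `sigma = 1` and two well-separated groups of 2D points (all invented for this example). It reads cluster labels straight off the top eigenvectors of K. Practical variants such as N-Cuts normalize the matrix first and typically run k-means on the eigenvector rows; neither step is shown.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """N x N Gram matrix with K[i, j] = exp(-|x_i - x_j|^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # group of 4
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])              # group of 3
K = rbf_gram(X)

evals, evecs = np.linalg.eigh(K)           # ascending eigenvalues
top2 = evecs[:, ::-1][:, :2]               # two leading eigenvectors

# Each leading eigenvector is concentrated on one group, so the column
# with the larger magnitude gives the cluster index of each point.
labels = np.abs(top2).argmax(axis=1)
```

Because K is nearly block-diagonal here, each leading eigenvector acts as an indicator for one block. That is the segmentation effect the slide describes.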

32
Stability in Spectral Clustering
  • Standard problem when computing using
    eigenvectors: small changes in data can cause
    eigenvectors to change wildly
  • Ensure the eigenvectors we keep are
    distinct & stable: look at eigengap
  • Some algorithms ensure the eigenvectors
    are going to have a safe eigengap. Adjust
    or process Gram matrix to ensure
    eigenvectors are still stable.
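One simple way to "look at the eigengap" is to sort the eigenvalues and keep the eigenvectors above the largest gap. This heuristic and the toy data are assumptions for illustration; the specific stabilization procedure referred to in the slides is not preserved in the transcript.

```python
import numpy as np

def eigengap_choice(K):
    """Return (k, gaps): k eigenvectors sit above the largest eigenvalue gap."""
    evals = np.sort(np.linalg.eigvalsh(K))[::-1]   # descending
    gaps = evals[:-1] - evals[1:]
    return int(np.argmax(gaps)) + 1, gaps

# RBF Gram matrix for two well-separated groups (4 points and 3 points).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 2.0)

k, gaps = eigengap_choice(K)
```

For this data the two leading eigenvalues stand well clear of the rest, so `k` comes out as 2. Keeping a third eigenvector would mean cutting inside a cluster of nearly equal eigenvalues, where small data changes can rotate the eigenvectors freely.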

Figure labels: 3 evecs = unsafe; 3 evecs = safe; gap
33
Stabilized Spectral Clustering
  • Stabilized spectral clustering algorithm

34
Stabilized Spectral Clustering
  • Example results compared to other clustering
    algorithms (traditional k-means, unstable
    spectral clustering, connected components).