Title: Principal Component Analysis for Feature Reduction
1. Principal Component Analysis for Feature Reduction
- Jieping Ye
- Department of Computer Science and Engineering
- Arizona State University
- http://www.public.asu.edu/jye02
2. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis (PCA)
- Nonlinear PCA using Kernels
3. Review
- Unsupervised clustering
- Group similar objects together to find clusters
- Supervised classification
- Predict class labels of unseen objects based on training data
- Semi-supervised learning
- Combines labeled and unlabeled data during
training to improve performance
4. Important Characteristics of Data
- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale
- Noise
- Missing values
5. Data Preprocessing
- Aggregation: combining two or more attributes (or objects) into a single attribute (or object)
- Sampling: using a sample will work almost as well as using the entire data set, if the sample is representative
- Discretization and Binarization
- Feature creation: feature reduction and feature subset selection
Source: Huan Liu's tutorial
6. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
7. What is feature reduction?
- Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
- The criterion for feature reduction can differ based on the problem setting:
- Unsupervised setting: minimize the information loss
- Supervised setting: maximize the class discrimination
- Given a set of data points of p variables $\{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^p$
- Compute the linear transformation (projection) $G \in \mathbb{R}^{p \times d}$: $x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d$, with $d \ll p$
8. What is feature reduction?
Linear transformation from the original data to the reduced data:
$$Y = G^T X \in \mathbb{R}^{d \times n}, \qquad X \in \mathbb{R}^{p \times n},\; G \in \mathbb{R}^{p \times d},\; d \ll p.$$
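A minimal NumPy sketch of this projection step, assuming the observations are stored as columns of X; the matrix G below is just an arbitrary orthonormal basis to illustrate the shapes (PCA's optimal choice of G comes later in the lecture).

```python
import numpy as np

# Toy data: n = 200 observations of p = 10 variables, stored as columns of X.
rng = np.random.default_rng(0)
p, n, d = 10, 200, 3
X = rng.normal(size=(p, n))

# Any matrix G in R^{p x d} defines a linear feature reduction y = G^T x.
# Here G is an arbitrary orthonormal basis (via QR); PCA chooses G optimally.
G, _ = np.linalg.qr(rng.normal(size=(p, d)))

Y = G.T @ X               # reduced data, shape (d, n)
print(X.shape, Y.shape)   # (10, 200) (3, 200)
```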
9. High-dimensional data
Gene expression
Face images
Handwritten digits
10. Feature reduction versus feature selection
- Feature reduction
- All original features are used
- The transformed features are linear combinations of the original features (see the sketch below).
- Feature selection
- Only a subset of the original features are used.
- Continuous versus discrete
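To make the contrast concrete, here is a toy NumPy sketch (purely illustrative, not from the slides): feature reduction builds each new feature as a linear combination of all original features, while feature selection keeps a subset of the original columns unchanged.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))        # 100 samples, 5 original features (rows = samples)

# Feature reduction: every new feature mixes all original features.
G = rng.normal(size=(5, 2))
X_reduced = X @ G                    # shape (100, 2); columns are linear combinations

# Feature selection: keep a subset of the original features unchanged.
selected = [0, 3]
X_selected = X[:, selected]          # shape (100, 2); original columns 0 and 3
print(X_reduced.shape, X_selected.shape)
```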
11. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
12. Why feature reduction?
- Most machine learning and data mining techniques may not be effective for high-dimensional data
- Curse of Dimensionality
- Query accuracy and efficiency degrade rapidly as the dimension increases.
- The intrinsic dimension may be small.
- For example, the number of genes responsible for a certain type of disease may be small.
13. Why feature reduction?
- Visualization: projection of high-dimensional data onto 2D or 3D.
- Data compression: efficient storage and retrieval.
- Noise removal: positive effect on query accuracy.
14. Applications of feature reduction
- Face recognition
- Handwritten digit recognition
- Text mining
- Image retrieval
- Microarray data analysis
- Protein classification
15. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
16. Feature reduction algorithms
- Unsupervised
- Latent Semantic Indexing (LSI): truncated SVD
- Independent Component Analysis (ICA)
- Principal Component Analysis (PCA)
- Canonical Correlation Analysis (CCA)
- Supervised
- Linear Discriminant Analysis (LDA)
- Semi-supervised
- Research topic
17. Feature reduction algorithms
- Linear
- Latent Semantic Indexing (LSI): truncated SVD
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Canonical Correlation Analysis (CCA)
- Nonlinear
- Nonlinear feature reduction using kernels
- Manifold learning
18. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
19. What is Principal Component Analysis?
- Principal component analysis (PCA)
- Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
- Retains most of the sample's information.
- Useful for the compression and classification of data.
- By information we mean the variation present in the sample, given by the correlations between the original variables.
- The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
20. Geometric picture of principal components (PCs)
- The 1st PC is a minimum-distance fit to a line in X space
- The 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous ones.
21. Algebraic definition of PCs
Given a sample of n observations on a vector of p variables
$$x = (x_1, x_2, \ldots, x_p)^T,$$
define the first principal component of the sample by the linear transformation
$$z_1 = a_1^T x = \sum_{i=1}^{p} a_{i1} x_i,$$
where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ is chosen such that $\mathrm{var}[z_1]$ is maximum.
22. Algebraic derivation of PCs
To find $a_1$, first note that
$$\mathrm{var}[z_1] = a_1^T S a_1,$$
where
$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$
is the covariance matrix and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean.
In the following, we assume the data is centered, i.e., $\bar{x} = 0$.
23. Algebraic derivation of PCs
Assume the data is centered: $\bar{x} = 0$.
Form the matrix
$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},$$
then
$$S = \frac{1}{n} X X^T.$$
Obtain the eigenvectors of S by computing the SVD of X: $X = U \Sigma V^T$. (HW1 1)
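As an illustrative check (an addition, not from the slides): with centered data, the left singular vectors of X are the eigenvectors of S, and the eigenvalues of S are the squared singular values divided by n.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 100
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)       # center the data

S = (X @ X.T) / n                           # covariance matrix, p x p
evals, evecs = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

U, sing, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(evals, sing**2 / n))            # True
print(np.allclose(np.abs(U), np.abs(evecs)))      # True (eigenvectors match up to sign)
```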
24. Algebraic derivation of PCs
To find $a_1$ that maximizes $\mathrm{var}[z_1] = a_1^T S a_1$ subject to $a_1^T a_1 = 1$:
Let $\lambda$ be a Lagrange multiplier and maximize
$$L = a_1^T S a_1 - \lambda \,(a_1^T a_1 - 1).$$
Setting $\frac{\partial L}{\partial a_1} = 2 S a_1 - 2\lambda a_1 = 0$ gives $S a_1 = \lambda a_1$;
therefore $a_1$ is an eigenvector of S corresponding to the largest eigenvalue $\lambda = \lambda_1$.
25. Algebraic derivation of PCs
To find the next coefficient vector $a_2$, maximize $\mathrm{var}[z_2] = a_2^T S a_2$
subject to $\mathrm{cov}[z_2, z_1] = 0$ (uncorrelated with $z_1$) and to $a_2^T a_2 = 1$.
First note that
$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,$$
then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
26. Algebraic derivation of PCs
Setting $\frac{\partial L}{\partial a_2} = 2 S a_2 - 2\lambda a_2 - \phi a_1 = 0$ and premultiplying by $a_1^T$ gives $\phi = 0$ (using $a_1^T a_2 = 0$ and $a_1^T S a_2 = 0$), so
$$S a_2 = \lambda a_2.$$
27. Algebraic derivation of PCs
We find that $a_2$ is also an eigenvector of S, whose eigenvalue $\lambda = \lambda_2$ is the second largest.
In general:
- The kth largest eigenvalue $\lambda_k$ of S is the variance of the kth PC $z_k = a_k^T x$.
- The kth PC retains the kth greatest fraction of the variation in the sample.
28. Algebraic derivation of PCs
- Main steps for computing PCs (see the sketch below)
- Form the covariance matrix S.
- Compute its eigenvectors $a_1, a_2, \ldots, a_p$.
- Use the first d eigenvectors $a_1, \ldots, a_d$ to form the d PCs.
- The transformation G is given by $G = [a_1, a_2, \ldots, a_d] \in \mathbb{R}^{p \times d}$.
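A minimal NumPy sketch of these steps, assuming the observations are stored as columns of X (the helper name pca_transform is illustrative, not from the slides).

```python
import numpy as np

def pca_transform(X, d):
    """Follow the main steps above: center, form S, eigendecompose, project.

    X : (p, n) array, one observation per column (an assumption of this sketch).
    Returns G (p, d) and the reduced data Y = G^T X_centered of shape (d, n).
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                # center the data
    S = (Xc @ Xc.T) / Xc.shape[1]                # covariance matrix S
    evals, evecs = np.linalg.eigh(S)             # eigenvectors of S
    order = np.argsort(evals)[::-1]              # sort by decreasing eigenvalue
    G = evecs[:, order[:d]]                      # first d eigenvectors
    return G, G.T @ Xc

# Example: the variance of the k-th PC equals the k-th largest eigenvalue of S.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 500)) * np.array([[3.0], [2.0], [1.0], [0.5]])
G, Y = pca_transform(X, d=2)
print(Y.var(axis=1))    # approximately the two largest eigenvalues of S
```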
29. Optimality property of PCA
Diagram: original data $X$ is mapped to the reduced data $Y = G^T X$ (dimension reduction) and then reconstructed as $\hat{X} = G G^T X$ (reconstruction).
30. Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following min problem:
$$\min_{G \in \mathbb{R}^{p \times d},\; G^T G = I_d} \; \sum_{i=1}^{n} \left\| x_i - G G^T x_i \right\|_2^2 \quad \text{(reconstruction error)}$$
PCA projection minimizes the reconstruction error among all linear projections of size d.
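A small numerical illustration of this optimality (an addition, not from the slides): the PCA basis gives a reconstruction error no larger than a random orthonormal basis of the same size d.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, d = 8, 300, 2
X = rng.normal(size=(p, n)) * np.linspace(3.0, 0.2, p)[:, None]
X = X - X.mean(axis=1, keepdims=True)       # center the data

def reconstruction_error(G, X):
    # sum_i || x_i - G G^T x_i ||^2 for an orthonormal basis G
    return np.sum((X - G @ (G.T @ X)) ** 2)

# PCA basis: first d eigenvectors of the covariance matrix.
evals, evecs = np.linalg.eigh((X @ X.T) / n)
G_pca = evecs[:, np.argsort(evals)[::-1][:d]]

# Random orthonormal basis of the same size, for comparison.
G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))

print(reconstruction_error(G_pca, X) <= reconstruction_error(G_rand, X))  # True
```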
31. Applications of PCA
- Eigenfaces for recognition. Turk and Pentland, 1991.
- Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo, 2001.
- Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien, 2003.
32. PCA for image compression
Figure: reconstructions of an image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, compared with the original image.
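A sketch of how such reconstructions could be produced, assuming a grayscale image whose rows are treated as observations; this convention and the synthetic image below are assumptions of the sketch, not specified by the slides.

```python
import numpy as np

def compress_image(img, d):
    """Reconstruct a grayscale image from its first d principal components.

    img : (h, w) array; each row is treated as one observation (an assumption).
    """
    mean = img.mean(axis=0, keepdims=True)
    Xc = img - mean                              # center the rows
    S = (Xc.T @ Xc) / Xc.shape[0]                # w x w covariance matrix
    evals, evecs = np.linalg.eigh(S)
    G = evecs[:, np.argsort(evals)[::-1][:d]]    # first d eigenvectors
    return (Xc @ G) @ G.T + mean                 # project and reconstruct

# Example with a synthetic "image"; replace with a real grayscale array.
rng = np.random.default_rng(4)
img = (np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 5, 64)))
       + 0.1 * rng.normal(size=(64, 64)))
for d in (1, 2, 4, 8, 16, 32, 64):
    err = np.linalg.norm(img - compress_image(img, d))
    print(f"d={d:3d}  reconstruction error = {err:.4f}")
```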
33. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
34. Motivation
Linear projections will not detect the pattern.
35. Nonlinear PCA using Kernels
- Traditional PCA applies a linear transformation
- May not be effective for nonlinear data
- Solution: apply a nonlinear transformation to a potentially very high-dimensional space.
- Computational efficiency: apply the kernel trick.
- Requires that PCA can be rewritten in terms of dot products.
More on kernels later.
36. Nonlinear PCA using Kernels
- Rewrite PCA in terms of dot products
The covariance matrix S can be written as
$$S = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T.$$
Let v be an eigenvector of S corresponding to a nonzero eigenvalue $\lambda$. Then $S v = \lambda v$ implies
$$v = \frac{1}{n\lambda} \sum_{i=1}^{n} (x_i^T v)\, x_i = \sum_{i=1}^{n} \alpha_i x_i.$$
Eigenvectors of S lie in the space spanned by all data points.
37. Nonlinear PCA using Kernels
The covariance matrix can be written in matrix form:
$$S = \frac{1}{n} X X^T, \qquad X = [x_1, x_2, \ldots, x_n].$$
Substituting $v = X\alpha$ into $S v = \lambda v$ leads to the eigenproblem $(X^T X)\,\alpha = n\lambda\,\alpha$, which involves the data only through the dot products in $X^T X$.
Any benefits?
38. Nonlinear PCA using Kernels
- Next consider the feature space: $x \mapsto \phi(x)$
The (i,j)-th entry of the Gram matrix in feature space is $\phi(x_i)^T \phi(x_j)$.
Apply the kernel trick:
$$K_{ij} = k(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$
K is called the kernel matrix.
39. Nonlinear PCA using Kernels
- Projection of a test point x onto v:
$$v^T \phi(x) = \sum_{i=1}^{n} \alpha_i\, \phi(x_i)^T \phi(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x).$$
An explicit mapping $\phi$ is not required here.
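A compact kernel PCA sketch with an RBF kernel, following the equations above. The kernel-matrix centering step is standard in kernel PCA but is not shown on the slides, and centering of the test-point kernel vector is omitted here for brevity.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca(X, d, gamma=1.0):
    """X : (n, p) array of training points (rows). Returns a projection function."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the kernel matrix (centering in feature space).
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:d]
    # Normalize alphas so the feature-space eigenvectors v have unit norm:
    # ||v||^2 = alpha^T K alpha = lambda * alpha^T alpha.
    alphas = evecs[:, order] / np.sqrt(evals[order])

    def project(x_new):
        # v^T phi(x) = sum_i alpha_i k(x_i, x); explicit mapping phi is not needed.
        # (Centering of this test kernel vector is omitted for brevity.)
        k = rbf_kernel(x_new[None, :], X, gamma).ravel()
        return alphas.T @ k

    return project

# Example: project a test point onto the first two kernel PCs.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
project = kernel_pca(X, d=2, gamma=0.5)
print(project(rng.normal(size=3)))
```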
40. Discussions
- Relationship between PCA and factor analysis
- Probabilistic PCA
- Mixtures of probabilistic PCA
- Reference: Tipping and Bishop. Probabilistic Principal Component Analysis.
Diagram: observed variable generated from a latent (unobserved) variable.
41. References
- Principal Component Analysis. I.T. Jolliffe.
- Kernel Principal Component Analysis. Schölkopf et al.
- Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.
42. Next class
- Topics
- Review of Linear Discriminant Analysis
- Canonical Correlation Analysis
- Readings
- Canonical correlation analysis: An overview with application to learning methods
- http://citeseer.ist.psu.edu/583962.html
- The Geometry of Kernel Canonical Correlation Analysis
- http://www.kyb.tuebingen.mpg.de/publications/pdfs/pdf2233.pdf