Title: Principal Component Analysis for Feature Reduction
1. Principal Component Analysis for Feature Reduction
- Jieping Ye
- Department of Computer Science and Engineering
- Arizona State University
- http://www.public.asu.edu/jye02
2. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis (PCA)
- Nonlinear PCA using Kernels
3. Review
- Unsupervised clustering
- Group similar objects together to find clusters
- Supervised classification
- Predict class labels of unseen objects based on training data
- Semi-supervised learning
- Combines labeled and unlabeled data during
training to improve performance
4. Important Characteristics of Data
- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale
- Noise
- Missing values
5. Data Preprocessing
- Aggregation: combining two or more attributes (or objects) into a single attribute (or object)
- Sampling: using a sample will work almost as well as using the entire data set, if the sample is representative
- Discretization and Binarization
- Feature creation: feature reduction and feature subset selection
Source: Huan Liu's tutorial
6. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
7. What is feature reduction?
- Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
- The criterion for feature reduction can differ based on the problem setting:
- Unsupervised setting: minimize the information loss
- Supervised setting: maximize the class discrimination
- Given a set of data points of p variables $\{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^p$
- Compute the linear transformation (projection) $G \in \mathbb{R}^{p \times d}$: $x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d$, with $d \ll p$
8. What is feature reduction?
Linear transformation from the original data to the reduced data:
$$Y = G^T X \in \mathbb{R}^{d \times n}, \qquad X \in \mathbb{R}^{p \times n},\; G \in \mathbb{R}^{p \times d},\; d \ll p.$$
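A minimal NumPy sketch of this projection step, assuming the observations are stored as columns of X; the matrix G below is just an arbitrary orthonormal basis to illustrate the shapes (PCA's optimal choice of G comes later in the lecture).

```python
import numpy as np

# Toy data: n = 200 observations of p = 10 variables, stored as columns of X.
rng = np.random.default_rng(0)
p, n, d = 10, 200, 3
X = rng.normal(size=(p, n))

# Any matrix G in R^{p x d} defines a linear feature reduction y = G^T x.
# Here G is an arbitrary orthonormal basis (via QR); PCA chooses G optimally.
G, _ = np.linalg.qr(rng.normal(size=(p, d)))

Y = G.T @ X               # reduced data, shape (d, n)
print(X.shape, Y.shape)   # (10, 200) (3, 200)
```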
9. High-dimensional data
Gene expression
Face images
Handwritten digits
10. Feature reduction versus feature selection
- Feature reduction
- All original features are used
- The transformed features are linear combinations of the original features (see the sketch below).
- Feature selection
- Only a subset of the original features are used.
- Continuous versus discrete
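To make the contrast concrete, here is a toy NumPy sketch (purely illustrative, not from the slides): feature reduction builds each new feature as a linear combination of all original features, while feature selection keeps a subset of the original columns unchanged.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))        # 100 samples, 5 original features (rows = samples)

# Feature reduction: every new feature mixes all original features.
G = rng.normal(size=(5, 2))
X_reduced = X @ G                    # shape (100, 2); columns are linear combinations

# Feature selection: keep a subset of the original features unchanged.
selected = [0, 3]
X_selected = X[:, selected]          # shape (100, 2); original columns 0 and 3
print(X_reduced.shape, X_selected.shape)
```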
11. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
12. Why feature reduction?
- Most machine learning and data mining techniques may not be effective for high-dimensional data
- Curse of Dimensionality
- Query accuracy and efficiency degrade rapidly as the dimension increases.
- The intrinsic dimension may be small.
- For example, the number of genes responsible for a certain type of disease may be small.
13. Why feature reduction?
- Visualization: projection of high-dimensional data onto 2D or 3D.
- Data compression: efficient storage and retrieval.
- Noise removal: positive effect on query accuracy.
14. Applications of feature reduction
- Face recognition
- Handwritten digit recognition
- Text mining
- Image retrieval
- Microarray data analysis
- Protein classification
15. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
16. Feature reduction algorithms
- Unsupervised
- Latent Semantic Indexing (LSI): truncated SVD
- Independent Component Analysis (ICA)
- Principal Component Analysis (PCA)
- Canonical Correlation Analysis (CCA)
- Supervised
- Linear Discriminant Analysis (LDA)
- Semi-supervised
- Research topic
17. Feature reduction algorithms
- Linear
- Latent Semantic Indexing (LSI): truncated SVD
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Canonical Correlation Analysis (CCA)
- Nonlinear
- Nonlinear feature reduction using kernels
- Manifold learning
18. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
19. What is Principal Component Analysis?
- Principal component analysis (PCA)
- Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
- Retains most of the sample's information.
- Useful for the compression and classification of data.
- By information we mean the variation present in the sample, given by the correlations between the original variables.
- The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
20. Geometric picture of principal components (PCs)
- The 1st PC is a minimum-distance fit to a line in X space
- The 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous ones.
21. Algebraic definition of PCs
Given a sample of n observations on a vector of p variables
$$x = (x_1, x_2, \ldots, x_p)^T,$$
define the first principal component of the sample by the linear transformation
$$z_1 = a_1^T x = \sum_{i=1}^{p} a_{i1} x_i,$$
where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ is chosen such that $\mathrm{var}[z_1]$ is maximum.
22. Algebraic derivation of PCs
To find $a_1$, first note that
$$\mathrm{var}[z_1] = a_1^T S a_1,$$
where
$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$
is the covariance matrix and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean.
In the following, we assume the data is centered, i.e., $\bar{x} = 0$.
23. Algebraic derivation of PCs
Assume the data is centered: $\bar{x} = 0$.
Form the matrix
$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},$$
then
$$S = \frac{1}{n} X X^T.$$
Obtain the eigenvectors of S by computing the SVD of X: $X = U \Sigma V^T$. (HW1 1)
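As an illustrative check (an addition, not from the slides): with centered data, the left singular vectors of X are the eigenvectors of S, and the eigenvalues of S are the squared singular values divided by n.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 100
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)       # center the data

S = (X @ X.T) / n                           # covariance matrix, p x p
evals, evecs = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

U, sing, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(evals, sing**2 / n))            # True
print(np.allclose(np.abs(U), np.abs(evecs)))      # True (eigenvectors match up to sign)
```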
24. Algebraic derivation of PCs
To find $a_1$ that maximizes $\mathrm{var}[z_1] = a_1^T S a_1$ subject to $a_1^T a_1 = 1$:
Let $\lambda$ be a Lagrange multiplier and maximize
$$L = a_1^T S a_1 - \lambda \,(a_1^T a_1 - 1).$$
Setting $\frac{\partial L}{\partial a_1} = 2 S a_1 - 2\lambda a_1 = 0$ gives $S a_1 = \lambda a_1$;
therefore $a_1$ is an eigenvector of S corresponding to the largest eigenvalue $\lambda = \lambda_1$.
25. Algebraic derivation of PCs
To find the next coefficient vector $a_2$, maximize $\mathrm{var}[z_2] = a_2^T S a_2$
subject to $\mathrm{cov}[z_2, z_1] = 0$ (uncorrelated with $z_1$) and to $a_2^T a_2 = 1$.
First note that
$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,$$
then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
26. Algebraic derivation of PCs
Setting $\frac{\partial L}{\partial a_2} = 2 S a_2 - 2\lambda a_2 - \phi a_1 = 0$ and premultiplying by $a_1^T$ gives $\phi = 0$ (using $a_1^T a_2 = 0$ and $a_1^T S a_2 = 0$), so
$$S a_2 = \lambda a_2.$$
27. Algebraic derivation of PCs
We find that $a_2$ is also an eigenvector of S, whose eigenvalue $\lambda = \lambda_2$ is the second largest.
In general:
- The kth largest eigenvalue $\lambda_k$ of S is the variance of the kth PC $z_k = a_k^T x$.
- The kth PC retains the kth greatest fraction of the variation in the sample.
28. Algebraic derivation of PCs
- Main steps for computing PCs (see the sketch below)
- Form the covariance matrix S.
- Compute its eigenvectors $a_1, a_2, \ldots, a_p$.
- Use the first d eigenvectors $a_1, \ldots, a_d$ to form the d PCs.
- The transformation G is given by $G = [a_1, a_2, \ldots, a_d] \in \mathbb{R}^{p \times d}$.
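A minimal NumPy sketch of these steps, assuming the observations are stored as columns of X (the helper name pca_transform is illustrative, not from the slides).

```python
import numpy as np

def pca_transform(X, d):
    """Follow the main steps above: center, form S, eigendecompose, project.

    X : (p, n) array, one observation per column (an assumption of this sketch).
    Returns G (p, d) and the reduced data Y = G^T X_centered of shape (d, n).
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                # center the data
    S = (Xc @ Xc.T) / Xc.shape[1]                # covariance matrix S
    evals, evecs = np.linalg.eigh(S)             # eigenvectors of S
    order = np.argsort(evals)[::-1]              # sort by decreasing eigenvalue
    G = evecs[:, order[:d]]                      # first d eigenvectors
    return G, G.T @ Xc

# Example: the variance of the k-th PC equals the k-th largest eigenvalue of S.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 500)) * np.array([[3.0], [2.0], [1.0], [0.5]])
G, Y = pca_transform(X, d=2)
print(Y.var(axis=1))    # approximately the two largest eigenvalues of S
```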
29. Optimality property of PCA
Diagram: original data $X$ is mapped to the reduced data $Y = G^T X$ (dimension reduction) and then reconstructed as $\hat{X} = G G^T X$ (reconstruction).
30. Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following min problem:
$$\min_{G \in \mathbb{R}^{p \times d},\; G^T G = I_d} \; \sum_{i=1}^{n} \left\| x_i - G G^T x_i \right\|_2^2 \quad \text{(reconstruction error)}$$
PCA projection minimizes the reconstruction error among all linear projections of size d.
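A small numerical illustration of this optimality (an addition, not from the slides): the PCA basis gives a reconstruction error no larger than a random orthonormal basis of the same size d.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, d = 8, 300, 2
X = rng.normal(size=(p, n)) * np.linspace(3.0, 0.2, p)[:, None]
X = X - X.mean(axis=1, keepdims=True)       # center the data

def reconstruction_error(G, X):
    # sum_i || x_i - G G^T x_i ||^2 for an orthonormal basis G
    return np.sum((X - G @ (G.T @ X)) ** 2)

# PCA basis: first d eigenvectors of the covariance matrix.
evals, evecs = np.linalg.eigh((X @ X.T) / n)
G_pca = evecs[:, np.argsort(evals)[::-1][:d]]

# Random orthonormal basis of the same size, for comparison.
G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))

print(reconstruction_error(G_pca, X) <= reconstruction_error(G_rand, X))  # True
```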
31. Applications of PCA
- Eigenfaces for recognition. Turk and Pentland, 1991.
- Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo, 2001.
- Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien, 2003.
32. PCA for image compression
Figure: reconstructions of an image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, compared with the original image.
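A sketch of how such reconstructions could be produced, assuming a grayscale image whose rows are treated as observations; this convention and the synthetic image below are assumptions of the sketch, not specified by the slides.

```python
import numpy as np

def compress_image(img, d):
    """Reconstruct a grayscale image from its first d principal components.

    img : (h, w) array; each row is treated as one observation (an assumption).
    """
    mean = img.mean(axis=0, keepdims=True)
    Xc = img - mean                              # center the rows
    S = (Xc.T @ Xc) / Xc.shape[0]                # w x w covariance matrix
    evals, evecs = np.linalg.eigh(S)
    G = evecs[:, np.argsort(evals)[::-1][:d]]    # first d eigenvectors
    return (Xc @ G) @ G.T + mean                 # project and reconstruct

# Example with a synthetic "image"; replace with a real grayscale array.
rng = np.random.default_rng(4)
img = (np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 5, 64)))
       + 0.1 * rng.normal(size=(64, 64)))
for d in (1, 2, 4, 8, 16, 32, 64):
    err = np.linalg.norm(img - compress_image(img, d))
    print(f"d={d:3d}  reconstruction error = {err:.4f}")
```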
33. Outline of lecture
- Review
- What is feature reduction?
- Why feature reduction?
- Feature reduction algorithms
- Principal Component Analysis
- Nonlinear PCA using Kernels
34. Motivation
Linear projections will not detect the pattern.
35. Nonlinear PCA using Kernels
- Traditional PCA applies a linear transformation
- May not be effective for nonlinear data
- Solution: apply a nonlinear transformation to a potentially very high-dimensional space.
- Computational efficiency: apply the kernel trick.
- Requires that PCA can be rewritten in terms of dot products.
More on kernels later.
36. Nonlinear PCA using Kernels
- Rewrite PCA in terms of dot products
The covariance matrix S can be written as
$$S = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T.$$
Let v be an eigenvector of S corresponding to a nonzero eigenvalue $\lambda$. Then $S v = \lambda v$ implies
$$v = \frac{1}{n\lambda} \sum_{i=1}^{n} (x_i^T v)\, x_i = \sum_{i=1}^{n} \alpha_i x_i.$$
Eigenvectors of S lie in the space spanned by all data points.
37. Nonlinear PCA using Kernels
The covariance matrix can be written in matrix form:
$$S = \frac{1}{n} X X^T, \qquad X = [x_1, x_2, \ldots, x_n].$$
Substituting $v = X\alpha$ into $S v = \lambda v$ leads to the eigenproblem $(X^T X)\,\alpha = n\lambda\,\alpha$, which involves the data only through the dot products in $X^T X$.
Any benefits?
38. Nonlinear PCA using Kernels
- Next consider the feature space: $x \mapsto \phi(x)$
The (i,j)-th entry of the Gram matrix in feature space is $\phi(x_i)^T \phi(x_j)$.
Apply the kernel trick:
$$K_{ij} = k(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$
K is called the kernel matrix.
39. Nonlinear PCA using Kernels
- Projection of a test point x onto v:
$$v^T \phi(x) = \sum_{i=1}^{n} \alpha_i\, \phi(x_i)^T \phi(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x).$$
An explicit mapping $\phi$ is not required here.
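A compact kernel PCA sketch with an RBF kernel, following the equations above. The kernel-matrix centering step is standard in kernel PCA but is not shown on the slides, and centering of the test-point kernel vector is omitted here for brevity.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca(X, d, gamma=1.0):
    """X : (n, p) array of training points (rows). Returns a projection function."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    # Center the kernel matrix (centering in feature space).
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:d]
    # Normalize alphas so the feature-space eigenvectors v have unit norm:
    # ||v||^2 = alpha^T K alpha = lambda * alpha^T alpha.
    alphas = evecs[:, order] / np.sqrt(evals[order])

    def project(x_new):
        # v^T phi(x) = sum_i alpha_i k(x_i, x); explicit mapping phi is not needed.
        # (Centering of this test kernel vector is omitted for brevity.)
        k = rbf_kernel(x_new[None, :], X, gamma).ravel()
        return alphas.T @ k

    return project

# Example: project a test point onto the first two kernel PCs.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
project = kernel_pca(X, d=2, gamma=0.5)
print(project(rng.normal(size=3)))
```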
40. Discussions
- Relationship between PCA and factor analysis
- Probabilistic PCA
- Mixtures of probabilistic PCA
- Reference: Tipping and Bishop. Probabilistic Principal Component Analysis.
Diagram: observed variable generated from a latent (unobserved) variable.
41. References
- Principal Component Analysis. I.T. Jolliffe.
- Kernel Principal Component Analysis. Schölkopf et al.
- Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.
42. Next class
- Topics
- Review of Linear Discriminant Analysis
- Canonical Correlation Analysis
- Readings
- Canonical correlation analysis: An overview with application to learning methods
- http://citeseer.ist.psu.edu/583962.html
- The Geometry of Kernel Canonical Correlation Analysis
- http://www.kyb.tuebingen.mpg.de/publications/pdfs/pdf2233.pdf