Principal Component Analysis for Feature Reduction

1
Principal Component Analysis for Feature
Reduction
  • Jieping Ye
  • Department of Computer Science and Engineering
  • Arizona State University
  • http://www.public.asu.edu/~jye02

2
Outline of lecture
  • Review
  • What is feature reduction?
  • Why feature reduction?
  • Feature reduction algorithms
  • Principal Component Analysis (PCA)
  • Nonlinear PCA using Kernels

3
Review
  • Unsupervised clustering: group similar objects
    together to find clusters
  • Supervised classification: predict class labels
    of unseen objects based on training data
  • Semi-supervised learning: combines labeled and
    unlabeled data during training to improve
    performance

4
Important Characteristics of Data
  • Dimensionality: curse of dimensionality
  • Sparsity: only presence counts
  • Resolution: patterns depend on the scale
  • Noise
  • Missing values

5
Data Preprocessing
  • Aggregation: combining two or more attributes (or
    objects) into a single attribute (or object)
  • Sampling: using a sample will work almost as well
    as using the entire data set, if the sample is
    representative
  • Discretization and Binarization
  • Feature creation, Feature reduction, and Feature
    subset selection

Source: Huan Liu's tutorial
6
Outline of lecture
  • Review
  • What is feature reduction?
  • Why feature reduction?
  • Feature reduction algorithms
  • Principal Component Analysis
  • Nonlinear PCA using Kernels

7
What is feature reduction?
  • Feature reduction refers to the mapping of the
    original high-dimensional data onto a
    lower-dimensional space.
  • The criterion for feature reduction differs with
    the problem setting:
  • Unsupervised setting: minimize the information
    loss
  • Supervised setting: maximize the class
    discrimination
  • Given a set of data points of p variables
    $\{x_1, x_2, \dots, x_n\}$, $x_i \in \mathbb{R}^p$,
  • compute the linear transformation (projection)
    $G \in \mathbb{R}^{p \times d}$: $y_i = G^T x_i \in \mathbb{R}^d$, $d < p$.

8
What is feature reduction?
Original data $x \in \mathbb{R}^p$, reduced data
$y \in \mathbb{R}^d$, linear transformation
$G^T \in \mathbb{R}^{d \times p}$: $y = G^T x$.
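To make the mapping concrete, here is a minimal NumPy sketch (mine, not from
the slides) of the reduction y = G^T x; the matrix G below is just a random
orthonormal p x d matrix used to illustrate the shapes, since how PCA actually
chooses G is derived later in the lecture.

```python
# Illustrative only: a random orthonormal G stands in for the PCA transformation.
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 10, 2, 5                              # original dim, reduced dim, #points

X = rng.normal(size=(p, n))                     # data matrix, one column per point
G, _ = np.linalg.qr(rng.normal(size=(p, d)))    # orthonormal columns: G^T G = I_d

Y = G.T @ X                                     # reduced data, shape (d, n)
print(Y.shape)                                  # (2, 5)
```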
9
High-dimensional data
  • Gene expression
  • Face images
  • Handwritten digits
10
Feature reduction versus feature selection
  • Feature reduction
  • All original features are used
  • The transformed features are linear combinations
    of the original features.
  • Feature selection
  • Only a subset of the original features is used.
  • Continuous (feature reduction) versus discrete
    (feature selection); a toy illustration follows
    below.
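A tiny toy illustration of the distinction (my own; the column indices and
weights are arbitrary): selection keeps a subset of the original columns
unchanged, while reduction builds new columns as linear combinations of all of
them.

```python
# Feature selection vs. feature reduction on a toy data matrix (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))            # 4 samples, 5 original features

selected = X[:, [0, 2]]                # feature selection: keep columns 0 and 2 as-is
W = rng.normal(size=(5, 2))            # feature reduction: arbitrary weights here;
reduced = X @ W                        # PCA would choose W to preserve variance

print(selected.shape, reduced.shape)   # (4, 2) (4, 2)
```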

11
Outline of lecture
  • Review
  • What is feature reduction?
  • Why feature reduction?
  • Feature reduction algorithms
  • Principal Component Analysis
  • Nonlinear PCA using Kernels

12
Why feature reduction?
  • Most machine learning and data mining techniques
    may not be effective for high-dimensional data:
  • Curse of Dimensionality: query accuracy and
    efficiency degrade rapidly as the dimension
    increases.
  • The intrinsic dimension may be small.
  • For example, the number of genes responsible for
    a certain type of disease may be small.

13
Why feature reduction?
  • Visualization: projection of high-dimensional
    data onto 2D or 3D.
  • Data compression: efficient storage and
    retrieval.
  • Noise removal: positive effect on query accuracy.

14
Application of feature reduction
  • Face recognition
  • Handwritten digit recognition
  • Text mining
  • Image retrieval
  • Microarray data analysis
  • Protein classification

15
Outline of lecture
  • Review
  • What is feature reduction?
  • Why feature reduction?
  • Feature reduction algorithms
  • Principal Component Analysis
  • Nonlinear PCA using Kernels

16
Feature reduction algorithms
  • Unsupervised
  • Latent Semantic Indexing (LSI): truncated SVD
  • Independent Component Analysis (ICA)
  • Principal Component Analysis (PCA)
  • Canonical Correlation Analysis (CCA)
  • Supervised
  • Linear Discriminant Analysis (LDA)
  • Semi-supervised
  • Research topic

17
Feature reduction algorithms
  • Linear
  • Latent Semantic Indexing (LSI): truncated SVD
  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Canonical Correlation Analysis (CCA)
  • Nonlinear
  • Nonlinear feature reduction using kernels
  • Manifold learning

18
Outline of lecture
  • Review
  • What is feature reduction?
  • Why feature reduction?
  • Feature reduction algorithms
  • Principal Component Analysis
  • Nonlinear PCA using Kernels

19
What is Principal Component Analysis?
  • Principal component analysis (PCA)
  • Reduces the dimensionality of a data set by
    finding a new set of variables, smaller than the
    original set, that retains most of the sample's
    information.
  • Useful for the compression and classification of
    data.
  • By information we mean the variation present in
    the sample, given by the correlations between the
    original variables.
  • The new variables, called principal components
    (PCs), are uncorrelated, and are ordered by the
    fraction of the total information each retains.

20
Geometric picture of principal components (PCs)
  • the 1st PC is a minimum distance fit to
    a line in X space
  • the 2nd PC is a minimum distance fit to a
    line in the plane perpendicular to the 1st
    PC

PCs are a series of linear least squares fits to
a sample, each orthogonal to all the previous.
21
Algebraic definition of PCs
Given a sample of n observations on a vector of p
variables $x_1, x_2, \dots, x_n \in \mathbb{R}^p$,
define the first principal component of the
sample by the linear transformation
$z_1 = a_1^T x_j, \quad j = 1, 2, \dots, n,$
where the vector $a_1 = (a_{11}, a_{21}, \dots, a_{p1})^T$
is chosen such that $\mathrm{var}(z_1)$ is maximum.
22
Algebraic derivation of PCs
To find $a_1$, first note that
$\mathrm{var}(z_1) = a_1^T S a_1$, where
$S = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})^T$
is the covariance matrix and $\bar{x}$ is the sample mean.
In the following, we assume the data is centered: $\bar{x} = 0$.
23
Algebraic derivation of PCs
Assume the data is centered: $\bar{x} = 0$.
Form the matrix $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{p \times n}$,
then $S = \frac{1}{n} X X^T$.
Obtain the eigenvectors of S by computing the SVD of
X: $X = U \Sigma V^T$, so that $S = \frac{1}{n} U \Sigma^2 U^T$
and the columns of U are the eigenvectors of S.
(HW1 1)
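A short NumPy check of this step (my own sketch, not part of the slides): it
centers a random data matrix X, computes its SVD, and confirms that the squared
singular values divided by n equal the eigenvalues of $S = \frac{1}{n} X X^T$;
the columns of U are then the corresponding eigenvectors of S.

```python
# Eigenvectors/eigenvalues of S = (1/n) X X^T obtained from the SVD of centered X.
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 50
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)            # center the data

U, sing_vals, Vt = np.linalg.svd(X, full_matrices=False)
eigvals_from_svd = sing_vals**2 / n              # eigenvalues of S, descending

S = (X @ X.T) / n
eigvals_direct = np.linalg.eigvalsh(S)[::-1]     # eigenvalues of S, descending

print(np.allclose(eigvals_from_svd, eigvals_direct))   # True
# The columns of U are the corresponding eigenvectors of S (up to sign).
```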
24
Algebraic derivation of PCs
To find the $a_1$ that maximizes $a_1^T S a_1$
subject to $a_1^T a_1 = 1$,
let $\lambda$ be a Lagrange multiplier and maximize
$a_1^T S a_1 - \lambda (a_1^T a_1 - 1)$.
Setting the derivative with respect to $a_1$ to zero gives
$S a_1 = \lambda a_1$, so $a_1$ is an eigenvector of S;
therefore, since $\mathrm{var}(z_1) = a_1^T S a_1 = \lambda$,
$a_1$ is the eigenvector corresponding to the largest eigenvalue $\lambda_1$.
25
Algebraic derivation of PCs
To find the next coefficient vector $a_2$,
maximize $\mathrm{var}(z_2) = a_2^T S a_2$,
subject to $z_2$ being uncorrelated with $z_1$, i.e. $\mathrm{cov}(z_2, z_1) = 0$,
and to $a_2^T a_2 = 1$.
First note that $\mathrm{cov}(z_2, z_1) = a_1^T S a_2 = \lambda_1 a_1^T a_2$,
then let $\lambda$ and $\phi$ be Lagrange multipliers, and
maximize $a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1$.
26
Algebraic derivation of PCs
Setting the derivative with respect to $a_2$ to zero gives
$2 S a_2 - 2 \lambda a_2 - \phi a_1 = 0$.
Left-multiplying by $a_1^T$ and using $a_1^T a_1 = 1$,
$a_1^T a_2 = 0$, and $a_1^T S a_2 = \lambda_1 a_1^T a_2 = 0$
yields $\phi = 0$, and hence $S a_2 = \lambda a_2$.
27
Algebraic derivation of PCs
We find that $a_2$ is also an eigenvector of
S, whose eigenvalue $\lambda_2$ is the second
largest.
In general:
  • The kth largest eigenvalue $\lambda_k$ of S is the
    variance of the kth PC $z_k = a_k^T x$.
  • The kth PC retains the kth greatest
    fraction of the variation in the sample
    (a numerical check follows below).
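A quick numerical check of the statement above (a sketch on random data, not
from the slides): project centered data onto all eigenvectors of S and compare
the per-component variances with the eigenvalues.

```python
# Check: the variance of the kth PC equals the kth largest eigenvalue of S.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 300))
Xc = X - X.mean(axis=1, keepdims=True)
S = Xc @ Xc.T / Xc.shape[1]

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]                    # sort eigenvalues descending
Z = eigvecs[:, order].T @ Xc                         # rows are the PCs z_1, ..., z_p
print(np.allclose(Z.var(axis=1), eigvals[order]))    # True (np.var uses 1/n, like S)
```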

28
Algebraic derivation of PCs
  • Main steps for computing PCs (a sketch of these
    steps follows below):
  • Form the covariance matrix S.
  • Compute its eigenvectors $a_1, a_2, \dots, a_p$.
  • Use the first d eigenvectors $a_1, \dots, a_d$ to
    form the d PCs.
  • The transformation G is given by
    $G = [a_1, a_2, \dots, a_d] \in \mathbb{R}^{p \times d}$.
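A minimal NumPy sketch of these steps (the function and variable names are
mine): center the data, form S, take the eigenvectors of the d largest
eigenvalues as G, and apply $G^T$.

```python
# PCA main steps: covariance matrix -> eigenvectors -> top-d transformation G.
import numpy as np

def pca_transform(X, d):
    """X: p x n data matrix (columns are observations); returns G and the d PCs."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    S = (Xc @ Xc.T) / Xc.shape[1]               # covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # re-order: largest first
    G = eigvecs[:, order[:d]]                   # first d eigenvectors, p x d
    return G, G.T @ Xc                          # transformation and reduced data

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 100))
G, Y = pca_transform(X, d=2)
print(G.shape, Y.shape)                         # (5, 2) (2, 100)
```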

29
Optimality property of PCA
Original data $x_i \in \mathbb{R}^p$; dimension reduction $y_i = G^T x_i \in \mathbb{R}^d$;
reconstruction $\hat{x}_i = G\, y_i = G G^T x_i \in \mathbb{R}^p$.
30
Optimality property of PCA
Main theoretical result:
The matrix G consisting of the first d
eigenvectors of the covariance matrix S solves
the following min problem:
$\min_{G \in \mathbb{R}^{p \times d},\; G^T G = I_d} \; \sum_{i=1}^{n} \| x_i - G G^T x_i \|_2^2$
(reconstruction error).
PCA projection minimizes the reconstruction error
among all linear projections of size d.
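The optimality claim can be checked numerically; the sketch below (my own,
using random Gaussian data and a random orthonormal basis as the competitor)
compares the reconstruction error of the PCA basis with that of an arbitrary
rank-d orthonormal projection.

```python
# PCA basis vs. an arbitrary orthonormal basis of the same size d.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 200))
Xc = X - X.mean(axis=1, keepdims=True)
d = 3

eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
G_pca = eigvecs[:, np.argsort(eigvals)[::-1][:d]]        # top-d eigenvectors of S
G_rand, _ = np.linalg.qr(rng.normal(size=(8, d)))        # random orthonormal basis

def recon_error(G, Xc):
    return np.sum((Xc - G @ (G.T @ Xc))**2)              # sum_i ||x_i - G G^T x_i||^2

print(recon_error(G_pca, Xc) <= recon_error(G_rand, Xc))  # True
```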
31
Applications of PCA
  • Eigenfaces for recognition. Turk and Pentland.
    1991.
  • Principal Component Analysis for clustering gene
    expression data. Yeung and Ruzzo. 2001.
  • Probabilistic Disease Classification of
    Expression-Dependent Proteomic Data from Mass
    Spectrometry of Human Serum. Lilien. 2003.

32
PCA for image compression
Reconstructions of the original image using d = 1, 2,
4, 8, 16, 32, 64, and 100 principal components.
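The same reconstruction idea drives the compression example; below is a small
sketch on a synthetic 64 x 64 "image" (a real photo would normally be used,
and the helper name is mine): each row is treated as an observation and
reconstructed from its first d principal components, with the error shrinking
as d grows.

```python
# PCA image compression sketch: keep d principal components of the rows.
import numpy as np

def pca_compress(img, d):
    mean = img.mean(axis=0, keepdims=True)
    A = img - mean                               # center each column
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    G = Vt[:d].T                                 # top-d principal directions
    return (A @ G) @ G.T + mean                  # reconstruct from d components

rng = np.random.default_rng(6)
img = np.add.outer(np.arange(64.0), np.arange(64.0)) + rng.normal(scale=5.0, size=(64, 64))
for d in (1, 2, 4, 8, 16):
    print(d, round(float(np.linalg.norm(img - pca_compress(img, d))), 2))
```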
33
Outline of lecture
  • Review
  • What is feature reduction?
  • Why feature reduction?
  • Feature reduction algorithms
  • Principal Component Analysis
  • Nonlinear PCA using Kernels

34
Motivation
Linear projections will not detect the pattern.
35
Nonlinear PCA using Kernels
  • Traditional PCA applies a linear transformation
  • May not be effective for nonlinear data
  • Solution: apply a nonlinear transformation to a
    potentially very high-dimensional feature space.
  • Computational efficiency: apply the kernel trick.
  • Requires that PCA can be rewritten in terms of
    dot products.

More on kernels later
36
Nonlinear PCA using Kernels
  • Rewrite PCA in terms of dot products

The covariance matrix S can be written as
$S = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T$.
Let v be an eigenvector of S corresponding to a
nonzero eigenvalue $\lambda$:
$S v = \frac{1}{n} \sum_{i=1}^{n} x_i (x_i^T v) = \lambda v
\;\Rightarrow\; v = \frac{1}{n \lambda} \sum_{i=1}^{n} (x_i^T v)\, x_i.$
Eigenvectors of S lie in the space spanned by all
data points.
37
Nonlinear PCA using Kernels
The covariance matrix can be written in matrix
form as $S = \frac{1}{n} X X^T$, where $X = [x_1, x_2, \dots, x_n]$,
and each eigenvector can be written as $v = X \alpha$
for some coefficient vector $\alpha \in \mathbb{R}^n$.
Any benefits?
38
Nonlinear PCA using Kernels
  • Next consider the feature space: $x_i \mapsto \phi(x_i)$,
    with $\Phi = [\phi(x_1), \dots, \phi(x_n)]$

The (i,j)-th entry of $\Phi^T \Phi$
is $\phi(x_i)^T \phi(x_j)$.
Apply the kernel trick: $K_{ij} = k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.
K is called the kernel matrix.
39
Nonlinear PCA using Kernels
  • Projection of a test point x onto v:
    $\phi(x)^T v = \sum_{i=1}^{n} \alpha_i\, \phi(x)^T \phi(x_i)
    = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$

Explicit mapping $\phi$ is not required here.
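Putting slides 36-39 together, here is a compact kernel-PCA sketch in NumPy
(my own illustration; the RBF kernel and the gamma value are arbitrary choices,
and centering of the test-side kernel is omitted for brevity): form K, center
it in feature space, eigendecompose, rescale the coefficient vectors so that
||v|| = 1, and project new points using only kernel evaluations.

```python
# Kernel PCA: eigendecompose the centered kernel matrix, project via k(x, x_i).
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # A: n x p, B: m x p -> n x m matrix with entries exp(-gamma * ||a_i - b_j||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_pca(Xtrain, d, gamma=1.0):
    n = Xtrain.shape[0]
    K = rbf_kernel(Xtrain, Xtrain, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one              # center K in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:d]                   # top-d nonzero eigenvalues
    alphas = eigvecs[:, order] / np.sqrt(eigvals[order])    # rescale so ||v|| = 1
    return alphas

def kpca_project(x, alphas, Xtrain, gamma=1.0):
    # Projection of test point(s) x onto each v: sum_i alpha_i * k(x, x_i)
    return rbf_kernel(np.atleast_2d(x), Xtrain, gamma) @ alphas

rng = np.random.default_rng(7)
Xtr = rng.normal(size=(30, 2))
alphas = kernel_pca(Xtr, d=2, gamma=0.5)
print(kpca_project(Xtr[:3], alphas, Xtr, gamma=0.5).shape)   # (3, 2)
```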
40
Discussions
  • Relationship between PCA and factor analysis
  • Probabilistic PCA
  • Mixtures of probabilistic PCA
  • Reference: Tipping and Bishop. Probabilistic
    Principal Component Analysis.

(Graphical model: an observed variable generated from a
latent, unobserved variable.)
41
Reference
  • Principal Component Analysis. I.T. Jolliffe.
  • Kernel Principal Component Analysis. Schölkopf,
    et al.
  • Geometric Methods for Feature Extraction and
    Dimensional Reduction. Burges.

42
Next class
  • Topics
  • Review of Linear Discriminant Analysis
  • Canonical Correlation Analysis
  • Readings
  • Canonical correlation analysis: An overview with
    application to learning methods
  • http://citeseer.ist.psu.edu/583962.html
  • The Geometry of Kernel Canonical Correlation
    Analysis
  • http://www.kyb.tuebingen.mpg.de/publications/pdfs/pdf2233.pdf