Yang Dai
1 Computational Functional Genomics, Lecture 15: Genomic data-mining method 3 - dimension reduction in unsupervised learning
2 Introduction
- Problems with high-dimensional data: the curse of dimensionality
  - Running time: many methods have at least O(nd^2) complexity, where n is the number of samples and d is the dimension
  - Overfitting
  - The number of samples required grows rapidly with the dimension
- Dimensionality reduction methods
  - Principal Component Analysis (PCA) for unsupervised learning
  - Fisher Linear Discriminant (FLD) for supervised learning
3 Principal Component Analysis - 1
- Main idea: seek a projection onto a lower-dimensional space that best represents the data in the sense of least-squared error (see the sketch below)
- After the data are projected onto the best line, transform the coordinate system to obtain a one-dimensional representation y
- The new data y have the same variance as the original data x in the direction of the chosen line
- PCA preserves the largest variances in the data
(Figure: a line with large projection errors is a bad line to project onto; a line with small projection errors is a good one.)
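As a small numerical illustration of the least-squares idea (not from the slides; the simulated data and the helper proj_error are made up for this sketch), the sum of squared projection errors is smaller for the first principal direction than for an arbitrary one:

# Illustrative sketch: squared projection error onto a good vs. a bad line.
set.seed(1)
x1 <- rnorm(100)
x  <- cbind(x1, 0.5 * x1 + rnorm(100, sd = 0.3))   # correlated 2D data
xc <- scale(x, center = TRUE, scale = FALSE)       # centered data

proj_error <- function(xc, w) {
  w <- w / sqrt(sum(w^2))            # unit direction
  a <- xc %*% w                      # coordinates along the line
  sum((xc - a %*% t(w))^2)           # sum of squared residuals off the line
}

proj_error(xc, prcomp(x)$rotation[, 1])   # small error: a good line
proj_error(xc, c(1, -1))                  # larger error: a bad line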
4 Principal Component Analysis - 2
- Consider a set of d-dimensional points $x_1, \ldots, x_n$. We want a one-dimensional representation obtained by projecting the data onto a line running through the mean of the points.
5 Principal Component Analysis - 3
- If we represent each $x_i$ by $m + a_i w$, where $m$ is the sample mean and $w$ is a unit vector along the line, then we want to find the optimal set of coefficients $a_i$ ($i = 1, \ldots, n$) and the direction $w$ that minimize the least-sum-of-squared-error criterion function
  $$J_1(a_1, \ldots, a_n, w) = \sum_{i=1}^{n} \big\| (m + a_i w) - x_i \big\|^2 .$$
6 Principal Component Analysis - 4
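A sketch of the intermediate step implied here, assuming the criterion $J_1$ from the previous slide and a unit-length direction $w$:
- Setting $\partial J_1 / \partial a_i = 0$ gives $a_i = w^{T}(x_i - m)$, i.e. the optimal coefficient is the projection of $x_i - m$ onto the line.
- Substituting back,
  $$J_1(w) = -\,w^{T} S\, w + \sum_{i=1}^{n} \|x_i - m\|^2, \qquad S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^{T},$$
  where $S$ is the scatter matrix. The second term does not depend on $w$, so minimizing $J_1$ is equivalent to maximizing $w^{T} S w$ over unit vectors $w$.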
7 Principal Component Analysis - 5
- We consider the equivalent problem: maximize $w^{T} S w$ subject to $\|w\|^2 = 1$.
- Solve this problem by the Lagrange multiplier method.
- Taking λ as the largest eigenvalue of the scatter matrix S maximizes the objective function. The optimal solution w is the eigenvector corresponding to the largest eigenvalue λ.
- To find the best one-dimensional projection (in the sense of least sum-of-squared error), we project the data onto a line through the sample mean in the direction of the eigenvector corresponding to the largest eigenvalue of the scatter matrix (see the sketch below).
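A minimal R sketch of this result on simulated data (the names X, Xc, S, and w are only for illustration): the top eigenvector of the scatter matrix matches prcomp's first loading vector up to sign.

# Illustrative sketch: the best 1D direction is the top eigenvector of S.
set.seed(2)
X  <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25), 5, 5)  # toy data
m  <- colMeans(X)
Xc <- sweep(X, 2, m)                     # centered data
S  <- t(Xc) %*% Xc                       # scatter matrix
w  <- eigen(S)$vectors[, 1]              # eigenvector of the largest eigenvalue

cbind(w, prcomp(X)$rotation[, 1])        # same direction, up to sign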
8 Principal Component Analysis - 6
- Extension to an s-dimensional projection (s < d): represent each x by $m + \sum_{k=1}^{s} a_k w_k$.
- Then the criterion
  $$J_s = \sum_{i=1}^{n} \Big\| \Big(m + \sum_{k=1}^{s} a_{ik} w_k\Big) - x_i \Big\|^2$$
  should be minimized. By generalizing the result of the one-dimensional case, J_s is minimized when the w_k (k = 1, ..., s) are the eigenvectors corresponding to the s largest eigenvalues of the scatter matrix (see the sketch below).
- The w_k (k = 1, ..., s) form a natural set of basis vectors for representing any vector x.
- Although PCA finds components that are useful for representing the data, there is no reason to assume that these components are useful for discriminating between data in different classes.
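A minimal sketch of the s-dimensional case on simulated data (the function Js below is only for illustration): the residual J_s decreases as more eigenvectors are kept and reaches zero at s = d.

# Illustrative sketch: project onto the first s eigenvectors and reconstruct.
set.seed(3)
X  <- matrix(rnorm(100 * 6), ncol = 6) %*% matrix(runif(36), 6, 6)
Xc <- sweep(X, 2, colMeans(X))           # centered data
W  <- eigen(t(Xc) %*% Xc)$vectors        # eigenvectors, sorted by eigenvalue

Js <- function(s) {
  A    <- Xc %*% W[, 1:s, drop = FALSE]            # coefficients a_ik
  Xhat <- A %*% t(W[, 1:s, drop = FALSE])          # reconstruction from s PCs
  sum((Xc - Xhat)^2)                               # least-sum-of-squared error
}
sapply(1:6, Js)                          # decreasing, ~0 at s = 6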
9 Principal Component Analysis - 7
- The variance of the kth coordinate $a_k = w_k^{T}(x - m)$ is proportional to the kth eigenvalue of S (a numerical check follows below).
  - First, the mean of the kth coordinate over the samples is zero:
    $$\frac{1}{n}\sum_{i=1}^{n} a_{ik} = \frac{1}{n}\sum_{i=1}^{n} w_k^{T}(x_i - m) = 0 .$$
  - The variance of the kth coordinate is
    $$\frac{1}{n}\sum_{i=1}^{n} a_{ik}^2 = \frac{1}{n}\sum_{i=1}^{n} \big(w_k^{T}(x_i - m)\big)^2 = \frac{1}{n}\, w_k^{T} S\, w_k = \frac{\lambda_k}{n} .$$
- Therefore, PCA finds the components C_1, C_2, ..., C_s that explain the maximum amount of variance possible by s linearly transformed components.
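A quick numerical check on simulated data (illustrative only): the sample variances of the PC coordinates returned by prcomp equal the eigenvalues of the scatter matrix divided by n - 1, i.e. they are proportional to the eigenvalues.

# Illustrative sketch: Var(kth coordinate) is proportional to the kth eigenvalue.
set.seed(4)
X  <- matrix(rnorm(50 * 4), ncol = 4) %*% matrix(runif(16), 4, 4)
n  <- nrow(X)
Xc <- scale(X, center = TRUE, scale = FALSE)
S  <- t(Xc) %*% Xc                       # scatter matrix
pp <- prcomp(X)

rbind(apply(pp$x, 2, var),               # variances of the PC coordinates
      eigen(S)$values / (n - 1))         # eigenvalues of S, rescaled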
10 Principal Component Analysis - 8
- The extraction of principal components amounts to a variance-maximizing rotation of the original variable space.
- For example, in a scatter plot we can think of the first principal axis as the original X axis rotated so that it approximates the regression line, i.e. the direction of greatest spread.
- This type of rotation is called variance maximizing because the criterion for the rotation is to maximize the variance (variability) of the "new" variable (factor), while minimizing the variance around the new variable.
11 Principal Component Analysis - 9
- The basic goal of PCA is to reduce the dimension of the data; thus one usually chooses s << d (see the sketch below).
- Such a reduction in dimension has important benefits:
  - the computational overhead of the subsequent processing stages is reduced
  - noise may be reduced, as the variation not contained in the first few components may be mostly due to noise
  - a projection into a subspace of very low dimension, for example two, is useful for visualizing the data
- It is not necessary to use the s principal components themselves.
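A minimal sketch of how one might choose s in practice (using the built-in iris measurements as stand-in data, not the slides' data): look at the cumulative proportion of variance explained.

# Illustrative sketch: choose s from the variance explained by the first s PCs.
pp <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pp)                      # "Cumulative Proportion" row guides the choice of s
screeplot(pp, type = "lines")    # scree plot of the component variances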
12 Principal Component Analysis - Example
- First principal component (PC1): the direction along which there is the greatest variation
- Second principal component (PC2): the direction with the maximum variation left in the data, orthogonal to the first PC
13 PCA on all genes (leukemia data, precursor B and T)
(Figure: plot of the 34 patients, with the dimension of 8973 genes reduced to 2.)
14 PCA of genes (leukemia data)
(Figure: plot of the 8973 genes, with the dimension of 34 patients reduced to 2.)
15 Drawbacks of Principal Component Analysis
- PCA was designed for accurate data representation, not for data classification
- It preserves as much variance in the data as possible
- If the directions of maximum variance are important for classification, PCA will work
- However, the directions of maximum variance may be useless for classification (illustrated in the sketch below)
(Figure: two classes that are separable in the original space become non-separable after projection onto the direction of maximum variance; a second panel applies PCA to each class separately.)
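A minimal simulated example of this drawback (class layout and variances made up for illustration): the two classes differ only along a low-variance axis, so projecting onto the first PC mixes them.

# Illustrative sketch: maximum-variance direction useless for classification.
set.seed(5)
x1  <- rnorm(200, sd = 5)                          # large variance, shared by both classes
x2  <- c(rnorm(100, mean = -1, sd = 0.3),          # class 1
         rnorm(100, mean =  1, sd = 0.3))          # class 2
X   <- cbind(x1, x2)
cls <- rep(1:2, each = 100)

pc1 <- prcomp(X)$x[, 1]          # 1D PCA projection (follows the x1 axis)
boxplot(pc1 ~ cls)               # heavy overlap: classes not separable
boxplot(x2 ~ cls)                # the discarded direction separates them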
16 Principal Component Analysis in R
- The MASS package includes a dataset derived from Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology, 22, 417-425.
- R code:
> library(MASS)
> data(crabs)
> pp <- prcomp(crabs[, -c(1, 2, 3)])
> pp
# plot the first two principal components
> plot(pp$x[, 1], pp$x[, 2], col = ifelse(crabs$sp == "O", "orange", "blue"),
       pch = 16, xlab = "PC.1", ylab = "PC.2")
> attach(crabs)
> library(scatterplot3d)
# plot the first three principal components in 3D
> scatterplot3d(pp$x[, 1], pp$x[, 2], pp$x[, 3],
       color = ifelse(crabs$sp == "O", "orange", "blue"),
       pch = ifelse(crabs$sex == "M", 1, 15))
# plot all principal components pairwise
> pairs(pp$x, col = ifelse(crabs$sp == "O", "orange", "blue"),
       pch = ifelse(crabs$sex == "M", 1, 16))
17 Principal Component Analysis in R
(Figures: principal components analysis of the crab data - the first two principal components, the first three PCs in a 3D display, and all PCs plotted pairwise.)