1
Computational Functional Genomics
Lecture 15: Genomic data-mining method 3 - dimension
reduction in unsupervised learning
  • Yang Dai
  • BioE 594

2
Introduction
  • Problems with high-dimensional data: the curse
    of dimensionality
  • Running time
  • Many methods have at least O(nd²) complexity
    (running time), where n is the number of samples
    and d is the dimension; see the arithmetic below
  • Overfitting
  • Number of samples required
  • Dimensionality reduction methods
  • Principal Component Analysis (PCA) for
    unsupervised learning
  • Fisher Linear Discriminant (FLD) for supervised
    learning
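To make the running-time point concrete, here is a rough illustration using the leukemia example that appears later in these slides (34 patients, 8973 genes):
\[
n d^2 = 34 \times 8973^2 \approx 2.7 \times 10^9 ,
\]
whereas after reducing to s = 2 components the same term becomes \( n s^2 = 34 \times 2^2 = 136 \).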

3
Principal Component Analysis - 1
  • Main idea: seek a projection onto a lower-
    dimensional space that best represents the data
    in the sense of least-squared error
  • After the data are projected onto the best line,
    transform the coordinate system to get a 1D
    representation, the vector y
  • The new data y have the same variance as the old
    data x in the direction of the chosen line
  • PCA preserves the largest variances in the data

Large projection errors: a bad line to project onto
Small projection errors: a good line to project onto
4
Principal Component Analysis - 2
  • Consider a set of d-dimensional points,
    x1, ..., xn. We want a one-dimensional
    representation by projecting the data onto a line
    running through the mean of the points.
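In symbols, a standard way to write this (and the notation used in the sketches below) is
\[
m = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad x = m + a\,w, \qquad \|w\| = 1,
\]
where m is the sample mean, w is a unit vector giving the direction of the line, and the scalar a locates a point along the line.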

5
Principal Component Analysis - 3
  • If we represent each xi by m + ai w, then we want
    to find an optimal set of coefficients ai
    (i = 1, ..., n) and the direction w that minimize
    the least-sum-of-squared-error criterion function
    \[
    J_1(a_1, \ldots, a_n, w) = \sum_{i=1}^{n} \big\| (m + a_i w) - x_i \big\|^2
    \]
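Minimizing J1 over each ai separately (differentiate with respect to ai and set to zero; a standard step, sketched here) gives, for a unit vector w,
\[
\frac{\partial J_1}{\partial a_i} = 2 a_i - 2\, w^{\top}(x_i - m) = 0
\quad\Longrightarrow\quad
a_i = w^{\top}(x_i - m),
\]
that is, the optimal coefficient is the projection of the centered point xi - m onto the direction w.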

6
Principal Component Analysis - 4
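Substituting the optimal coefficients ai = w^T(xi - m) back into J1 (a standard step, sketched here; it connects the criterion above to the scatter-matrix objective on the next slide) gives
\[
J_1(w) = -\, w^{\top} S\, w + \sum_{i=1}^{n} \|x_i - m\|^2,
\qquad
S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^{\top},
\]
where S is the scatter matrix. The second term does not depend on w, so minimizing J1 is equivalent to maximizing w^T S w over unit vectors w.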
7
Principal Component Analysis - 5
  • We consider maximizing w^T S w subject to the
    constraint that w is a unit vector
  • Solve this problem by the Lagrange multiplier
    method (the Lagrangian is written out below)
  • Taking λ as the largest eigenvalue of the scatter
    matrix S will maximize the objective function.
    The optimal solution w is the eigenvector
    corresponding to the largest eigenvalue λ.
  • To find the best one-dimensional projection (in
    the sense of least sum-of-squared error) we
    project the data onto a line through the sample
    mean in the direction of the eigenvector
    corresponding to the largest eigenvalue of the
    scatter matrix.
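Written out, the Lagrange multiplier argument referenced above is
\[
\max_{w}\; w^{\top} S w \;\;\text{subject to}\;\; w^{\top} w = 1,
\qquad
L(w, \lambda) = w^{\top} S w - \lambda\,(w^{\top} w - 1),
\]
\[
\frac{\partial L}{\partial w} = 2 S w - 2 \lambda w = 0
\;\Longrightarrow\;
S w = \lambda w,
\]
so every stationary point is an eigenvector of S, and the objective there equals w^T S w = λ, which is largest for the eigenvector belonging to the largest eigenvalue.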

8
Principal Component Analysis - 6
  • Extension to an s-dimensional projection (s < d)
  • Then the criterion function
    \[
    J_s = \sum_{i=1}^{n} \Big\| \Big( m + \sum_{k=1}^{s} a_{ki} w_k \Big) - x_i \Big\|^2
    \]
  • should be minimized. By generalizing the
    result of the one-dimensional case, Js is
    minimized when the wk (k = 1, ..., s) are the
    eigenvectors corresponding to the first s largest
    eigenvalues of the scatter matrix.
  • The wk (k = 1, ..., s) form a natural set of basis
    vectors for representing any vector x (written out
    below)
  • Although PCA finds components that are useful for
    representing data, there is no reason to assume
    that these components must be useful for
    discriminating between data in different classes
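Written out, the basis-vector statement above says, in the same notation as before,
\[
x \;\approx\; m + \sum_{k=1}^{s} a_k w_k,
\qquad
a_k = w_k^{\top}(x - m),
\]
so each data point is represented by its s coordinates (a1, ..., as), the principal component scores, in the orthonormal basis w1, ..., ws.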

9
Principal Component Analysis - 7
  • The variance of the kth coordinate ak = wk^T(x - m)
    is proportional to the kth eigenvalue of S
  • First, the mean of ak over the sample is zero
  • The variance of the kth coordinate is then
    proportional to wk^T S wk = λk (see the derivation
    below)
  • Therefore PCA finds the components C1,
    C2, ..., Cs that explain the maximum amount of
    variance possible with s linearly transformed
    components.
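The derivation referenced above, in the notation of the earlier slides:
\[
\bar{a}_k = \frac{1}{n}\sum_{i=1}^{n} w_k^{\top}(x_i - m)
          = w_k^{\top}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i - m\Big) = 0,
\]
\[
\operatorname{Var}(a_k) = \frac{1}{n}\sum_{i=1}^{n} \big(w_k^{\top}(x_i - m)\big)^2
                        = \frac{1}{n}\, w_k^{\top} S\, w_k
                        = \frac{\lambda_k}{n},
\]
so the variance along the kth principal direction is proportional to the kth eigenvalue λk.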

10
Principal Component Analysis - 8
  • The extraction of principal components amounts to
    a variance-maximizing rotation of the original
    variable space.
  • For example, in a two-variable scatter plot the
    first principal axis can be thought of as the
    original X axis rotated so that it approximates
    the regression line.
  • This type of rotation is called variance
    maximizing because the criterion for the rotation
    is to maximize the variance (variability) of the
    "new" variable (factor), while minimizing the
    variance around the new variable.

11
Principal Component Analysis - 9
  • The basic goal in PCA is to reduce the dimension
    of the data. Thus one usually chooses sltltd.
  • Such a reduction in dimension has important
    benefits
  • the computational overhead of the subsequent
    processing stages is reduced
  • noise may be reduced, as the data not contained
    in the first few components may be mostly due to
    noise
  • a projection into a subspace of a very low
    dimension, for example two, is useful for
    visualizing the data.
  • It is not necessary to use the s principal
    components themselves
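One practical way to choose s is to look at the proportion of variance explained by each component. A minimal sketch in R, using the crabs data from the later slides (the cutoff is an arbitrary choice, not part of the original slides):

    library(MASS)
    data(crabs)
    pp <- prcomp(crabs[, -c(1, 2, 3)])     # drop the sp, sex, index columns
    summary(pp)                            # proportion of variance per component
    cumsum(pp$sdev^2) / sum(pp$sdev^2)     # cumulative proportion: pick s at a cutoff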

12
Principal Component Analysis- Example
  • First principal component (PC1)
  • the direction along which there is the greatest
    variation
  • Second principal component (PC2)
  • the direction with the maximum variation left in
    the data, orthogonal to the first PC

13
PCA on all Genes (Leukemia data, precursor B and T)
Plot of 34 patients, dimension of 8973 genes
reduced to 2
14
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients
reduced to 2
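A minimal sketch of how plots like these could be produced with prcomp, assuming a hypothetical matrix expr of expression values with 34 patient rows and 8973 gene columns (the matrix name and layout are assumptions, not from the slides):

    # expr: hypothetical 34 x 8973 matrix (rows = patients, columns = genes)
    pp.patients <- prcomp(expr)                    # each patient becomes a point
    plot(pp.patients$x[, 1], pp.patients$x[, 2],
         xlab = "PC.1", ylab = "PC.2")             # 34 patients in 2 dimensions

    pp.genes <- prcomp(t(expr))                    # transpose: each gene becomes a point
    plot(pp.genes$x[, 1], pp.genes$x[, 2],
         xlab = "PC.1", ylab = "PC.2")             # 8973 genes in 2 dimensions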
15
Drawbacks of Principal Component Analysis
  • PCA was designed for accurate data
    representation, not for data classification
  • It preserves as much variance in the data as
    possible
  • If the directions of maximum variance are
    important for classification, PCA will work
  • However, the directions of maximum variance may be
    useless for classification (a small synthetic
    example is sketched below)

(Figure: classes separable in the original space, not separable after
the PCA projection; the slide also illustrates applying PCA to each
class separately.)
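A small synthetic illustration of this drawback (not from the slides): two classes that differ only along a low-variance direction, so the first principal component carries no class information.

    set.seed(1)
    n <- 100
    x <- rnorm(2 * n, sd = 10)                      # large variance, no class signal
    y <- c(rnorm(n, mean = -1, sd = 0.5),           # small variance, all the class signal
           rnorm(n, mean =  1, sd = 0.5))
    class <- rep(c("A", "B"), each = n)
    dat <- cbind(x, y)

    pp <- prcomp(dat)
    # PC1 follows the high-variance x direction, so the classes overlap in PC1
    boxplot(pp$x[, 1] ~ class, main = "PC1 scores by class (overlap)")
    boxplot(dat[, 2] ~ class,  main = "Original y by class (separated)")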
16
Principal Component Analysis in R
  • The MASS package includes a dataset derived from
    Campbell, N.A. and Mahon, R.J. (1974) A
    multivariate study of variation in two species of
    rock crab of genus Leptograpsus. Australian
    Journal of Zoology, 22, 417-425.
  • R code
  • > library(MASS)
  • > data(crabs)
  • > pp <- prcomp(crabs[, -c(1, 2, 3)])
  • > pp
  • # plot first two principal components
  • > plot(pp$x[, 1], pp$x[, 2],
    col = ifelse(crabs$sp == "O", "orange", "blue"),
    pch = 16, xlab = "PC.1", ylab = "PC.2")
  • > attach(crabs)
  • > library(scatterplot3d)
  • # plot first three principal components in 3D
  • > scatterplot3d(pp$x[, 1], pp$x[, 2], pp$x[, 3],
    color = ifelse(crabs$sp == "O", "orange", "blue"),
    pch = ifelse(crabs$sex == "M", 1, 15))
  • # plot all principal components pairwise
  • > pairs(pp$x,
    col = ifelse(crabs$sp == "O", "orange", "blue"),
    pch = ifelse(crabs$sex == "M", 1, 16))

17
Principal Component Analysis in R
Principal components analysis of the crab data: the first two principal
components, the first three PCs in a 3D display, and all PCs plotted
pairwise.