1
Computational Functional Genomics
Lecture 15: Genomic data-mining method 3 - dimension
reduction in unsupervised learning
  • Yang Dai
  • BioE 594

2
Introduction
  • Problems with high-dimensional data: the curse
    of dimensionality
  • Running time
  • Many methods have at least O(nd²) complexity
    (running time), where n is the number of samples
    and d is the dimension; see the arithmetic below
  • Overfitting
  • Number of samples required
  • Dimensionality reduction methods
  • Principal Component Analysis (PCA) for
    unsupervised learning
  • Fisher Linear Discriminant (FLD) for supervised
    learning
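To make the running-time point concrete, here is a rough illustration using the leukemia example that appears later in these slides (34 patients, 8973 genes):
\[
n d^2 = 34 \times 8973^2 \approx 2.7 \times 10^9 ,
\]
whereas after reducing to s = 2 components the same term becomes \( n s^2 = 34 \times 2^2 = 136 \).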

3
Principal Component Analysis - 1
  • Main idea: seek a projection onto a lower-
    dimensional space that best represents the data
    in the sense of least-squared error
  • After the data are projected onto the best line,
    transform the coordinate system to get a 1D
    representation, the vector y
  • The new data y have the same variance as the old
    data x in the direction of the chosen line
  • PCA preserves the largest variances in the data

Large projection errors: a bad line to project onto
Small projection errors: a good line to project onto
4
Principal Component Analysis - 2
  • Consider a set of d-dimensional points,
    x1, ..., xn. We want a one-dimensional
    representation by projecting the data onto a line
    running through the mean of the points.
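In symbols, a standard way to write this (and the notation used in the sketches below) is
\[
m = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad x = m + a\,w, \qquad \|w\| = 1,
\]
where m is the sample mean, w is a unit vector giving the direction of the line, and the scalar a locates a point along the line.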

5
Principal Component Analysis - 3
  • If we represent each xi by m + ai w, then we want
    to find an optimal set of coefficients ai
    (i = 1, ..., n) and the direction w that minimize
    the least-sum-of-squared-error criterion function
    \[
    J_1(a_1, \ldots, a_n, w) = \sum_{i=1}^{n} \big\| (m + a_i w) - x_i \big\|^2
    \]
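Minimizing J1 over each ai separately (differentiate with respect to ai and set to zero; a standard step, sketched here) gives, for a unit vector w,
\[
\frac{\partial J_1}{\partial a_i} = 2 a_i - 2\, w^{\top}(x_i - m) = 0
\quad\Longrightarrow\quad
a_i = w^{\top}(x_i - m),
\]
that is, the optimal coefficient is the projection of the centered point xi - m onto the direction w.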

6
Principal Component Analysis - 4
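Substituting the optimal coefficients ai = w^T(xi - m) back into J1 (a standard step, sketched here; it connects the criterion above to the scatter-matrix objective on the next slide) gives
\[
J_1(w) = -\, w^{\top} S\, w + \sum_{i=1}^{n} \|x_i - m\|^2,
\qquad
S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^{\top},
\]
where S is the scatter matrix. The second term does not depend on w, so minimizing J1 is equivalent to maximizing w^T S w over unit vectors w.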
7
Principal Component Analysis - 5
  • We consider maximizing w^T S w subject to the
    constraint that w is a unit vector
  • Solve this problem by the Lagrange multiplier
    method (the Lagrangian is written out below)
  • Taking λ as the largest eigenvalue of the scatter
    matrix S will maximize the objective function.
    The optimal solution w is the eigenvector
    corresponding to the largest eigenvalue λ.
  • To find the best one-dimensional projection (in
    the sense of least sum-of-squared error) we
    project the data onto a line through the sample
    mean in the direction of the eigenvector
    corresponding to the largest eigenvalue of the
    scatter matrix.
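Written out, the Lagrange multiplier argument referenced above is
\[
\max_{w}\; w^{\top} S w \;\;\text{subject to}\;\; w^{\top} w = 1,
\qquad
L(w, \lambda) = w^{\top} S w - \lambda\,(w^{\top} w - 1),
\]
\[
\frac{\partial L}{\partial w} = 2 S w - 2 \lambda w = 0
\;\Longrightarrow\;
S w = \lambda w,
\]
so every stationary point is an eigenvector of S, and the objective there equals w^T S w = λ, which is largest for the eigenvector belonging to the largest eigenvalue.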

8
Principal Component Analysis - 6
  • Extension to an s-dimensional projection (s < d)
  • Then the criterion function
    \[
    J_s = \sum_{i=1}^{n} \Big\| \Big( m + \sum_{k=1}^{s} a_{ki} w_k \Big) - x_i \Big\|^2
    \]
  • should be minimized. By generalizing the
    result of the one-dimensional case, Js is
    minimized when the wk (k = 1, ..., s) are the
    eigenvectors corresponding to the first s largest
    eigenvalues of the scatter matrix.
  • The wk (k = 1, ..., s) form a natural set of basis
    vectors for representing any vector x (written out
    below)
  • Although PCA finds components that are useful for
    representing data, there is no reason to assume
    that these components must be useful for
    discriminating between data in different classes
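Written out, the basis-vector statement above says, in the same notation as before,
\[
x \;\approx\; m + \sum_{k=1}^{s} a_k w_k,
\qquad
a_k = w_k^{\top}(x - m),
\]
so each data point is represented by its s coordinates (a1, ..., as), the principal component scores, in the orthonormal basis w1, ..., ws.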

9
Principal Component Analysis - 7
  • The variance of the kth coordinate ak = wk^T(x - m)
    is proportional to the kth eigenvalue of S
  • First, the mean of ak over the sample is zero
  • The variance of the kth coordinate is then
    proportional to wk^T S wk = λk (see the derivation
    below)
  • Therefore PCA finds the components C1,
    C2, ..., Cs that explain the maximum amount of
    variance possible with s linearly transformed
    components.
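The derivation referenced above, in the notation of the earlier slides:
\[
\bar{a}_k = \frac{1}{n}\sum_{i=1}^{n} w_k^{\top}(x_i - m)
          = w_k^{\top}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i - m\Big) = 0,
\]
\[
\operatorname{Var}(a_k) = \frac{1}{n}\sum_{i=1}^{n} \big(w_k^{\top}(x_i - m)\big)^2
                        = \frac{1}{n}\, w_k^{\top} S\, w_k
                        = \frac{\lambda_k}{n},
\]
so the variance along the kth principal direction is proportional to the kth eigenvalue λk.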

10
Principal Component Analysis - 8
  • The extraction of principal components amounts to
    a variance-maximizing rotation of the original
    variable space.
  • For example, in a two-variable scatter plot the
    first principal axis can be thought of as the
    original X axis rotated so that it approximates
    the regression line.
  • This type of rotation is called variance
    maximizing because the criterion for the rotation
    is to maximize the variance (variability) of the
    "new" variable (factor), while minimizing the
    variance around the new variable.

11
Principal Component Analysis - 9
  • The basic goal in PCA is to reduce the dimension
    of the data. Thus one usually chooses sltltd.
  • Such a reduction in dimension has important
    benefits
  • the computational overhead of the subsequent
    processing stages is reduced
  • noise may be reduced, as the data not contained
    in the first few components may be mostly due to
    noise
  • a projection into a subspace of a very low
    dimension, for example two, is useful for
    visualizing the data.
  • It is not necessary to use the s principal
    components themselves
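One practical way to choose s is to look at the proportion of variance explained by each component. A minimal sketch in R, using the crabs data from the later slides (the cutoff is an arbitrary choice, not part of the original slides):

    library(MASS)
    data(crabs)
    pp <- prcomp(crabs[, -c(1, 2, 3)])     # drop the sp, sex, index columns
    summary(pp)                            # proportion of variance per component
    cumsum(pp$sdev^2) / sum(pp$sdev^2)     # cumulative proportion: pick s at a cutoff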

12
Principal Component Analysis- Example
  • First principal component (PC1)
  • the direction along which there is the greatest
    variation
  • Second principal component (PC2)
  • the direction with the maximum variation left in
    the data, orthogonal to the first PC

13
PCA on all Genes (Leukemia data, precursor B and T)
Plot of 34 patients, dimension of 8973 genes
reduced to 2
14
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients
reduced to 2
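A minimal sketch of how plots like these could be produced with prcomp, assuming a hypothetical matrix expr of expression values with 34 patient rows and 8973 gene columns (the matrix name and layout are assumptions, not from the slides):

    # expr: hypothetical 34 x 8973 matrix (rows = patients, columns = genes)
    pp.patients <- prcomp(expr)                    # each patient becomes a point
    plot(pp.patients$x[, 1], pp.patients$x[, 2],
         xlab = "PC.1", ylab = "PC.2")             # 34 patients in 2 dimensions

    pp.genes <- prcomp(t(expr))                    # transpose: each gene becomes a point
    plot(pp.genes$x[, 1], pp.genes$x[, 2],
         xlab = "PC.1", ylab = "PC.2")             # 8973 genes in 2 dimensions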
15
Drawbacks of Principal Component Analysis
  • PCA was designed for accurate data
    representation, not for data classification
  • It preserves as much variance in the data as
    possible
  • If the directions of maximum variance are
    important for classification, PCA will work
  • However, the directions of maximum variance may be
    useless for classification (a small synthetic
    example is sketched below)

(Figure: classes separable in the original space, not separable after
the PCA projection; the slide also illustrates applying PCA to each
class separately.)
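A small synthetic illustration of this drawback (not from the slides): two classes that differ only along a low-variance direction, so the first principal component carries no class information.

    set.seed(1)
    n <- 100
    x <- rnorm(2 * n, sd = 10)                      # large variance, no class signal
    y <- c(rnorm(n, mean = -1, sd = 0.5),           # small variance, all the class signal
           rnorm(n, mean =  1, sd = 0.5))
    class <- rep(c("A", "B"), each = n)
    dat <- cbind(x, y)

    pp <- prcomp(dat)
    # PC1 follows the high-variance x direction, so the classes overlap in PC1
    boxplot(pp$x[, 1] ~ class, main = "PC1 scores by class (overlap)")
    boxplot(dat[, 2] ~ class,  main = "Original y by class (separated)")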
16
Principal Component Analysis in R
  • The MASS package includes a dataset derived from
    Campbell, N.A. and Mahon, R.J. (1974) A
    multivariate study of variation in two species of
    rock crab of genus Leptograpsus. Australian
    Journal of Zoology, 22, 417-425.
  • R code
  • > library(MASS)
  • > data(crabs)
  • > pp <- prcomp(crabs[, -c(1, 2, 3)])
  • > pp
  • # plot first two principal components
  • > plot(pp$x[, 1], pp$x[, 2],
    col = ifelse(crabs$sp == "O", "orange", "blue"),
    pch = 16, xlab = "PC.1", ylab = "PC.2")
  • > attach(crabs)
  • > library(scatterplot3d)
  • # plot first three principal components in 3D
  • > scatterplot3d(pp$x[, 1], pp$x[, 2], pp$x[, 3],
    color = ifelse(crabs$sp == "O", "orange", "blue"),
    pch = ifelse(crabs$sex == "M", 1, 15))
  • # plot all principal components pairwise
  • > pairs(pp$x,
    col = ifelse(crabs$sp == "O", "orange", "blue"),
    pch = ifelse(crabs$sex == "M", 1, 16))

17
Principal Component Analysis in R
Principal components analysis of the crab data: the first two principal
components, the first three PCs in a 3D display, and all PCs plotted
pairwise.