Principal Component Analysis - PowerPoint PPT Presentation

Slides: 44
Provided by: Gur7
1
Principal Component Analysis
  • Biosystems Data Analysis

2
From molecule to networks
Protein network of SRD5A2
Yeast metabolic network of Glycolysis
3
Disease gene network
4
Biological data
[Figure: DATA matrix with genes, proteins or metabolites along one direction and samples along the other; example entry 5.31]
Repeatability
Reproducibility
Biological variability
5
How to explore such networks
[Figure: the DATA matrix (samples by genes, proteins or metabolites) is converted into a correlation matrix between the genes, proteins or metabolites]
Results are specific for the selected samples/situation.
6
Goals
  • If you measure multiple variables on an object, it
    can be important to analyze the measurements
    simultaneously.
  • Understand the most important tool in
    multivariate data analysis: Principal Component
    Analysis.

7
Multiple measurements
  • If there is a mutual relationship between two or
    more measurements, they are correlated.
  • There are strong correlations and weak
    correlations.

Weak: capabilities in sports and month of birth
Strong: mass of an object and the weight of that object
on the earth's surface
8
Correlation
  • Correlation occurs everywhere!
  • Example: mean height vs. age of a group of young
    children
  • A strong linear relationship between height and
    age is seen.
  • For young children, height and age are
    correlated.

Moore, D.S. and McCabe G.P., Introduction to the
Practice of Statistics (1989).
9
Correlation in spectroscopy
  • Example: a pure compound is measured at two
    wavelengths (λ230 and λ265) over a range of
    concentrations.

Intensity at 230nm   Intensity at 265nm   Conc. (mMol)
0.166                0.090                 5
0.332                0.181                10
0.498                0.270                15
0.664                0.362                20
0.831                0.453                25

[Figure: spectra, absorbance (units, 0 to 0.9) against wavelength (200 to 300 nm)]
10
Correlation in spectroscopy
  • The intensities at λ230 and λ265 are highly
    correlated.
  • The data is not two-dimensional, but
    one-dimensional.
  • There is only one factor underlying the data:
    concentration.

[Figure: scatter plot of absorbance at 265nm (units) against absorbance at 230nm (units)]
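
The one-dimensional structure can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using the five absorbance pairs listed on slide 9; the variable names are illustrative and not part of the original slides.

```python
import numpy as np

# Absorbances from the slide 9 table (concentrations 5, 10, 15, 20, 25)
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])

# Pearson correlation between the two wavelengths
r = np.corrcoef(a230, a265)[0, 1]

# Singular values of the raw (uncentred) 5 x 2 matrix: if a single factor
# (concentration) drives both columns, one singular value dominates.
X = np.column_stack([a230, a265])
s = np.linalg.svd(X, compute_uv=False)
share_first = s[0] ** 2 / np.sum(s ** 2)

print(f"correlation = {r:.5f}")
print(f"share of sum of squares in first component = {share_first:.5f}")
```

Both numbers come out very close to 1, which is the numerical counterpart of the points lying on a single line through the origin.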
11
The data matrix
  • Information often comes in the form of a matrix:
    objects (rows) by variables (columns).
  • For example:
  • Spectroscopy: sample × wavelength
  • Proteomics: patient × protein
12
Large amounts of data
  • In (bio)chemical analysis, the measured data
    matrices can be very large.
  • An infrared spectrum measured for 50 samples
    gives a data matrix of size 50 × 800 = 40,000
    numbers!
  • The metabolome of 100 patients yields a data
    matrix of size 100 × 1000 = 100,000 numbers.
  • We need a way of extracting the important
    information from large data matrices.

13
Principal Component Analysis
  • Data reduction
  • PCA reduces large data matrices into two smaller
    matrices which can be more easily examined,
    plotted and interpreted.
  • Data exploration
  • PCA extracts the most important factors
    (principal components) from the data. These
    factors describe multivariate interactions
    between the measured variables.
  • Data understanding
  • Principal components can be used to classify
    samples, identify compound spectra, determine
    biomarkers, etc.

14
Different views of PCA
  • Statistically, PCA is a multivariate analysis
    technique closely related to
  • eigenvector analysis
  • singular value decomposition (SVD)
  • In matrix terms, PCA is a decomposition of X into
    two smaller matrices plus a set of residuals:
    $X = TP^{T} + E$
  • Geometrically, PCA is a projection technique in
    which X is projected onto a subspace of reduced
    dimensions.
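
The link between the eigenvector and SVD views can be illustrated with a small numerical check. This is a sketch on synthetic data, assuming NumPy; the matrix, its column scales and the variable names are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 20 objects, 4 variables with clearly different spreads
X = rng.normal(size=(20, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)               # mean-centre the columns

# Route 1: eigen-decomposition of the covariance matrix
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]     # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = s ** 2 / (Xc.shape[0] - 1)

# Same eigenvalues; loading vectors agree up to the sign of each column
same_vals = np.allclose(eigvals, eig_from_svd)
same_vecs = np.allclose(np.abs(eigvecs), np.abs(Vt.T))
print(same_vals, same_vecs)
```

The sign of each loading vector is arbitrary, which is why the comparison uses absolute values.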

15
PCA mathematics
  • The basic equation for PCA is written as
    $$X = TP^{T} + E$$
  • where
  • X (I × J) is a data matrix,
  • T (I × R) are the scores,
  • P (J × R) are the loadings and
  • E (I × J) are the residuals.
  • R is the number of principal components used to
    describe X.
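
As a concrete illustration of these dimensions, here is a minimal sketch that computes T, P and E through an SVD. It assumes NumPy; the helper name `pca` and the random test matrix are hypothetical and only serve the example.

```python
import numpy as np

def pca(X, R):
    """Return scores T (I x R), loadings P (J x R) and residuals E (I x J)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :R] * s[:R]      # scores
    P = Vt[:R].T              # loadings
    E = X - T @ P.T           # residuals
    return T, P, E

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 6))          # synthetic data with I = 10, J = 6
X = X - X.mean(axis=0)                # mean-centred before the decomposition

T, P, E = pca(X, R=2)
print(T.shape, P.shape, E.shape)      # (10, 2) (6, 2) (10, 6)
```

By construction, adding the residuals back to the product of scores and transposed loadings reproduces X exactly.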

16
Principal components
  • A principal component is defined by one pair of
    loadings and scores, sometimes also known as a
    latent variable.
  • Principal components describe maximum variance
    and are calculated in order of importance: the
    first describes the most variance, the second
    the next most, and so on... up to 100%.
17
PCA matrices
[Figure: diagram of the PCA matrices, with the loadings and the scores labelled]
18
Scores and loadings
  • Scores
  • relationships between objects
  • orthogonal: $T^{T}T$ is a diagonal matrix
  • Loadings
  • relationships between variables
  • orthonormal: $P^{T}P = I$, the identity matrix
  • Similarities and differences between objects (or
    variables) can be seen by plotting scores (or
    loadings) against each other.
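
The two orthogonality properties can be verified directly. The sketch below assumes NumPy and an SVD-based PCA on synthetic, mean-centred data; none of the names come from the original slides.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 5))          # synthetic data: 15 objects, 5 variables
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 3
T = U[:, :R] * s[:R]                  # scores
P = Vt[:R].T                          # loadings

TtT = T.T @ T     # should be diagonal, with s[:R]**2 on the diagonal
PtP = P.T @ P     # should be the identity matrix

scores_orthogonal = np.allclose(TtT, np.diag(s[:R] ** 2))
loadings_orthonormal = np.allclose(PtP, np.eye(R))
print(scores_orthogonal, loadings_orthonormal)
```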

19
Numbers example
20
PCA simple projection
  • Simplest case: two correlated variables
  • PC1 describes 99.77% of the total variation in X.
  • PC2 describes residual variation (0.23%).

21
PCA projections
  • PCA is a projection technique.
  • Each row of the data matrix X (I × J) can be
    considered as a point in J-dimensional space.
    This data is projected orthogonally onto a
    subspace of lower dimensionality.
  • In the previous example, we projected the
    two-dimensional data onto a one-dimensional
    space, i.e. onto a line.
  • Now we will project some J-dimensional data onto
    a two-dimensional space, i.e. onto a plane.
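
A projection onto a plane can be sketched as follows. The example assumes NumPy and uses invented three-dimensional points that lie close to a plane; it is an illustration of the idea, not the data shown in the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 3-dimensional points lying close to a plane (assumed example data)
coords = rng.normal(size=(30, 2))
basis = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
X = coords @ basis + 0.05 * rng.normal(size=(30, 3))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:2].T                  # two loading vectors spanning the plane
T = X @ P                     # coordinates of each point within the plane
X_proj = T @ P.T              # projected points, expressed in the original space
E = X - X_proj                # the part of each point the projection leaves out

# Orthogonal projection: the left-out part is perpendicular to the plane
perpendicular = np.allclose(E @ P, 0.0)
print(perpendicular)
```

The rows of T are the scores: they are what a PC1 vs PC2 scores plot displays.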

22
(No Transcript)
23
Example Protein data
  • Protein consumption across Europe was studied.
  • 9 variables describe different sources of
    protein.
  • 25 objects are the different countries.
  • Data matrix has dimensions 25 × 9.
  • Which countries are similar?
  • Which foods are related to red meat consumption?

Weber, A., Agrarpolitik im Spannungsfeld der
internationalen Ernaehrungspolitik, Institut fuer
Agrarpolitik und Marktlehre, Kiel (1973).
24
(No Transcript)
25
PCA on the protein data
  • The data is mean-centred and each variable is
    scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model

Principal    Eigenvalue    Variance    Variance
Component    of            Captured    Captured
Number       Cov(X)        This PC     Total
---------    ----------    --------    --------
    1        4.01e+000      44.52       44.52
    2        1.63e+000      18.17       62.68
    3        1.13e+000      12.53       75.22
    4        9.55e-001      10.61       85.82
    5        4.64e-001       5.15       90.98
    6        3.25e-001       3.61       94.59
    7        2.72e-001       3.02       97.61
    8        1.16e-001       1.29       98.90
    9        9.91e-002       1.10      100.00

How many principal components do you want to keep? 4
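
The percentage columns follow from the eigenvalues alone. The sketch below, assuming NumPy, recomputes them from the nine eigenvalues listed in the table; because those are rounded to three significant digits, the results can differ from the printed table in the second decimal.

```python
import numpy as np

# Eigenvalues of Cov(X) as listed in the table for the protein data
eigenvalues = np.array([4.01, 1.63, 1.13, 0.955, 0.464,
                        0.325, 0.272, 0.116, 0.0991])

percent = 100 * eigenvalues / eigenvalues.sum()   # variance captured by each PC
cumulative = np.cumsum(percent)                   # running total

for i, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{i}: {p:6.2f} %   cumulative {c:6.2f} %")
```

The eigenvalues sum to approximately 9, the number of variables, which is what scaling each variable to unit variance implies.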
26
Scores PC1 vs PC2
27
Loadings
28
Biplot PC1 vs PC2
29
Biplot PC1 vs PC3
30
Residuals
  • It is also important to look at the model
    residuals, E.
  • Ideally, the residuals will not contain any
    structure - just unsystematic variation (noise).

31
Residuals
  • The (squared) model residuals can be summed along
    the object or variable direction
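
A minimal sketch of the two summations, assuming NumPy and a synthetic data matrix; the names `ss_per_object` and `ss_per_variable` are illustrative and do not appear in the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 5))          # synthetic data: 12 objects, 5 variables
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 2
E = X - (U[:, :R] * s[:R]) @ Vt[:R]   # residuals after R components

ss_per_object = np.sum(E ** 2, axis=1)     # one value per row (object)
ss_per_variable = np.sum(E ** 2, axis=0)   # one value per column (variable)

print(ss_per_object.round(3))
print(ss_per_variable.round(3))
```

An object or a variable with an unusually large sum is poorly described by the model and deserves a closer look.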

32
Centering and scaling
  • We are often interested in the differences
    between objects, not in their absolute values.
  • Protein data: differences between countries
  • If different variables are measured in different
    units, some scaling is needed to give each
    variable an equal chance of contributing to the
    model.

33
Mean-centering
  • Subtract the mean from each column of X

[Figure: two example columns with means 10840 and 36.75 before centering, and 0.0 and 0.0 after centering]
34
Scaling
  • Divide each column of X by its standard deviation

[Figure: two example columns with standard deviations 704.8 and 1.139 before scaling, and 1.0 and 1.0 after scaling]
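
Mean-centering followed by scaling can be sketched in a few lines. The example assumes NumPy; the data values are hypothetical, chosen only so that the two variables sit on very different scales, and the use of the sample standard deviation is an assumption.

```python
import numpy as np

# Hypothetical data: two variables on very different scales
X = np.array([[10100.0, 35.2],
              [11500.0, 37.9],
              [10300.0, 36.1],
              [11200.0, 38.0]])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)      # sample standard deviation (an assumption;
                                 # some software divides by I instead of I - 1)

X_centred = X - mean             # every column now has mean 0
X_auto = X_centred / std         # every column now has standard deviation 1

print(X_auto.mean(axis=0).round(6))
print(X_auto.std(axis=0, ddof=1).round(6))
```

When the model is later applied to new samples, the mean and standard deviation stored from the original data would normally be reused rather than recomputed.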
35
How many PCs to use?
$X = TP^{T} + E$, where $TP^{T}$ is the systematic variation and $E$ is the noise
  • Too few PCs
  • some systematic variation is not described.
  • model does not fully summarise the data.
  • Too many PCs
  • later PCs describe noise.
  • model is not robust when applied to new data.
  • How to select the correct number of PCs?

36
How many PCs to use?
  • Eigenvalue plots

[Figure: eigenvalue plot, annotated "Knee here - select 4 PCs"]
  • Select components where explained variance >
    noise level
  • Look at PC scores and loadings - do they make
    sense?! Do residuals have structure?
  • Cross-validation

37
Cross-validation
  • Remove subset of the data - test set.
  • Build model on remaining data - training set.
  • Project test set onto model - calculate residuals.
  • Repeat for next test set.
  • Repeat for R = 1, 2, 3, ...
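
The steps above can be sketched as a small function that returns the predicted residual sum of squares (PRESS) for each number of components. This assumes NumPy, contiguous test subsets and synthetic data; the function name `press_curve` and its parameters are hypothetical, and the scheme is the plain row-wise one described in the bullets, not necessarily the one used to produce the plots in this presentation.

```python
import numpy as np

def press_curve(X, max_R, n_splits=4):
    """PRESS for R = 1..max_R using row-wise cross-validation."""
    I = X.shape[0]
    folds = np.array_split(np.arange(I), n_splits)
    press = np.zeros(max_R)
    for test_idx in folds:
        # Build the model on the training set only
        train = np.delete(X, test_idx, axis=0)
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        # Centre the test set with the training mean
        test = X[test_idx] - mean
        for R in range(1, max_R + 1):
            P = Vt[:R].T
            E = test - test @ P @ P.T      # residuals of the projected test set
            press[R - 1] += np.sum(E ** 2)
    return press

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 6))               # synthetic data: 20 objects, 6 variables
press = press_curve(X, max_R=5)
print(press.round(3))
```

Because each model's subspace contains the previous one, the PRESS from this simple scheme cannot increase with R. Schemes that are meant to show a minimum often also leave out part of the variables for each test object.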

38
PRESS plot
[Figure: two panels plotted against latent variable number (1 to 8): eigenvalue of Cov(X), and PRESS]
39
Outliers
  • Outliers are objects which are very different
    from the rest of the data. These can have a large
    effect on the principal component model and
    should be removed.

[Figure: an outlying object labelled "bad experiment"]
40
Outliers
  • Outliers can also be found in the model space or
    in the residuals.
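
Both kinds of outlier can be screened for with two per-object distances. The sketch below assumes NumPy and synthetic data with two planted objects (rows 0 and 1); the names `model_dist` and `resid_ss` are illustrative. Quantities of this kind are often reported as Hotelling's T² and Q, but that naming is not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic data: 25 objects, 5 variables with different spreads
X = rng.normal(size=(25, 5)) * np.array([3.0, 3.0, 1.0, 0.3, 0.3])
X[0] = [60.0, 0.0, 0.0, 0.0, 0.0]   # planted: extreme along a high-variance variable
X[1] = [0.0, 0.0, 0.0, 5.0, 0.0]    # planted: unusual in a low-variance variable
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 2
T = U[:, :R] * s[:R]                # scores
E = X - T @ Vt[:R]                  # residuals

# Distance within the model space: squared scores scaled by their variance
score_var = s[:R] ** 2 / (X.shape[0] - 1)
model_dist = np.sum(T ** 2 / score_var, axis=1)

# Distance to the model: residual sum of squares per object
resid_ss = np.sum(E ** 2, axis=1)

print("largest model-space distance: object", int(np.argmax(model_dist)))
print("largest residual: object", int(np.argmax(resid_ss)))
```

Row 0 stands out in the model space and row 1 in the residuals, so looking at only one of the two distances would miss one of the planted objects.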

41
Model extrapolation can be dangerous!
42
Conclusions
  • Principal component analysis (PCA) reduces large,
    collinear matrices into two smaller matrices:
    scores and loadings.
  • Principal components:
  • describe the important variation in the data.
  • are calculated in order of importance.
  • are orthogonal.

43
Conclusions
  • Scores plots and biplots can be useful for
    exploring and understanding the data.
  • It is often appropriate to mean-center and scale
    the variables prior to analysis.
  • It is important to include the correct number of
    PCs in the PCA model. One method for determining
    this is called cross-validation.