Title: Principal Component Analysis
1. Principal Component Analysis
2. From molecule to networks
- Protein network of SRD5A2
- Yeast metabolic network of Glycolysis
3. Disease gene network
4. Biological data
- DATA matrix: samples × (genes, proteins or metabolites)
- Repeatability
- Reproducibility
- Biological variability
5. How to explore such networks
- From the DATA matrix (samples × genes/proteins/metabolites), compute a correlation matrix between the genes, proteins or metabolites.
- Results are specific for the selected samples/situation.
6. Goals
- If you measure multiple variables on an object it can be important to analyze the measurements simultaneously.
- Understand the most important tool in multivariate data analysis: Principal Component Analysis.
7. Multiple measurements
- If there is a mutual relationship between two or more measurements, they are correlated.
- There are strong correlations and weak correlations:
- capabilities in sports and month of birth (weak)
- mass of an object and its weight at the Earth's surface (strong)
8. Correlation
- Correlation occurs everywhere!
- Example: mean height vs. age of a group of young children.
- A strong linear relationship between height and age is seen.
- For young children, height and age are correlated.
Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics (1989).
9. Correlation in spectroscopy
- Example: a pure compound is measured at two wavelengths (λ230 and λ265) over a range of concentrations.

Conc. (mMol)   Intensity at 230 nm   Intensity at 265 nm
     5               0.166                 0.090
    10               0.332                 0.181
    15               0.498                 0.270
    20               0.664                 0.362
    25               0.831                 0.453

[Figure: absorbance spectra, absorbance (units) vs. wavelength (200-300 nm), at the five concentrations.]
10. Correlation in spectroscopy
- The intensities at λ230 and λ265 are highly correlated.
- The data is not two-dimensional, but one-dimensional.
- There is only one factor underlying the data: concentration.
[Figure: scatter plot of absorbance at 265 nm (units) vs. absorbance at 230 nm (units).]
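As a quick check, the correlation between the two wavelengths can be computed from the tabulated intensities; a minimal NumPy sketch:

```python
import numpy as np

# Intensities from the table (concentrations 5-25 mMol)
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])

# Pearson correlation between the two wavelengths
r = np.corrcoef(a230, a265)[0, 1]
print(f"correlation: {r:.4f}")
```

The correlation is essentially 1: knowing one intensity determines the other, which is what makes the data effectively one-dimensional.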
11. The data matrix
- Information often comes in the form of a matrix: objects (rows) × variables (columns).
- For example,
- Spectroscopy: sample × wavelength
- Proteomics: patient × protein
12. Large amounts of data
- In (bio)chemical analysis, the measured data matrices can be very large.
- An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers!
- The metabolome of 100 patients yields a data matrix of size 100 × 1000 = 100,000 numbers.
- We need a way of extracting the important information from large data matrices.
13. Principal Component Analysis
- Data reduction
- PCA reduces large data matrices into two smaller
matrices which can be more easily examined,
plotted and interpreted.
- Data exploration
- PCA extracts the most important factors
(principal components) from the data. These
factors describe multivariate interactions
between the measured variables.
- Data understanding
- Principal components can be used to classify
samples, identify compound spectra, determine
biomarkers, etc.
14. Different views of PCA
- Statistically, PCA is a multivariate analysis technique closely related to
- eigenvector analysis
- singular value decomposition (SVD)
- In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: X = TP^T + E.
- Geometrically, PCA is a projection technique in
which X is projected onto a subspace of reduced
dimensions.
15. PCA mathematics
- The basic equation for PCA is written as X = TP^T + E
- where
- X (I × J) is a data matrix,
- T (I × R) are the scores,
- P (J × R) are the loadings and
- E (I × J) are the residuals.
- R is the number of principal components used to describe X.
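This decomposition can be computed via the SVD mentioned on the previous slide; a minimal sketch, with illustrative matrix sizes and random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # data matrix: I = 20 objects, J = 5 variables
Xc = X - X.mean(axis=0)            # mean-centre each column

# SVD: Xc = U S V^T; scores T = U S, loadings P = V
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
R = 2                              # number of principal components kept
T = U[:, :R] * s[:R]               # scores (I x R)
P = Vt[:R].T                       # loadings (J x R)
E = Xc - T @ P.T                   # residuals (I x J)
```

By construction the three pieces add back up to the (centred) data: X = TP^T + E.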
16. Principal components
- A principal component is defined by one pair of loadings and scores, (t_r, p_r), sometimes also known as a latent variable.
- Principal components describe maximum variance and are calculated in order of importance: PC1 first, then PC2, and so on, up to 100% of the variance.
17. PCA matrices
[Figure: X decomposed into scores T and loadings P^T, plus residuals E.]
18. Scores and loadings
- Scores
- relationships between objects
- orthogonal: T^T T is a diagonal matrix
- Loadings
- relationships between variables
- orthonormal: P^T P is the identity matrix, I
- Similarities and differences between objects (or
variables) can be seen by plotting scores (or
loadings) against each other.
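These orthogonality properties can be verified numerically; a sketch assuming an SVD-based PCA on random centred data:

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.standard_normal((30, 6))
Xc -= Xc.mean(axis=0)              # mean-centre each column

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s      # scores (all components)
P = Vt.T       # loadings (all components)

TtT = T.T @ T  # diagonal: scores are orthogonal
PtP = P.T @ P  # identity: loadings are orthonormal
```

The diagonal of T^T T holds the (unnormalised) variance captured by each component, which is why components can be ranked by importance.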
19. Numbers example
20. PCA simple projection
- Simplest case: two correlated variables.
- PC1 describes 99.77% of the total variation in X.
- PC2 describes the residual variation (0.23%).
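A sketch of this two-variable case on simulated correlated data; the exact percentages depend on the simulated noise level, but PC1 dominates in the same way:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 100)
# two strongly correlated variables: the second is 2x plus a little noise
X = np.column_stack([x, 2.0 * x + 0.01 * rng.standard_normal(100)])
Xc = X - X.mean(axis=0)

# squared singular values are proportional to the variance along each PC
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = 100.0 * s**2 / np.sum(s**2)   # percent variance per PC
```

Here `explained[0]` exceeds 99%: the two-dimensional data is effectively one-dimensional, as on the spectroscopy slides.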
21. PCA projections
- PCA is a projection technique.
- Each row of the data matrix X (I × J) can be
considered as a point in J-dimensional space.
This data is projected orthogonally onto a
subspace of lower dimensionality.
- In the previous example, we projected the
two-dimensional data onto a one-dimensional
space, i.e. onto a line.
- Now we will project some J-dimensional data onto
a two-dimensional space, i.e. onto a plane.
22. (No Transcript)
23. Example: Protein data
- Protein consumption across Europe was studied.
- 9 variables describe different sources of protein.
- 25 objects are the different countries.
- Data matrix has dimensions 25 × 9.
- Which countries are similar?
- Which foods are related to red meat consumption?
Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und Marktlehre, Kiel (1973).
24. (No Transcript)
25. PCA on the protein data
- The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model

PC   Eigenvalue of Cov(X)   % Variance (this PC)   % Variance (total)
 1        4.01e+00                 44.52                 44.52
 2        1.63e+00                 18.17                 62.68
 3        1.13e+00                 12.53                 75.22
 4        9.55e-01                 10.61                 85.82
 5        4.64e-01                  5.15                 90.98
 6        3.25e-01                  3.61                 94.59
 7        2.72e-01                  3.02                 97.61
 8        1.16e-01                  1.29                 98.90
 9        9.91e-02                  1.10                100.00

How many principal components do you want to keep? 4
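The percentage columns follow directly from the eigenvalues; a sketch using the (rounded) eigenvalues listed above, so the results agree with the table up to rounding:

```python
import numpy as np

# eigenvalues of Cov(X) for the autoscaled protein data, as listed
eigvals = np.array([4.01, 1.63, 1.13, 0.955, 0.464,
                    0.325, 0.272, 0.116, 0.0991])

pct = 100.0 * eigvals / eigvals.sum()   # variance captured per PC
cum = np.cumsum(pct)                    # cumulative variance captured
```

For autoscaled data the eigenvalues sum to (approximately) the number of variables, 9, and the first four components capture about 85.8% of the variance.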
26. Scores: PC1 vs PC2
27. Loadings
28. Biplot: PC1 vs PC2
29. Biplot: PC1 vs PC3
30. Residuals
- It is also important to look at the model
residuals, E.
- Ideally, the residuals will not contain any
structure - just unsystematic variation (noise).
31. Residuals
- The (squared) model residuals can be summed along
the object or variable direction
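A sketch of both sums, assuming an SVD-based 4-component model on random data of the same size as the protein example:

```python
import numpy as np

rng = np.random.default_rng(3)
Xc = rng.standard_normal((25, 9))
Xc -= Xc.mean(axis=0)              # mean-centre each column

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
R = 4
E = Xc - (U[:, :R] * s[:R]) @ Vt[:R]   # residuals after an R-component model

q_obj = np.sum(E**2, axis=1)   # sum over variables: one value per object
q_var = np.sum(E**2, axis=0)   # sum over objects: one value per variable
```

A large per-object residual flags an object the model describes poorly; a large per-variable residual flags a variable the model describes poorly.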
32. Centering and scaling
- We are often interested in the differences between objects, not in their absolute values.
- protein data: differences between countries
- If different variables are measured in different
units, some scaling is needed to give each
variable an equal chance of contributing to the
model.
33. Mean-centering
- Subtract the mean from each column of X.
- Example: column means before centering (e.g. 10840 and 36.75) become 0.0 after centering.
34. Scaling
- Divide each column of X by its standard deviation.
- Example: column standard deviations before scaling (e.g. 704.8 and 1.139) become 1.0 after scaling.
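The two steps together are often called autoscaling; a minimal sketch, where the column means and spreads are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
# columns with very different means and spreads (illustrative values)
X = rng.standard_normal((10, 3)) * np.array([704.8, 1.139, 5.0]) \
    + np.array([10840.0, 36.75, 1.0])

# autoscale: mean-centre, then divide by the column standard deviation
Xauto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```

After autoscaling every column has mean 0 and standard deviation 1, so each variable gets an equal chance of contributing to the model.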
35. How many PCs to use?
X = TP^T + E, where TP^T contains the systematic variation and E the noise.
- Too few PCs
- some systematic variation is not described.
- model does not fully summarise the data.
- Too many PCs
- latter PCs describe noise.
- model is not robust when applied to new data.
- How to select the correct number of PCs?
36. How many PCs to use?
[Figure: scree plot of eigenvalues; knee at component 4 - select 4 PCs.]
- Select components where explained variance > noise level.
- Look at PC scores and loadings - do they make
sense?! Do residuals have structure?
37. Cross-validation
- Remove subset of the data - test set.
- Build model on remaining data - training set.
- Project test set onto model - calculate residuals.
- Repeat for next test set.
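The steps above can be sketched as a simple leave-one-out scheme, projecting each left-out object onto loadings fitted without it (a naive row-wise variant; practical PCA cross-validation schemes often leave out individual elements instead of whole rows):

```python
import numpy as np

def press(X, R):
    """Leave-one-out PRESS for an R-component PCA model (naive row-wise scheme)."""
    total = 0.0
    for i in range(X.shape[0]):
        train = np.delete(X, i, axis=0)          # training set: all rows but i
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        P = Vt[:R].T                             # loadings from the training set
        x = X[i] - mean                          # centred test row
        e = x - P @ (P.T @ x)                    # residual after projection onto model
        total += np.sum(e**2)
    return total

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 6))
# PRESS shrinks as components capture systematic variation,
# then levels off (or rises) once the extra components fit only noise
```

Plotting `press(X, R)` against R gives the PRESS curve shown on the next slide.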
38. PRESS plot
[Figure: two panels versus latent variable number (1-8): (a) eigenvalue of Cov(X), (b) PRESS.]
39. Outliers
- Outliers are objects which are very different from the rest of the data. These can have a large effect on the principal component model and should be removed.
[Figure: scores plot with an outlier from a bad experiment.]
40. Outliers
- Outliers can also be found in the model space or
in the residuals.
41. Model extrapolation can be dangerous!
42. Conclusions
- Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices - scores and loadings.
- Principal components
- describe the important variation in the data.
- are calculated in order of importance.
- are orthogonal.
43. Conclusions
- Scores plots and biplots can be useful for exploring and understanding the data.
- It is often appropriate to mean-center and scale the variables prior to analysis.
- It is important to include the correct number of
PCs in the PCA model. One method for determining
this is called cross-validation.