Principal Component Analysis - PowerPoint PPT Presentation

1
Principal Component Analysis
  • Biosystems Data Analysis

2
From molecule to networks
Protein network of SRD5A2
Yeast metabolic network of Glycolysis
3
Disease gene network
4
Biological data
Rows of the data matrix are samples; columns are
genes, proteins, or metabolites.
Sources of variation in the data:
  • Repeatability
  • Reproducibility
  • Biological variability
5
How to explore such networks
From the data matrix (samples × genes, proteins,
or metabolites) a correlation matrix between the
variables can be computed.
Results are specific for the selected
samples/situation.
6
Goals
  • If you measure multiple variables on an object,
    it can be important to analyze the measurements
    simultaneously.
  • Understand the most important tool in
    multivariate data analysis: Principal Component
    Analysis.

7
Multiple measurements
  • If there is a mutual relationship between two or
    more measurements, they are correlated.
  • Correlations can be strong or weak.

Weak: capabilities in sports and month of birth.
Strong: the mass of an object and the weight of
that object at the Earth's surface.
8
Correlation
  • Correlation occurs everywhere!
  • Example: mean height vs. age for a group of
    young children.
  • A strong linear relationship between height and
    age is seen.
  • For young children, height and age are
    correlated.

Moore, D.S. and McCabe G.P., Introduction to the
Practice of Statistics (1989).
9
Correlation in spectroscopy
  • Example: a pure compound is measured at two
    wavelengths (λ230 and λ265) over a range of
    concentrations.

    Conc. (mMol)   Intensity at 230 nm   Intensity at 265 nm
         5               0.166                 0.090
        10               0.332                 0.181
        15               0.498                 0.270
        20               0.664                 0.362
        25               0.831                 0.453

[Figure: absorbance spectra, 200-300 nm, with the
two measurement wavelengths λ230 and λ265 marked.]
10
Correlation in spectroscopy
  • The intensities at λ230 and λ265 are highly
    correlated.
  • The data is not two-dimensional, but
    one-dimensional.
  • There is only one factor underlying the data:
    concentration.

[Figure: absorbance at 265 nm plotted against
absorbance at 230 nm; the points fall on a
straight line.]
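As a sketch, both the strong correlation and the one-dimensionality can be checked numerically; the values below are the ones tabulated on the previous slide:

```python
import numpy as np

# Intensities from the slide: a pure compound at 230 nm and 265 nm
# for concentrations 5..25 mMol.
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])
X = np.column_stack([a230, a265])

# Correlation between the two wavelengths is essentially 1.
r = np.corrcoef(a230, a265)[0, 1]
print(f"correlation: {r:.4f}")

# PCA on the mean-centred data: the first component carries nearly
# all the variance, so the data is effectively one-dimensional.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(f"variance explained by PC1: {explained[0]:.4f}")
```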
11
The data matrix
  • Information often comes in the form of a matrix

(rows = objects, columns = variables)
  • For example:
  • Spectroscopy: sample × wavelength
  • Proteomics: patient × protein

12
Large amounts of data
  • In (bio)chemical analysis, the measured data
    matrices can be very large.
  • An infrared spectrum measured for 50 samples
    gives a data matrix of size 50 × 800 = 40,000
    numbers!
  • The metabolome of 100 patients yields a data
    matrix of size 100 × 1000 = 100,000 numbers.
  • We need a way of extracting the important
    information from large data matrices.

13
Principal Component Analysis
  • Data reduction
  • PCA reduces large data matrices into two smaller
    matrices which can be more easily examined,
    plotted and interpreted.
  • Data exploration
  • PCA extracts the most important factors
    (principal components) from the data. These
    factors describe multivariate interactions
    between the measured variables.
  • Data understanding
  • Principal components can be used to classify
    samples, identify compound spectra, determine
    biomarkers, etc.

14
Different views of PCA
  • Statistically, PCA is a multivariate analysis
    technique closely related to
  • eigenvector analysis
  • singular value decomposition (SVD)
  • In matrix terms, PCA is a decomposition of X into
    two smaller matrices plus a set of residuals:
    X = TPᵀ + E.
  • Geometrically, PCA is a projection technique in
    which X is projected onto a subspace of reduced
    dimensions.

15
PCA mathematics
  • The basic equation for PCA is written as
    X = TPᵀ + E
  • where
  • X (I × J) is the data matrix,
  • T (I × R) are the scores,
  • P (J × R) are the loadings and
  • E (I × J) are the residuals.
  • R is the number of principal components used to
    describe X.

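The decomposition above can be sketched with an SVD-based PCA in numpy; the matrix sizes below are illustrative toy values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, R = 20, 5, 2                  # objects, variables, components
X = rng.normal(size=(I, J))
X = X - X.mean(axis=0)              # mean-centre first

# SVD-based PCA: X = U S Vt; scores T = U S, loadings P = V
U, S, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :R] * S[:R]                # scores,    I x R
P = Vt[:R].T                        # loadings,  J x R
E = X - T @ P.T                     # residuals, I x J

print(T.shape, P.shape, E.shape)
```

By construction, X is exactly recovered as TPᵀ + E for any choice of R.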
16
Principal components
  • A principal component is defined by one pair of
    scores and loadings, (t_r, p_r), and is sometimes
    also known as a latent variable.
  • Principal components describe maximum variance
    and are calculated in order of importance:
    var(t_1) ≥ var(t_2) ≥ var(t_3), and so on, until
    100% of the variance is described.
17
PCA matrices
[Figure: the data matrix X drawn as the product of
a scores matrix T and a loadings matrix Pᵀ, plus
residuals.]
18
Scores and loadings
  • Scores
  • relationships between objects
  • orthogonal: TᵀT = diagonal matrix
  • Loadings
  • relationships between variables
  • orthonormal: PᵀP = identity matrix, I
  • Similarities and differences between objects (or
    variables) can be seen by plotting scores (or
    loadings) against each other.

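These orthogonality properties can be verified numerically; a minimal sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
X -= X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
T = U * S          # all scores
P = Vt.T           # all loadings

# T'T is diagonal (the scores are orthogonal) ...
TtT = T.T @ T
off_diag = TtT - np.diag(np.diag(TtT))
print(np.max(np.abs(off_diag)))      # numerically zero

# ... and P'P is the identity (the loadings are orthonormal).
print(np.allclose(P.T @ P, np.eye(P.shape[1])))
```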
19
Numbers example
20
PCA simple projection
  • Simplest case two correlated variables
  • PC1 describes 99.77% of the total variation in X.
  • PC2 describes the residual variation (0.23%).

21
PCA projections
  • PCA is a projection technique.
  • Each row of the data matrix X (I × J) can be
    considered as a point in J-dimensional space.
    This data is projected orthogonally onto a
    subspace of lower dimensionality.
  • In the previous example, we projected the
    two-dimensional data onto a one-dimensional
    space, i.e. onto a line.
  • Now we will project some J-dimensional data onto
    a two-dimensional space, i.e. onto a plane.

22






23
Example Protein data
  • Protein consumption across Europe was studied.
  • 9 variables describe different sources of
    protein.
  • 25 objects are the different countries.
  • Data matrix has dimensions 25 ? 9.
  • Which countries are similar?
  • Which foods are related to red meat consumption?

Weber, A., Agrarpolitik im Spannungsfeld der
internationalen Ernaehrungspolitik, Institut fuer
Agrarpolitik und Marktlehre, Kiel (1973).
24
25
PCA on the protein data
  • The data is mean-centred and each variable is
    scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model

 PC   Eigenvalue    % Variance    % Variance
      of Cov(X)     This PC       Total
 ---  ----------    ----------    ----------
  1   4.01e000        44.52         44.52
  2   1.63e000        18.17         62.68
  3   1.13e000        12.53         75.22
  4   9.55e-001       10.61         85.82
  5   4.64e-001        5.15         90.98
  6   3.25e-001        3.61         94.59
  7   2.72e-001        3.02         97.61
  8   1.16e-001        1.29         98.90
  9   9.91e-002        1.10        100.00

How many principal components do you want to keep? 4
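The quantities in the table (eigenvalues of the covariance matrix, percent variance per PC and cumulative) can be computed as sketched below. Random stand-in data is used, since the Weber protein matrix itself is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the 25 x 9 protein matrix (hypothetical values)
X = rng.normal(size=(25, 9))

# Autoscale: mean-centre and scale each variable to unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvalues of the covariance matrix give the variance per PC
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
pct = 100 * eigvals / eigvals.sum()
for r, (ev, p, c) in enumerate(zip(eigvals, pct, np.cumsum(pct)), 1):
    print(f"PC{r}: eigenvalue {ev:.3f}  this PC {p:5.2f}%  total {c:6.2f}%")
```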
26
Scores PC1 vs PC2
27
Loadings
28
Biplot PC1 vs PC2
29
Biplot PC1 vs PC3
30
Residuals
  • It is also important to look at the model
    residuals, E.
  • Ideally, the residuals will not contain any
    structure - just unsystematic variation (noise).

31
Residuals
  • The (squared) model residuals can be summed along
    the object or variable direction

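A sketch of the two summations, assuming residuals E from a two-component model on toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
X -= X.mean(axis=0)

# Residuals after a 2-component PCA model
U, S, Vt = np.linalg.svd(X, full_matrices=False)
R = 2
E = X - (U[:, :R] * S[:R]) @ Vt[:R]

q_objects = np.sum(E**2, axis=1)    # one value per object (row)
q_variables = np.sum(E**2, axis=0)  # one value per variable (column)
print(q_objects.shape, q_variables.shape)
```

Unusually large values in either direction flag objects or variables that the model fits poorly.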
32
Centering and scaling
  • We are often interested in the differences
    between objects, not in their absolute values.
  • protein data: differences between countries
  • If different variables are measured in different
    units, some scaling is needed to give each
    variable an equal chance of contributing to the
    model.

33
Mean-centering
  • Subtract the mean from each column of X.

[Example: column means before centering (10840 and
36.75) become 0.0 after.]
34
Scaling
  • Divide each column of X by its standard
    deviation.

[Example: column standard deviations before
scaling (704.8 and 1.139) become 1.0 after.]
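Both steps in one numpy sketch; the column scales roughly mirror the example values on these slides:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two variables on very different scales (roughly ~10840 and ~36.75)
X = np.column_stack([rng.normal(10840, 700, size=50),
                     rng.normal(36.75, 1.1, size=50)])

Xc = X - X.mean(axis=0)            # mean-centering: column means -> 0
Z = Xc / Xc.std(axis=0, ddof=1)    # scaling: column std devs -> 1

print(np.round(Z.mean(axis=0), 8))
print(np.round(Z.std(axis=0, ddof=1), 8))
```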
35
How many PCs to use?
X = TPᵀ + E, where TPᵀ describes the systematic
variation and E the noise.
  • Too few PCs
  • some systematic variation is not described.
  • model does not fully summarise the data.
  • Too many PCs
  • latter PCs describe noise.
  • model is not robust when applied to new data.
  • How to select the correct number of PCs?

36
How many PCs to use?
  • Eigenvalue (scree) plots: look for a knee
    (here, at 4 PCs).
  • Select components whose explained variance is
    greater than the noise level.
  • Look at PC scores and loadings - do they make
    sense?! Do residuals have structure?
  • Cross-validation

37
Cross-validation
  • Remove a subset of the data - the test set.
  • Build the model on the remaining data - the
    training set.
  • Project the test set onto the model and
    calculate the residuals.
  • Repeat for the next test set.
  • Repeat for R = 1, 2, 3, ...

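The steps above can be sketched as naive row-wise cross-validation. Note that with this simple scheme PRESS decreases monotonically as R grows, because each test object is projected onto an ever larger subspace; practical PCA cross-validation uses more refined (e.g. element-wise) schemes:

```python
import numpy as np

def press_curve(X, max_r, n_folds=5):
    """Row-wise cross-validation for PCA: leave out a block of objects,
    fit PCA on the rest, project the test block onto the loadings, and
    accumulate the squared residuals per number of components."""
    folds = np.array_split(np.arange(X.shape[0]), n_folds)
    press = np.zeros(max_r)
    for test_idx in folds:
        train = np.delete(X, test_idx, axis=0)
        mu = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
        Xt = X[test_idx] - mu
        for r in range(1, max_r + 1):
            P = Vt[:r].T                 # J x r loadings from training set
            E = Xt - (Xt @ P) @ P.T      # residuals of projected test set
            press[r - 1] += np.sum(E**2)
    return press

rng = np.random.default_rng(5)
# Rank-2 structure plus noise: PRESS should level off after 2 components
X = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 8)) \
    + 0.1 * rng.normal(size=(40, 8))
press = press_curve(X, 5)
print(np.round(press, 2))
```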
38
PRESS plot
[Figure: (a) eigenvalues of Cov(X) and (b)
PRESS(r), each plotted against the number of
latent variables, 1-8.]
39
Outliers
  • Outliers are objects which are very different
    from the rest of the data. These can have a large
    effect on the principal component model and
    should be removed.

[Figure: score plot with one point far from the
rest, marked "bad experiment".]
40
Outliers
  • Outliers can also be found in the model space or
    in the residuals.

41
Model extrapolation can be dangerous!
42
Conclusions
  • Principal component analysis (PCA) reduces large,
    collinear matrices into two smaller matrices -
    scores and loadings
  • Principal components
  • describe the important variation in the data.
  • are calculated in order of importance.
  • are orthogonal.

43
Conclusions
  • Scores plots and biplots can be useful for
    exploring and understanding the data.
  • It is often appropriate to mean-center and scale
    the variables prior to analysis.
  • It is important to include the correct number of
    PCs in the PCA model. One method for determining
    this is called cross-validation.