Title: Principal Component Analysis
1 Principal Component Analysis
2 Introduction
- Principal Component Analysis (PCA) is a
statistical technique that can be used for many
things, including data compression and shape
recognition.
- A fluorescence spectrum has a shape which PCA can
recognise.
3 Method of PCA, part 1: The Covariance Matrix
- It is important to understand a set of data in
terms of its dimensions and how they vary
together.
- For a set of 2-dimensional data it is important
to know whether high values in one dimension tend
to accompany high values in the other, or the
opposite, where high values in one dimension tend
to accompany low values in the other.
4 Method of PCA, part 1: The Covariance Matrix
- One way of telling whether one dimension of data
varies with another is to find the covariance
between the two dimensions.
- Covariance can be derived from the formula for
variance (the square of the standard deviation),
where X and Y are data dimensions.
It should be noted here that the first step in
finding the covariance is to subtract the mean of
the data in a dimension from each value in that
dimension. This mean-adjusted data set, called row
data adjust, will be used again later.
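For reference, a sketch of the standard sample definitions, assuming an n - 1 divisor (an assumption; a divisor of n is also common). The variance of X is the square of its standard deviation, and the covariance replaces one of the two (X_i - mean) factors with the corresponding factor for Y:

```latex
\operatorname{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1},
\qquad
\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
```

Here n is the number of values in each dimension, and the barred symbols are the means of X and Y.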
5 Method of PCA, part 1: The Covariance Matrix
- When the covariance between two dimensions is
positive, the values in those two dimensions
increase together; when it is negative, one
decreases as the other increases.
- When a data set comprises many dimensions, it is
useful to lay out the covariance values in a
matrix, that is, the covariance matrix.
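The steps so far (subtract the mean, compute each covariance, lay the values out in a matrix) can be sketched in plain Python. The function names and the three small data dimensions are made up for illustration, and the n - 1 divisor is an assumption.

```python
def mean_adjust(values):
    """Subtract the mean of a dimension from each of its values."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def covariance(x, y):
    """Sample covariance of two equal-length dimensions (n - 1 divisor)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two dimensions of equal length >= 2")
    x_adj, y_adj = mean_adjust(x), mean_adjust(y)
    return sum(a * b for a, b in zip(x_adj, y_adj)) / (len(x) - 1)


def covariance_matrix(dimensions):
    """Covariance of every pair of dimensions, laid out as a square matrix."""
    return [[covariance(p, q) for q in dimensions] for p in dimensions]


# Made-up illustrative data: y rises with x, z falls as x rises.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
z = [8.0, 6.0, 4.0, 2.0]

cov = covariance_matrix([x, y, z])
for row in cov:
    print(row)
```

In this sketch the x-y entry is positive and the x-z entry is negative, matching the two cases described above.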
6 Method of PCA, part 1: The Covariance Matrix
- From the definition of covariance, it should be
obvious that cov(X, Y) = cov(Y, X).
- A consequence of this is that the covariance
matrix (3 x 3 in the case of 3-dimensional data)
is always a symmetric square matrix with the
variances down the main diagonal.
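As an illustration, for 3-dimensional data with dimensions labelled x, y and z (the labels are an assumption made here), the covariance matrix can be written out as follows. The diagonal entries cov(x, x), cov(y, y) and cov(z, z) are the variances, and each off-diagonal entry equals its mirror image across the diagonal.

```latex
C = \begin{pmatrix}
\operatorname{cov}(x, x) & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\
\operatorname{cov}(y, x) & \operatorname{cov}(y, y) & \operatorname{cov}(y, z) \\
\operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{cov}(z, z)
\end{pmatrix}
```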
7 Method of PCA, part 2: Finding Principal Components
- The principal components of the data set are,
simply, the eigenvectors of the covariance
matrix.
- There are as many eigenvectors as there are
dimensions in the data set.
- The eigenvector with the highest corresponding
eigenvalue accounts for the most variance in
the original data set. This information will
correspond to large features in the shape of the
original data set.
[Figure: the original data and its first principal component]
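Finding the principal components as eigenvectors of the covariance matrix might be sketched with NumPy as below. The data values are made up for illustration, the one-dimension-per-row layout is an assumption, and np.linalg.eigh is one reasonable choice for a symmetric matrix rather than the only one.

```python
import numpy as np

# Made-up illustrative data: 3 dimensions (rows) by 6 observations (columns).
data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7],
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],
])

# Subtract each dimension's mean to form the row data adjust.
row_data_adjust = data - data.mean(axis=1, keepdims=True)

# np.cov treats each row as one dimension and divides by n - 1.
cov = np.cov(row_data_adjust)

# eigh is intended for symmetric matrices; it returns eigenvalues in
# ascending order with the matching eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
principal_components = eigenvectors[:, order].T  # one component per row

print("eigenvalues:", eigenvalues)
print("first principal component:", principal_components[0])
```

The sign of each eigenvector is arbitrary, so a component may come out negated relative to another implementation without changing its meaning.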
8 Method of PCA, part 2: Finding Principal Components
- As the value of the corresponding eigenvalue
falls, so too does the size of the features
represented by the eigenvector. In this specimen
set of results, principal components 1-5 are
shown for a set of data comprising 4 peaks of
random height added to a set of 200 points of
random noise. Notice that the peak heights fall
as the PC number rises; in fact, all of the
principal components above PC4 contain only noise.
[Figure: the original data and principal components PC1 to PC5]
9 Method of PCA, part 3: Data Reconstruction
- If the matrix dot product of a single principal
component and the row data adjust data set were
taken, then the output data set (called final
data) would have a similar shape to that of the
original data; however, it is likely that it
would be a poor representation of the original
data.
10 Method of PCA, part 3: Data Reconstruction
- If the dot product of all the principal
components lined up in a matrix (called a row
feature vector) and the row data adjust data
set were taken, no information would be lost, and
exactly the original data set could be returned.
- So, if we could choose which principal components
were put into the row feature vector, then it
would be possible to control the detail of the
data returned.
- In the previous example, it was clear that there
was no significant data relating to the peaks in
the input graph in any principal component above
number 4.
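The effect of choosing how many principal components go into the row feature vector can be sketched as follows. The helper names pca_components and reconstruct are hypothetical, and so are the data values. The step back to the original coordinates (multiplying by the transposed row feature vector and adding the means) is the usual way of returning data after projection and is assumed here.

```python
import numpy as np


def pca_components(data):
    """Return (means, principal components as rows, eigenvalues), sorted by
    descending eigenvalue. `data` holds one dimension per row."""
    means = data.mean(axis=1, keepdims=True)
    row_data_adjust = data - means
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(row_data_adjust))
    order = np.argsort(eigenvalues)[::-1]
    return means, eigenvectors[:, order].T, eigenvalues[order]


def reconstruct(data, n_components):
    """Project onto the first n_components, then transform back."""
    means, components, _ = pca_components(data)
    row_feature_vector = components[:n_components]
    final_data = row_feature_vector @ (data - means)
    return row_feature_vector.T @ final_data + means


# Made-up illustrative data: 3 dimensions (rows) by 6 observations (columns).
data = np.array([
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3],
    [2.4, 0.7, 2.9, 2.2, 3.0, 2.7],
    [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],
])

full = reconstruct(data, 3)    # all components: original data returned
approx = reconstruct(data, 1)  # first component only: coarse approximation

print("max error, all components:", np.abs(full - data).max())
print("max error, one component: ", np.abs(approx - data).max())
```

Keeping every component returns the original data to within rounding error, while dropping components trades detail for a simpler representation.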
11 PCA in Fluorescence Spectroscopy
- In the study of drug fluorescence, it is
necessary to study spectra similar to the
patterns of data shown in previous examples.
- These are created by shining a light from a
source onto a drug sample; the resulting spectrum
is then detected and received by a computer as
data.
- In these spectra there are likely to be large
unwanted shapes, caused by signals such as
reflection of light from the source, and noise
caused by a number of things such as electrical
interference and fluctuations in the output of
the source.
- Principal Component Analysis can remove these
unwanted shapes from spectra by careful creation
of the row feature vector mentioned earlier.
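A minimal sketch of this idea on synthetic data follows. Everything in it is an assumption made for illustration and is not measured fluorescence data: 50 spectra of 200 points, four Gaussian peaks of random height, the noise level, and the decision to keep exactly four components. Only the noise-removal case is shown. Removing a large unwanted shape would instead mean leaving the component that carries it out of the row feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_spectra, n_points = 50, 200
channel = np.arange(n_points)

# Hypothetical spectra: four Gaussian peaks with random heights.
centres = [30, 80, 120, 170]
peaks = np.array([np.exp(-0.5 * ((channel - c) / 5.0) ** 2) for c in centres])
heights = rng.uniform(0.5, 2.0, size=(n_spectra, len(centres)))

# One dimension (spectral point) per row, one spectrum per column.
clean = (heights @ peaks).T
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

# Row data adjust, covariance matrix and sorted eigenvectors.
means = noisy.mean(axis=1, keepdims=True)
row_data_adjust = noisy - means
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(row_data_adjust))
order = np.argsort(eigenvalues)[::-1]

# Row feature vector built from the four largest components only.
row_feature_vector = eigenvectors[:, order[:4]].T
final_data = row_feature_vector @ row_data_adjust
denoised = row_feature_vector.T @ final_data + means

print("error before:", np.linalg.norm(noisy - clean))
print("error after: ", np.linalg.norm(denoised - clean))
```

With real spectra the number of components to keep would need to be judged from the eigenvalues or from inspection of the components, as in the PC1 to PC5 example earlier.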