Title: Principal Components Analysis
1. Principal Components Analysis
- Hal Whitehead
- BIOL4062/5062
2. Principal Components Analysis (PCA)
- Also known as
- latent vectors
- latent variates
- principal axes
- principal factors
- etc
3. Principal Components Analysis: Principal purpose
- Reducing dimensionality
- large body of data to manageable set
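4. PCA: General methodology
- From k original variables: $x_1, x_2, \dots, x_k$
- Produce k new variables: $y_1, y_2, \dots, y_k$
- $y_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1k}x_k$
- $y_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2k}x_k$
- ...
- $y_k = a_{k1}x_1 + a_{k2}x_2 + \dots + a_{kk}x_k$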
such that:
- the $y_j$'s are uncorrelated (orthogonal)
- $y_1$ explains as much as possible of the original variance in the data set
- $y_2$ explains as much as possible of the remaining variance
- etc.
The $y_j$'s are the principal components.
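The construction described in the general methodology can be checked numerically. What follows is a minimal sketch, assuming NumPy and a synthetic data set (the random data and all variable names are illustrative, not from the lecture). It takes the coefficients from the eigenvectors of the correlation matrix, then confirms that the new variables are uncorrelated and ordered by decreasing variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 units, k = 4 correlated variables.
n, k = 200, 4
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))

# Standardize each variable to mean 0, SD 1 (correlation-matrix PCA).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix; eigh returns ascending order,
# so re-sort to put the largest eigenvalue first.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Row j of A holds the coefficients a_j1, ..., a_jk of component y_j.
A = eigvecs.T
Y = Z @ A.T  # column j holds y_j for every unit

# The components are mutually uncorrelated: off-diagonal covariances vanish.
cov_Y = np.cov(Y, rowvar=False)
off_diag = cov_Y - np.diag(np.diag(cov_Y))
print("uncorrelated:", np.allclose(off_diag, 0.0, atol=1e-8))

# Their variances are in decreasing order: y_1 explains the most variance.
print("component variances:", np.var(Y, axis=0, ddof=1))
```

The sign of each eigenvector is arbitrary, so a component may come out flipped relative to another program's output without changing the analysis.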
8. Principal Components Analysis
- Rotates multivariate dataset into a new configuration which is easier to interpret
- Purposes
- simplify data
- look at relationships between variables
- look at patterns of units
9. Principal Components Analysis
- Uses
- Correlation matrix, or
- Covariance matrix when variables in same units
(morphometrics, etc.)
10. Principal Components Analysis
- $a_{11}, a_{12}, \dots, a_{1k}$ is the 1st eigenvector of the correlation/covariance matrix, and gives the coefficients of the 1st principal component
- $a_{21}, a_{22}, \dots, a_{2k}$ is the 2nd eigenvector of the correlation/covariance matrix, and gives the coefficients of the 2nd principal component
- $a_{k1}, a_{k2}, \dots, a_{kk}$ is the kth eigenvector of the correlation/covariance matrix, and gives the coefficients of the kth principal component
11. Principal Components Analysis
- So, principal components are given by
- $y_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1k}x_k$
- $y_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2k}x_k$
- ...
- $y_k = a_{k1}x_1 + a_{k2}x_2 + \dots + a_{kk}x_k$
- The $x_j$'s are standardized if the correlation matrix is used (mean 0.0, SD 1.0)
12. Principal Components Analysis
- Score of ith unit on jth principal component
- $y_{i,j} = a_{j1}x_{i1} + a_{j2}x_{i2} + \dots + a_{jk}x_{ik}$
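The score formula can be written out directly. This is a small sketch, assuming NumPy and a made-up 5-by-3 data matrix (the numbers and the helper name `score` are hypothetical). It computes one score term by term, as in the formula, and all scores at once as a matrix product.

```python
import numpy as np

# Hypothetical data matrix: 5 units (rows) by 3 variables (columns).
X = np.array([
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.8],
    [6.0, 12.0, 1.2],
])

# Standardize, since the correlation matrix is used here.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Coefficients: eigenvectors of the correlation matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
A = eigvecs[:, order].T  # row j = coefficients of component j


def score(i, j):
    """Score of unit i on component j (0-based indices), summed term by term."""
    return sum(A[j, m] * Z[i, m] for m in range(Z.shape[1]))


# The same calculation for every unit and component at once.
scores = Z @ A.T

print(score(2, 0), scores[2, 0])
```

Rows of `scores` are units and columns are components, so a scores plot for the first two components uses `scores[:, 0]` and `scores[:, 1]`.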
13. PCA Scores
14. Principal Components Analysis
- Amount of variance accounted for by
- 1st principal component: $\lambda_1$, the 1st eigenvalue
- 2nd principal component: $\lambda_2$, the 2nd eigenvalue
- ...
- $\lambda_1 > \lambda_2 > \lambda_3 > \lambda_4 > \dots$
- Average $\lambda_j = 1$ (correlation matrix)
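A short derivation shows why the eigenvalues average to 1 when the correlation matrix is used. It relies only on the standard fact that the eigenvalues of a matrix sum to its trace. The symbols $R$ (the k-by-k correlation matrix), $S$ (the covariance matrix) and $s_j^2$ (the variance of variable j) are introduced here for illustration.

```latex
% Correlation matrix: every diagonal element is 1.
\sum_{j=1}^{k} \lambda_j = \operatorname{tr}(R) = \sum_{j=1}^{k} 1 = k
\quad\Longrightarrow\quad
\frac{1}{k}\sum_{j=1}^{k} \lambda_j = 1

% Covariance matrix: the diagonal holds the variances of the variables.
\sum_{j=1}^{k} \lambda_j = \operatorname{tr}(S) = \sum_{j=1}^{k} s_j^2
```

So with the covariance matrix the eigenvalues sum to the total variance of the original variables, and their mean is 1 only if the variances happen to average 1.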
15. Principal Components Analysis: Eigenvalues
16. PCA Terminology
- jth principal component is jth eigenvector of correlation/covariance matrix
- coefficients, $a_{jk}$, are elements of eigenvectors and relate original variables (standardized if using correlation matrix) to components
- scores are values of units on components (produced using coefficients)
- amount of variance accounted for by component is given by eigenvalue, $\lambda_j$
- proportion of variance accounted for by component is given by $\lambda_j / \sum \lambda_j$
- loading of kth original variable on jth component is given by $a_{jk}\sqrt{\lambda_j}$: the correlation between variable and component
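The statement that a loading is the correlation between a variable and a component can be verified numerically. The sketch below assumes NumPy, synthetic data, unit-length eigenvectors, and the correlation matrix (with the covariance matrix the same product gives a covariance, not a correlation). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption): 300 units, 3 correlated variables.
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs  # column j = scores on component j

# Loadings from the formula: coefficient times square root of the eigenvalue.
# Rows are original variables, columns are components.
loadings = eigvecs * np.sqrt(eigvals)

# Correlations computed directly: covariance of each standardized variable
# with each component, divided by the component's SD (the variable's SD is 1).
corr = (Z.T @ scores) / (n - 1) / scores.std(axis=0, ddof=1)

print(np.allclose(loadings, corr))

# Proportion of variance accounted for by each component.
proportion = eigvals / eigvals.sum()
print(np.round(proportion, 3))
```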
17. How many components to use?
- If $\lambda_j < 1$ then the component explains less variance than an original variable (correlation matrix)
- Use 2 components (or 3) for visual ease
- Scree diagram
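These rules of thumb are easy to tabulate. The sketch below assumes NumPy and uses invented eigenvalues for a correlation matrix of six variables (they sum to 6). The function name `summarize_eigenvalues` is hypothetical. The printed table is the numerical content of a scree diagram, which plots eigenvalue against component number.

```python
import numpy as np


def summarize_eigenvalues(eigvals):
    """Return proportion, cumulative proportion, and count of eigenvalues > 1."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    proportion = eigvals / eigvals.sum()
    cumulative = np.cumsum(proportion)
    n_above_one = int(np.sum(eigvals > 1.0))
    return proportion, cumulative, n_above_one


# Hypothetical eigenvalues, already in decreasing order.
example = [2.9, 1.4, 0.8, 0.5, 0.3, 0.1]
proportion, cumulative, n_keep = summarize_eigenvalues(example)

for j, (lam, p, c) in enumerate(zip(example, proportion, cumulative), start=1):
    print(f"component {j}: eigenvalue {lam:.2f}, proportion {p:.3f}, cumulative {c:.3f}")
print("components with eigenvalue > 1:", n_keep)
```

The eigenvalue-greater-than-1 rule only makes sense for the correlation matrix, where each original variable contributes a variance of exactly 1.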
18. Principal Components Analysis on
- Covariance Matrix
- Variables must be in same units
- Emphasizes variables with most variance
- Mean eigenvalue ≠ 1.0
- Useful in morphometrics, a few other cases
- Correlation Matrix
- Variables are standardized (mean 0.0, SD 1.0)
- Variables can be in different units
- All variables have same impact on analysis
- Mean eigenvalue 1.0
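The contrast between the two matrices shows up clearly on data whose variables differ greatly in variance. The following sketch assumes NumPy and synthetic measurements (three variables in the same units but with very different spreads, all driven by one shared signal). The data and names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical measurements: a shared "size" signal plus noise,
# with standard deviations differing by factors of 100 and 1000.
size = rng.normal(size=n)
X = np.column_stack([
    100.0 * (size + 0.5 * rng.normal(size=n)),  # largest variance
    1.0 * (size + 0.5 * rng.normal(size=n)),
    0.1 * (size + 0.5 * rng.normal(size=n)),    # smallest variance
])


def sorted_eigen(M):
    """Eigenvalues and eigenvectors of a symmetric matrix, largest first."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]


cov_vals, cov_vecs = sorted_eigen(np.cov(X, rowvar=False))
cor_vals, cor_vecs = sorted_eigen(np.corrcoef(X, rowvar=False))

# Covariance matrix: the first component is almost entirely the first variable.
print("covariance PC1 |coefficients|:", np.round(np.abs(cov_vecs[:, 0]), 3))
# Correlation matrix: the three variables contribute about equally.
print("correlation PC1 |coefficients|:", np.round(np.abs(cor_vecs[:, 0]), 3))

print("mean eigenvalue (covariance):", cov_vals.mean())
print("mean eigenvalue (correlation):", cor_vals.mean())
```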
19. PCA: Potential Problems
- Lack of Independence
- NO PROBLEM
- Lack of Normality
- Normality desirable but not essential
- Lack of Precision
- Precision desirable but not essential
- Many Zeroes in Data Matrix
- Problem (use Correspondence Analysis)
20. Hourly records of sperm whale behaviour
- Data collected
- Off Galapagos Islands
- 1985 and 1987
- Units
- hours spent following sperm whales
- 440 hours
- Variables
- Mean cluster size
- Max. cluster size
- Mean speed
- Heading consistency
- Fluke-up rate
- Breach rate
- Lobtail rate
- Spyhop rate
- Sidefluke rate
- Coda rate
- Creak rate
- High click rate
23. Scores plots
24. Rotations of Principal Components (Exploratory Factor Analysis)
- Factors are rotated components
- (just rotate a few principal components)
- Varimax tries to maximize variance of squared loadings for each factor (orthogonal)
- lines up factors with original variables
- improves interpretability of factors
- Quartimax tries to minimize sums of squares of products of loadings (orthogonal)
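A varimax rotation can be sketched in a few lines. The code below assumes NumPy and uses a common SVD-based iteration for the varimax criterion. It is one possible implementation, not necessarily the one used by the software behind these slides. The input is the two-component loading matrix from the US crime example in this lecture, treated here as unrotated loadings (an assumption). The function names are illustrative.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x factors) loading matrix toward the varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Target matrix derived from the gradient of the varimax criterion.
        target = rotated**3 - rotated * (np.sum(rotated**2, axis=0) / p)
        u, s, vt = np.linalg.svd(L.T @ target)
        R = u @ vt  # nearest orthogonal matrix to L.T @ target
        if s.sum() < previous * (1.0 + tol):
            break
        previous = s.sum()
    return L @ R, R


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.sum(np.var(np.asarray(loadings, dtype=float) ** 2, axis=0)))


# Loadings on components 1 and 2 from the crime statistics example.
crime = np.array([
    [0.557, -0.771],  # MURDER
    [0.851, -0.139],  # RAPE
    [0.782, 0.055],   # ROBBERY
    [0.784, -0.546],  # ASSAULT
    [0.881, 0.308],   # BURGLARY
    [0.728, 0.480],   # LARCENY
    [0.714, 0.438],   # AUTOTHFT
])

rotated, R = varimax(crime)
print(np.round(rotated, 3))
print("criterion before:", varimax_criterion(crime))
print("criterion after: ", varimax_criterion(rotated))
```

Because the rotation is orthogonal, each variable's communality (the sum of its squared loadings across the retained factors) is unchanged. Only the way that variance is shared among the factors moves.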
25. US Crime Statistics
- Variables
- Murder
- Rape
- Robbery
- Assault
- Burglary
- Larceny
- Autotheft
26. Crime Statistics
- Component loadings
  Variable    Component 1   Component 2
  MURDER         0.557        -0.771
  RAPE           0.851        -0.139
  ROBBERY        0.782         0.055
  ASSAULT        0.784        -0.546
  BURGLARY       0.881         0.308
  LARCENY        0.728         0.480
  AUTOTHFT       0.714         0.438
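The eigenvalues behind this table can be recovered from the loadings alone. Since a loading is the coefficient times the square root of the eigenvalue, and eigenvectors have unit length, the squared loadings in each column sum to that component's eigenvalue. The sketch below assumes NumPy and assumes the analysis used the correlation matrix of the seven variables, which the slides do not state, so that the total variance is 7.

```python
import numpy as np

variables = ["MURDER", "RAPE", "ROBBERY", "ASSAULT", "BURGLARY", "LARCENY", "AUTOTHFT"]
loadings = np.array([
    [0.557, -0.771],
    [0.851, -0.139],
    [0.782, 0.055],
    [0.784, -0.546],
    [0.881, 0.308],
    [0.728, 0.480],
    [0.714, 0.438],
])

# Column sums of squared loadings give the eigenvalues (about 4.08 and 1.43).
eigenvalues = np.sum(loadings**2, axis=0)

# Proportion of total variance, assuming 7 standardized variables.
proportion = eigenvalues / len(variables)

# Row sums give each variable's communality: the share of its variance
# captured by the two components together.
communality = np.sum(loadings**2, axis=1)

print("eigenvalues:", np.round(eigenvalues, 3))
print("proportion of variance:", np.round(proportion, 3))
for name, h2 in zip(variables, communality):
    print(f"{name:9s} communality {h2:.3f}")
```

Under that assumption the two components together account for roughly 79% of the variance. The signs of the second column separate MURDER and ASSAULT (negative) from BURGLARY, LARCENY and AUTOTHFT (positive).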
27. Crime Statistics: Component Loadings
28. Crime Statistics: Scores Plot
Crimes against people
Crimes against property
29. Procedure for principal components analysis
- 1. Decide whether to use correlation or covariance matrix
- 2. Find eigenvectors (components) and eigenvalues (variance accounted for)
- 3. Decide how many components to use by examining eigenvalues (perhaps using scree diagram)
- 4. Examine loadings (perhaps vector loading plot)
- 5. Plot scores
- 6. Try rotation, then go to step 4
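Steps 1 to 5 of the procedure can be gathered into one function. This is a sketch under stated assumptions: it uses NumPy, the function name `pca` and its return keys are invented here, the data are synthetic, and plotting and rotation (step 6) are left out.

```python
import numpy as np


def pca(X, use_correlation=True):
    """Principal components of X (units in rows, variables in columns)."""
    X = np.asarray(X, dtype=float)
    data = X - X.mean(axis=0)
    if use_correlation:
        # Step 1: correlation matrix, so standardize to SD 1.
        data = data / X.std(axis=0, ddof=1)
    matrix = np.cov(data, rowvar=False)

    # Step 2: eigenvectors (components) and eigenvalues (variance accounted for).
    eigvals, eigvecs = np.linalg.eigh(matrix)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative rounding

    return {
        "eigenvalues": eigvals,
        "proportion": eigvals / eigvals.sum(),
        "coefficients": eigvecs.T,              # row j = component j
        "loadings": eigvecs * np.sqrt(eigvals),  # correlations only if use_correlation
        "scores": data @ eigvecs,               # column j = scores on component j
    }


rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # synthetic data
result = pca(X, use_correlation=True)

# Step 3: examine eigenvalues to decide how many components to keep.
print(np.round(result["eigenvalues"], 3))
print(np.round(np.cumsum(result["proportion"]), 3))

# Step 4: examine loadings on the components kept (here the first two).
print(np.round(result["loadings"][:, :2], 3))

# Step 5: scores to plot, one point per unit.
first_two_scores = result["scores"][:, :2]
```

Passing `use_correlation=False` analyses the covariance matrix instead, which is appropriate only when the variables share the same units.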