Title: David Newman, UC Irvine, Lecture 11: Dimensionality Reduction
1 CS 277 Data Mining, Lecture 11: Dimensionality Reduction
- David Newman
- Department of Computer Science
- University of California, Irvine
2 Notices
- Homework 2 due Tuesday Nov 6 in class
- Let's make the Weka question EXTRA CREDIT
- Any questions?
3 Today's lecture
- Dimension reduction methods
- Motivation
- Linear projection techniques
4 Notation Reminder
- n objects, each with p measurements
- x(i) = data vector for the ith object
- X = n x p data matrix
- x(i,j) is the entry in the ith row, jth column
- columns -> variables
- rows -> data points
- Can define distances/similarities
- between rows (data vectors i)
- between columns (variables j)
5 Distance
- Euclidean distance makes sense in the case where the different measurements are commensurate: each variable is measured in the same units.
- If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
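For reference, a minimal statement of the Euclidean distance this slide presumably refers to, between the ith and jth data vectors:

\[
d_E\big(x(i), x(j)\big) \;=\; \sqrt{\sum_{k=1}^{p} \big(x_{ik} - x_{jk}\big)^2}
\]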
6 Dependence among Variables
- Covariance and correlation measure linear dependence (distance between variables, not objects)
- Assume we have two variables or attributes X and Y, and n objects taking on values x(1), ..., x(n) and y(1), ..., y(n). The sample covariance of X and Y is given below.
- The covariance is a measure of how X and Y vary together:
- it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
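The sample covariance referred to above, written out (assuming the common 1/n normalization; 1/(n-1) is also widely used):

\[
\mathrm{cov}(X, Y) \;=\; \frac{1}{n}\sum_{i=1}^{n} \big(x(i) - \bar{x}\big)\big(y(i) - \bar{y}\big)
\]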
7 Correlation coefficient
- Covariance depends on the ranges of X and Y
- Standardize by dividing by the standard deviations
- The linear correlation coefficient is defined as shown below
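The definition referred to above, i.e., the covariance standardized by the two standard deviations (Pearson's linear correlation coefficient):

\[
\rho(X, Y) \;=\; \frac{\mathrm{cov}(X, Y)}{\sigma_X\,\sigma_Y}
\;=\; \frac{\sum_{i=1}^{n}\big(x(i)-\bar{x}\big)\big(y(i)-\bar{y}\big)}
         {\sqrt{\sum_{i=1}^{n}\big(x(i)-\bar{x}\big)^2}\;\sqrt{\sum_{i=1}^{n}\big(y(i)-\bar{y}\big)^2}}
\]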
8 Sample Correlation Matrix
Data on characteristics of Boston suburbs
[Figure: sample correlation matrix shown as a heatmap with a color scale from -1 to 1; labeled variables include average rooms and median house value]
9 Mahalanobis distance (between objects)

\[
d_{MH}\big(x(i), x(j)\big) \;=\; \sqrt{\big(x(i) - x(j)\big)^T \, \Sigma^{-1} \, \big(x(i) - x(j)\big)}
\]

- (x(i) - x(j)) is the vector difference in p-dimensional space
- \Sigma^{-1} is the inverse covariance matrix
- the whole expression evaluates to a scalar distance
- It automatically accounts for the scaling of the coordinate axes
- It corrects for correlation between the different features
- Cost:
- The covariance matrices can be hard to determine accurately
- The memory and time requirements grow quadratically, O(p^2), rather than linearly with the number of features.
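A minimal Matlab sketch of the distance above, assuming X is an n x p data matrix with rows as objects (the sample data and variable names here are illustrative only):

X   = randn(100, 3) * [2 0 0; 0 1 0; 0.5 0 1];  % illustrative correlated data, n = 100, p = 3
S   = cov(X);                                   % p x p sample covariance matrix
xi  = X(1,:);  xj = X(2,:);                     % two objects (row vectors)
d   = xi - xj;                                  % vector difference in p-dimensional space
dMH = sqrt(d * (S \ d'));                       % Mahalanobis distance (a scalar)
dE  = norm(d);                                  % Euclidean distance, for comparison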
10 Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic -> all dimensions have equal variance -> MH distance reduces to Euclidean distance
11 Example 2 of Mahalanobis distance
Covariance matrix is diagonal but non-isotropic -> dimensions do not have equal variance -> MH distance reduces to weighted Euclidean distance with weights = inverse variance
12 Example 2 of Mahalanobis distance
The two outer blue points have the same MH distance to the center blue point
13 Dimension Reduction methods
- Dimension reduction
- From p-dimensional x to d-dimensional x', d < p
- Techniques
- Variable selection
- use an algorithm to find individual variables in x that are relevant to the problem and discard the rest
- e.g., stepwise logistic regression
- Linear projections
- Project data to a lower-dimensional space
- E.g., principal components
- Non-linear embedding
- Use a non-linear mapping to embed data in a lower-dimensional space
- E.g., multidimensional scaling
14 Dimension Reduction: why is it useful?
- In general, it incurs a loss of information about x
- So why do this?
- If dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable
- e.g., clustering example:
- 100-dimensional data
- but cluster structure is only present in 2 of the dimensions; the others are just noise
- if the other 98 dimensions are just noise (relative to cluster structure), then clusters will be much easier to discover if we just focus on the 2d space
- Dimension reduction can also provide interpretation/insight
- e.g., for 2d visualization purposes
- Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal
15 Variable Selection Methods
- p variables; we would like to use a smaller subset in our model
- e.g., in classification, do k-NN in d-space rather than p-space
- e.g., for logistic regression, use d inputs rather than p
- Problem
- Number of subsets of p variables is O(2^p)
- Exhaustive search is impossible except for very small p
- Typically the search problem is NP-hard
- Common solution
- Local systematic search (e.g., add/delete variables 1 at a time) to locally maximize a score function (i.e., hill-climbing); a sketch follows this list
- e.g., add a variable, build a new model, generate a new score, etc.
- Can often work well, but can get trapped in local maxima/minima
- Can also be computationally intensive (depends on the model)
- Note: some techniques such as decision tree predictors automatically perform dimension reduction as part of the learning algorithm.
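A hedged Matlab sketch of the hill-climbing (forward selection) idea mentioned above; p is the number of variables and scoreFcn is an assumed stand-in for whatever score is used, e.g., cross-validated accuracy of a model built on the chosen columns:

selected  = [];                          % current subset of variable indices
remaining = 1:p;                         % candidate variables not yet selected
bestScore = -Inf;
improved  = true;
while improved && ~isempty(remaining)
    improved = false;
    for v = remaining
        s = scoreFcn([selected v]);      % score of a model using these variables
        if s > bestScore
            bestScore = s;  bestVar = v;  improved = true;
        end
    end
    if improved                          % add the single best variable, then repeat
        selected  = [selected bestVar];
        remaining = setdiff(remaining, bestVar);
    end
end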
16 Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p
Assume a^T a = 1 (i.e., unit norm)
a^T x = sum_j a_j x_j = projection of x onto vector a,
  gives the distance of the projected x along a
e.g., a^T = [1 0]        -> projection along the 1st dimension
      a^T = [0 1]        -> projection along the 2nd dimension
      a^T = [0.71 0.71]  -> projection along the diagonal
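The three example projections above, checked in Matlab (x here is an arbitrary 2-dimensional point chosen for illustration):

x  = [3; 1];                    % an arbitrary 2-dimensional data vector
a1 = [1; 0];                    % unit vector along the 1st dimension
a2 = [0; 1];                    % unit vector along the 2nd dimension
a3 = [0.71; 0.71];              % (approximately) unit vector along the diagonal
[a1'*x, a2'*x, a3'*x]           % scalar distances of x along each direction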
17 Example of projection from 2d to 1d
[Figure: 2d data in the (x1, x2) plane with an arrow showing the direction of weight vector a]
18 Projections to more than 1 dimension
- Multidimensional projections
- e.g., x is 4-dimensional
- a1^T = [0.71 0.71 0 0]
- a2^T = [0 0 0.71 0.71]
- A^T x -> coordinates of x in the 2d space spanned by the columns of A
- -> linear transformation from 4d to 2d space
- where A = [a1 a2]
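The 4d-to-2d example above in Matlab, with an arbitrary illustrative x:

a1 = [0.71; 0.71; 0;    0   ];
a2 = [0;    0;    0.71; 0.71];
A  = [a1 a2];                   % p x d = 4 x 2, columns are projection directions
x  = [1; 2; 3; 4];              % an arbitrary 4-dimensional data vector
y  = A' * x;                    % 2 x 1 coordinates of x in the 2d projected space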
19 Principal Components Analysis (PCA)
X = p x n data matrix, columns = p-dimensional data vectors
Let a = weight vector, also of dimension p
Assume a^T a = 1 (i.e., unit norm)
a^T X = projection of each column x onto vector a
      = vector of distances of the projected x vectors along a
PCA: find the vector a such that var(a^T X) is maximized,
  i.e., find the linear projection with maximal variance
More generally: A^T X = d x n data matrix with the x vectors projected to a d-dimensional space, where size(A) = p x d
PCA: find d orthogonal columns of A such that the variance in the d-dimensional projected space is maximized, d < p
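Stated as an optimization (a standard way to write the criterion above): the first principal component direction solves

\[
\max_{a\,:\,a^T a = 1}\; \mathrm{var}\big(a^T X\big) \;=\; \max_{a\,:\,a^T a = 1}\; a^T S\, a ,
\]

where S is the p x p sample covariance matrix of the data; the maximizer is the eigenvector of S with the largest eigenvalue, and the d orthogonal columns of A are the eigenvectors with the d largest eigenvalues.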
20 PCA Example
[Figure: 2d data in the (x1, x2) plane showing the direction of the 1st principal component vector (highest variance projection)]
21 PCA Example
[Figure: same 2d data in the (x1, x2) plane, now showing the directions of both the 1st principal component vector (highest variance projection) and the 2nd principal component vector]
22 Notes on PCA
- Basis representation and approximation error
- Scree diagrams
- Computational complexity of computing PCA
- Equivalent to solving a set of linear equations, matrix inversion, singular value decomposition, etc.
- Scales in general as O(np^2 + p^3)
- Many numerical tricks are possible, e.g., for sparse X matrices, for finding only the first k eigenvectors, etc.
- In MATLAB can use eig or svd (also note the sparse versions, eigs and svds)
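A minimal Matlab sketch of PCA along these lines, assuming X is an n x p data matrix with rows as data vectors (note this is the transpose of the p x n convention on slide 19):

d  = 2;                                      % number of components to keep
Xc = X - repmat(mean(X, 1), size(X, 1), 1);  % mean-center each variable (column)
S  = cov(Xc);                                % p x p sample covariance matrix
[V, D] = eig(S);                             % columns of V are eigenvectors of S
[evals, idx] = sort(diag(D), 'descend');     % order components by decreasing variance
A  = V(:, idx(1:d));                         % p x d matrix of top d principal directions
Y  = Xc * A;                                 % n x d data projected onto the components
% equivalently, via the SVD of the centered data:
% [U, Sv, V] = svd(Xc, 'econ');  A = V(:, 1:d);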
23 Examples
24 20 face images
Courtesy of Matthew Turk, Alex Pentland
25 First 16 Eigenimages
26 First 4 eigenimages
27 Reconstruction of First Image with 8 eigenimages
[Figure: original image and reconstructed image shown side by side]
28 Reconstruction of first image with 8 eigenimages
Weights: -14.0  9.4  -1.1  -3.5  -9.8  -3.5  -0.6  0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
29 Reconstruction of 7th image with eigenimages
[Figure: original image and reconstructed image shown side by side]
30 Reconstruction of 7th image with 8 eigenimages
Weights:             -13.7  12.9   1.6   4.4   3.0   0.9   1.6  -6.3
Weights for Image 1: -14.0   9.4  -1.1  -3.5  -9.8  -3.5  -0.6   0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
31 Reconstructing Image 6 with 16 eigenimages
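A hedged Matlab sketch of the reconstruction idea behind slides 27-31: each face is approximated by the mean image plus a weighted sum of the leading eigenimages. The names F, mu, and E below are assumptions for illustration, not the code used to produce the figures:

% F : n x q matrix, each row a vectorized face image
% mu: 1 x q mean face image
% E : q x k matrix whose columns are the first k eigenimages
k     = 8;
w     = (F(1,:) - mu) * E(:, 1:k);    % 1 x k weights for the first image
recon = mu + w * E(:, 1:k)';          % reconstruction = mean + weighted sum of eigenimages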
32 PCA for text
- [U,S,V] = svds(X)
- In text, this is called Latent Semantic Indexing (LSI)
- -> Matlab (a sketch follows below)
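A small sketch of how the LSI computation above might look, assuming X is a (typically sparse) term-document count matrix and keeping k latent dimensions:

k = 100;                        % number of latent semantic dimensions (illustrative choice)
[U, S, V] = svds(X, k);         % truncated SVD of the sparse term-document matrix
termCoords = U;                 % (num terms) x k: terms in the latent space
docCoords  = S * V';            % k x (num docs): documents in the latent space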
33 PCA Example
- Data of rainfall over India
- -> Matlab movie
34 Principal components of rainfall over India, 0-3
35 Principal components of rainfall over India, 4-7
36 Issues
- Is PCA suitable for discrete/count data?
- Example:
- doc1 = cat cat dog
- doc2 = cat dog dog
- X = [2 1; 1 2]
- [U,S,V] = svd(X) -> Matlab (see the sketch below)
- Next week
- Probabilistic Latent Semantic Indexing
- Nonnegative Matrix Factorization
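The tiny cat/dog example worked out in Matlab (as referenced above); for this symmetric count matrix the singular values are 3 and 1:

X = [2 1; 1 2];        % term-document counts for the cat/dog example
[U, S, V] = svd(X);    % S = diag([3 1]); columns of U and V are +/- [1; 1]/sqrt(2) and +/- [1; -1]/sqrt(2)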