Title: David Newman, UC Irvine, Lecture 11: Dimensionality Reduction
1 CS 277 Data Mining, Lecture 11: Dimensionality Reduction
- David Newman
- Department of Computer Science
- University of California, Irvine
2 Notices
- Homework 2 due Tuesday Nov 6 in class
- Let's make the Weka question EXTRA CREDIT
- Any questions?
3 Today's lecture
- Dimension reduction methods
- Motivation
- Linear projection techniques
4 Notation Reminder
- n objects, each with p measurements
- x(i) = data vector for the ith object
- X = n x p data matrix
- x(i,j) is the entry in the ith row, jth column
- columns -> variables
- rows -> data points
- Can define distances/similarities
- between rows (data vectors i)
- between columns (variables j)
5 Distance
- Euclidean distance makes sense in the case where the different measurements are commensurate: each variable is measured in the same units.
- If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
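For reference, a minimal statement of the Euclidean distance this slide presumably refers to, between the ith and jth data vectors:

\[
d_E\big(x(i), x(j)\big) \;=\; \sqrt{\sum_{k=1}^{p} \big(x_{ik} - x_{jk}\big)^2}
\]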
6 Dependence among Variables
- Covariance and correlation measure linear dependence (distance between variables, not objects)
- Assume we have two variables or attributes X and Y, and n objects taking on values x(1), ..., x(n) and y(1), ..., y(n). The sample covariance of X and Y is given below.
- The covariance is a measure of how X and Y vary together:
- it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
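The sample covariance referred to above, written out (assuming the common 1/n normalization; 1/(n-1) is also widely used):

\[
\mathrm{cov}(X, Y) \;=\; \frac{1}{n}\sum_{i=1}^{n} \big(x(i) - \bar{x}\big)\big(y(i) - \bar{y}\big)
\]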
7 Correlation coefficient
- Covariance depends on the ranges of X and Y
- Standardize by dividing by the standard deviations
- The linear correlation coefficient is defined as shown below
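The definition referred to above, i.e., the covariance standardized by the two standard deviations (Pearson's linear correlation coefficient):

\[
\rho(X, Y) \;=\; \frac{\mathrm{cov}(X, Y)}{\sigma_X\,\sigma_Y}
\;=\; \frac{\sum_{i=1}^{n}\big(x(i)-\bar{x}\big)\big(y(i)-\bar{y}\big)}
         {\sqrt{\sum_{i=1}^{n}\big(x(i)-\bar{x}\big)^2}\;\sqrt{\sum_{i=1}^{n}\big(y(i)-\bar{y}\big)^2}}
\]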
8 Sample Correlation Matrix
Data on characteristics of Boston suburbs
[Figure: sample correlation matrix shown as a heatmap with a color scale from -1 to 1; labeled variables include average rooms and median house value]
9 Mahalanobis distance (between objects)

\[
d_{MH}\big(x(i), x(j)\big) \;=\; \sqrt{\big(x(i) - x(j)\big)^T \, \Sigma^{-1} \, \big(x(i) - x(j)\big)}
\]

- (x(i) - x(j)) is the vector difference in p-dimensional space
- \Sigma^{-1} is the inverse covariance matrix
- the whole expression evaluates to a scalar distance
- It automatically accounts for the scaling of the coordinate axes
- It corrects for correlation between the different features
- Cost:
- The covariance matrices can be hard to determine accurately
- The memory and time requirements grow quadratically, O(p^2), rather than linearly with the number of features.
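A minimal Matlab sketch of the distance above, assuming X is an n x p data matrix with rows as objects (the sample data and variable names here are illustrative only):

X   = randn(100, 3) * [2 0 0; 0 1 0; 0.5 0 1];  % illustrative correlated data, n = 100, p = 3
S   = cov(X);                                   % p x p sample covariance matrix
xi  = X(1,:);  xj = X(2,:);                     % two objects (row vectors)
d   = xi - xj;                                  % vector difference in p-dimensional space
dMH = sqrt(d * (S \ d'));                       % Mahalanobis distance (a scalar)
dE  = norm(d);                                  % Euclidean distance, for comparison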
10 Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic -> all dimensions have equal variance -> MH distance reduces to Euclidean distance
11 Example 2 of Mahalanobis distance
Covariance matrix is diagonal but non-isotropic -> dimensions do not have equal variance -> MH distance reduces to weighted Euclidean distance with weights = inverse variance
12 Example 2 of Mahalanobis distance
The two outer blue points have the same MH distance to the center blue point
13 Dimension Reduction methods
- Dimension reduction
- From p-dimensional x to d-dimensional x', d < p
- Techniques
- Variable selection
- use an algorithm to find individual variables in x that are relevant to the problem and discard the rest
- e.g., stepwise logistic regression
- Linear projections
- Project data to a lower-dimensional space
- E.g., principal components
- Non-linear embedding
- Use a non-linear mapping to embed data in a lower-dimensional space
- E.g., multidimensional scaling
14 Dimension Reduction: why is it useful?
- In general, it incurs a loss of information about x
- So why do this?
- If dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable
- e.g., clustering example:
- 100-dimensional data
- but cluster structure is only present in 2 of the dimensions; the others are just noise
- if the other 98 dimensions are just noise (relative to cluster structure), then clusters will be much easier to discover if we just focus on the 2d space
- Dimension reduction can also provide interpretation/insight
- e.g., for 2d visualization purposes
- Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal
15 Variable Selection Methods
- p variables; we would like to use a smaller subset in our model
- e.g., in classification, do k-NN in d-space rather than p-space
- e.g., for logistic regression, use d inputs rather than p
- Problem
- Number of subsets of p variables is O(2^p)
- Exhaustive search is impossible except for very small p
- Typically the search problem is NP-hard
- Common solution
- Local systematic search (e.g., add/delete variables 1 at a time) to locally maximize a score function (i.e., hill-climbing); a sketch follows this list
- e.g., add a variable, build a new model, generate a new score, etc.
- Can often work well, but can get trapped in local maxima/minima
- Can also be computationally intensive (depends on the model)
- Note: some techniques such as decision tree predictors automatically perform dimension reduction as part of the learning algorithm.
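A hedged Matlab sketch of the hill-climbing (forward selection) idea mentioned above; p is the number of variables and scoreFcn is an assumed stand-in for whatever score is used, e.g., cross-validated accuracy of a model built on the chosen columns:

selected  = [];                          % current subset of variable indices
remaining = 1:p;                         % candidate variables not yet selected
bestScore = -Inf;
improved  = true;
while improved && ~isempty(remaining)
    improved = false;
    for v = remaining
        s = scoreFcn([selected v]);      % score of a model using these variables
        if s > bestScore
            bestScore = s;  bestVar = v;  improved = true;
        end
    end
    if improved                          % add the single best variable, then repeat
        selected  = [selected bestVar];
        remaining = setdiff(remaining, bestVar);
    end
end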
16 Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p
Assume a^T a = 1 (i.e., unit norm)
a^T x = sum_j a_j x_j = projection of x onto vector a,
  gives the distance of the projected x along a
e.g., a^T = [1 0]        -> projection along the 1st dimension
      a^T = [0 1]        -> projection along the 2nd dimension
      a^T = [0.71 0.71]  -> projection along the diagonal
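The three example projections above, checked in Matlab (x here is an arbitrary 2-dimensional point chosen for illustration):

x  = [3; 1];                    % an arbitrary 2-dimensional data vector
a1 = [1; 0];                    % unit vector along the 1st dimension
a2 = [0; 1];                    % unit vector along the 2nd dimension
a3 = [0.71; 0.71];              % (approximately) unit vector along the diagonal
[a1'*x, a2'*x, a3'*x]           % scalar distances of x along each direction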
17 Example of projection from 2d to 1d
[Figure: 2d data in the (x1, x2) plane with an arrow showing the direction of weight vector a]
18 Projections to more than 1 dimension
- Multidimensional projections
- e.g., x is 4-dimensional
- a1^T = [0.71 0.71 0 0]
- a2^T = [0 0 0.71 0.71]
- A^T x -> coordinates of x in the 2d space spanned by the columns of A
- -> linear transformation from 4d to 2d space
- where A = [a1 a2]
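The 4d-to-2d example above in Matlab, with an arbitrary illustrative x:

a1 = [0.71; 0.71; 0;    0   ];
a2 = [0;    0;    0.71; 0.71];
A  = [a1 a2];                   % p x d = 4 x 2, columns are projection directions
x  = [1; 2; 3; 4];              % an arbitrary 4-dimensional data vector
y  = A' * x;                    % 2 x 1 coordinates of x in the 2d projected space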
19 Principal Components Analysis (PCA)
X = p x n data matrix, columns = p-dimensional data vectors
Let a = weight vector, also of dimension p
Assume a^T a = 1 (i.e., unit norm)
a^T X = projection of each column x onto vector a
      = vector of distances of the projected x vectors along a
PCA: find the vector a such that var(a^T X) is maximized,
  i.e., find the linear projection with maximal variance
More generally: A^T X = d x n data matrix with the x vectors projected to a d-dimensional space, where size(A) = p x d
PCA: find d orthogonal columns of A such that the variance in the d-dimensional projected space is maximized, d < p
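Stated as an optimization (a standard way to write the criterion above): the first principal component direction solves

\[
\max_{a\,:\,a^T a = 1}\; \mathrm{var}\big(a^T X\big) \;=\; \max_{a\,:\,a^T a = 1}\; a^T S\, a ,
\]

where S is the p x p sample covariance matrix of the data; the maximizer is the eigenvector of S with the largest eigenvalue, and the d orthogonal columns of A are the eigenvectors with the d largest eigenvalues.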
20 PCA Example
[Figure: 2d data in the (x1, x2) plane showing the direction of the 1st principal component vector (highest variance projection)]
21 PCA Example
[Figure: same 2d data in the (x1, x2) plane, now showing the directions of both the 1st principal component vector (highest variance projection) and the 2nd principal component vector]
22 Notes on PCA
- Basis representation and approximation error
- Scree diagrams
- Computational complexity of computing PCA
- Equivalent to solving a set of linear equations, matrix inversion, singular value decomposition, etc.
- Scales in general as O(np^2 + p^3)
- Many numerical tricks are possible, e.g., for sparse X matrices, for finding only the first k eigenvectors, etc.
- In MATLAB can use eig or svd (also note the sparse versions, eigs and svds)
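A minimal Matlab sketch of PCA along these lines, assuming X is an n x p data matrix with rows as data vectors (note this is the transpose of the p x n convention on slide 19):

d  = 2;                                      % number of components to keep
Xc = X - repmat(mean(X, 1), size(X, 1), 1);  % mean-center each variable (column)
S  = cov(Xc);                                % p x p sample covariance matrix
[V, D] = eig(S);                             % columns of V are eigenvectors of S
[evals, idx] = sort(diag(D), 'descend');     % order components by decreasing variance
A  = V(:, idx(1:d));                         % p x d matrix of top d principal directions
Y  = Xc * A;                                 % n x d data projected onto the components
% equivalently, via the SVD of the centered data:
% [U, Sv, V] = svd(Xc, 'econ');  A = V(:, 1:d);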
23 Examples
24 20 face images
Courtesy of Matthew Turk, Alex Pentland
25 First 16 Eigenimages
26 First 4 eigenimages
27 Reconstruction of First Image with 8 eigenimages
[Figure: original image and reconstructed image shown side by side]
28 Reconstruction of first image with 8 eigenimages
Weights: -14.0  9.4  -1.1  -3.5  -9.8  -3.5  -0.6  0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
29 Reconstruction of 7th image with eigenimages
[Figure: original image and reconstructed image shown side by side]
30 Reconstruction of 7th image with 8 eigenimages
Weights:             -13.7  12.9   1.6   4.4   3.0   0.9   1.6  -6.3
Weights for Image 1: -14.0   9.4  -1.1  -3.5  -9.8  -3.5  -0.6   0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
31 Reconstructing Image 6 with 16 eigenimages
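A hedged Matlab sketch of the reconstruction idea behind slides 27-31: each face is approximated by the mean image plus a weighted sum of the leading eigenimages. The names F, mu, and E below are assumptions for illustration, not the code used to produce the figures:

% F : n x q matrix, each row a vectorized face image
% mu: 1 x q mean face image
% E : q x k matrix whose columns are the first k eigenimages
k     = 8;
w     = (F(1,:) - mu) * E(:, 1:k);    % 1 x k weights for the first image
recon = mu + w * E(:, 1:k)';          % reconstruction = mean + weighted sum of eigenimages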
32 PCA for text
- [U,S,V] = svds(X)
- In text, this is called Latent Semantic Indexing (LSI)
- -> Matlab (a sketch follows below)
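A small sketch of how the LSI computation above might look, assuming X is a (typically sparse) term-document count matrix and keeping k latent dimensions:

k = 100;                        % number of latent semantic dimensions (illustrative choice)
[U, S, V] = svds(X, k);         % truncated SVD of the sparse term-document matrix
termCoords = U;                 % (num terms) x k: terms in the latent space
docCoords  = S * V';            % k x (num docs): documents in the latent space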
33 PCA Example
- Data of rainfall over India
- -> Matlab movie
34 Principal components of rainfall over India, 0-3
35 Principal components of rainfall over India, 4-7
36 Issues
- Is PCA suitable for discrete/count data?
- Example:
- doc1 = cat cat dog
- doc2 = cat dog dog
- X = [2 1; 1 2]
- [U,S,V] = svd(X) -> Matlab (see the sketch below)
- Next week
- Probabilistic Latent Semantic Indexing
- Nonnegative Matrix Factorization
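The tiny cat/dog example worked out in Matlab (as referenced above); for this symmetric count matrix the singular values are 3 and 1:

X = [2 1; 1 2];        % term-document counts for the cat/dog example
[U, S, V] = svd(X);    % S = diag([3 1]); columns of U and V are +/- [1; 1]/sqrt(2) and +/- [1; -1]/sqrt(2)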