Transcript and Presenter's Notes

Title: David Newman, UC Irvine Lecture 11: Dimensionality Reduction 1


1
CS 277 Data Mining
Lecture 11: Dimensionality Reduction
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 2 due Tuesday Nov 6 in class
  • Let's make the Weka question EXTRA CREDIT
  • Any questions?

3
Today's lecture
  • Dimension reduction methods
  • Motivation
  • Linear projection techniques

4
Notation Reminder
  • n objects, each with p measurements
  • x(i) = data vector for the ith object
  • X = data matrix
  • x(i,j) is the entry in the ith row, jth column
  • columns → variables
  • rows → data points
  • Can define distances/similarities
  • between rows (data vectors i)
  • between columns (variables j)

5
Distance
  • Makes sense in the case where the different
    measurements are commensurate, i.e., each variable is
    measured in the same units.
  • If the measurements are different, say length
    and weight, Euclidean distance is not necessarily
    meaningful.

6
Dependence among Variables
  • Covariance and correlation measure linear
    dependence (distance between variables, not
    objects)
  • Assume we have two variables or attributes X and Y and n
    objects taking on values x(1), …, x(n) and y(1), …, y(n).
    The sample covariance of X and Y is
    cov(X, Y) = (1/n) Σᵢ (x(i) - x̄)(y(i) - ȳ)
  • The covariance is a measure of how X and Y vary together.
  • It will be large and positive if large values of X are
    associated with large values of Y, and small values of X
    with small values of Y.

7
Correlation coefficient
  • Covariance depends on the ranges of X and Y
  • Standardize by dividing by the standard deviations
  • The linear correlation coefficient is defined as
    ρ(X, Y) = cov(X, Y) / (σX σY)
    (see the sketch below)
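A minimal MATLAB sketch of both quantities, using made-up values for x
and y (note that MATLAB's cov, std, and corrcoef use the 1/(n-1)
denominator, while the slide formula above is written with 1/n):

    % Hypothetical paired measurements for two variables X and Y
    x = [2.1 3.4 1.9 4.0 3.3]';
    y = [1.0 2.2 0.8 2.9 2.5]';

    % Sample covariance written out (here with the 1/(n-1) denominator)
    n = length(x);
    cxy = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);

    % Correlation = covariance standardized by the two standard deviations
    rho = cxy / (std(x) * std(y));

    % Built-in equivalents for comparison
    C = cov(x, y);        % 2x2 covariance matrix; C(1,2) equals cxy
    R = corrcoef(x, y);   % 2x2 correlation matrix; R(1,2) equals rho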

8
Sample Correlation Matrix
Data on characteristics of Boston suburbs
[Figure: sample correlation matrix shown as a heatmap, color scale from
-1 to +1; variables include average number of rooms and median house
value.]
9
Mahalanobis distance (between objects)
d_MH(x, y) = sqrt( (x - y)ᵀ Σ⁻¹ (x - y) )
where (x - y) is the vector difference in p-dimensional space, Σ⁻¹ is
the inverse covariance matrix, and the expression evaluates to a scalar
distance (see the sketch after this slide).
  • It automatically accounts for the scaling of the
    coordinate axes
  • It corrects for correlation between the different
    features
  • Costs:
  • The covariance matrices can be hard to determine
    accurately
  • The memory and time requirements grow
    quadratically, O(p²), rather than linearly with
    the number of features.
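A minimal MATLAB sketch of this distance, with synthetic data standing
in for a real data set (the covariance S would normally be estimated
from all n objects):

    % Mahalanobis distance between two objects drawn from data X (n x p)
    X = randn(200, 3);            % synthetic data, n = 200, p = 3
    S = cov(X);                   % p x p sample covariance matrix

    x = X(1, :)';  y = X(2, :)';  % two objects as column vectors
    d = x - y;                    % vector difference in p-dimensional space
    dMH = sqrt(d' * (S \ d));     % (x - y)' * inv(S) * (x - y), a scalar

    % For comparison: Euclidean distance ignores scale and correlation
    dE = sqrt(d' * d);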

10
Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic → all dimensions have equal
variance → MH distance reduces to Euclidean distance
11
Example 2 of Mahalanobis distance
Covariance matrix is diagonal but non-isotropic → dimensions do not
have equal variance → MH distance reduces to weighted Euclidean
distance with weights = inverse variance (see the sketch below)
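A quick MATLAB check of this special case, with invented per-dimension
variances:

    % With a diagonal (non-isotropic) covariance, the MH distance equals a
    % weighted Euclidean distance with weights = 1/variance per dimension
    v = [4 0.25];                        % hypothetical variances
    S = diag(v);                         % diagonal covariance matrix

    d = [1 1]';                          % some vector difference x - y
    dMH = sqrt(d' * (S \ d));            % Mahalanobis form
    dW  = sqrt(sum((d.^2) ./ v'));       % weighted Euclidean form, same value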
12
Example 2 of Mahalanobis distance
The two outer blue points have the same MH distance to the center blue
point
13
Dimension Reduction methods
  • Dimension reduction
  • From p-dimensional x to d-dimensional x′, d < p
  • Techniques
  • Variable selection
  • use an algorithm to find individual variables in
    x that are relevant to the problem and discard
    the rest
  • stepwise logistic regression
  • Linear projections
  • Project data to a lower-dimensional space
  • E.g., principal components
  • Non-linear embedding
  • Use a non-linear mapping to embed data in a
    lower-dimensional space
  • E.g., multidimensional scaling

14
Dimension Reduction: why is it useful?
  • In general, incurs loss of information about x
  • So why do this?
  • If dimensionality p is very large (e.g., 1000s),
    representing the data in a lower-dimensional
    space may make learning more reliable,
  • e.g., clustering example
  • 100 dimensional data
  • but cluster structure is only present in 2 of the
    dimensions, the others are just noise
  • if the other 98 dimensions are just noise (relative to the
    cluster structure), then the clusters will be much easier
    to discover if we just focus on the 2d space
  • Dimension reduction can also provide
    interpretation/insight
  • e.g., for 2d visualization purposes
  • Caveat: the 2-step approach of dimension reduction
    followed by a learning algorithm is in general suboptimal

15
Variable Selection Methods
  • p variables, would like to use a smaller subset
    in our model
  • e.g., in classification, do k-NN in d-space
    rather than p-space
  • e.g., for logistic regression, use d inputs
    rather than p
  • Problem:
  • The number of subsets of p variables is O(2^p)
  • Exhaustive search is impossible except for very small p
  • Typically the search problem is NP-hard
  • Common solution:
  • Local systematic search (e.g., add/delete variables 1 at a
    time) to locally maximize a score function (i.e.,
    hill-climbing); see the sketch after this list
  • e.g., add a variable, build a new model, generate a new
    score, etc.
  • Can often work well, but can get trapped in local
    maxima/minima
  • Can also be computationally intensive (depends on the
    model)
  • Note: some techniques, such as decision tree predictors,
    automatically perform dimension reduction as part of the
    learning algorithm.
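A toy MATLAB sketch of this greedy, add-one-variable-at-a-time search.
The score here is plain training R² of a linear fit, chosen only to
keep the example short; in practice a cross-validated score (or
classification accuracy for k-NN or logistic regression) would be used,
and the data below are synthetic:

    % Greedy forward selection (hill-climbing on a score function)
    X = randn(100, 10);
    y = X(:, [2 7]) * [1.5; -2] + 0.3 * randn(100, 1);   % only vars 2 and 7 matter

    % Training R^2 of a least-squares fit using the given columns
    score = @(cols) 1 - sum((y - X(:, cols) * (X(:, cols) \ y)).^2) ...
                      / sum((y - mean(y)).^2);

    selected  = [];
    remaining = 1:size(X, 2);
    for step = 1:3                                        % add up to 3 variables
        scores = arrayfun(@(j) score([selected j]), remaining);
        [~, idx] = max(scores);
        selected = [selected remaining(idx)];             % keep the best addition
        remaining(idx) = [];
    end
    selected    % greedily chosen variable indices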

16
Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p; assume aᵀa = 1 (unit norm)
aᵀx = Σⱼ aⱼ xⱼ = projection of x onto vector a;
gives the distance of the projected x along a
e.g., aᵀ = [1 0] → projection along the 1st dimension
      aᵀ = [0 1] → projection along the 2nd dimension
      aᵀ = [0.71, 0.71] → projection along the diagonal
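A minimal MATLAB sketch of this one-dimensional projection (the point x
and the direction a are made up):

    % Projecting a single p-dimensional point onto a unit vector a
    x = [3; 4];                 % a 2-d data point
    a = [0.71; 0.71];           % roughly the diagonal direction
    a = a / norm(a);            % enforce a' * a = 1 exactly

    proj = a' * x;              % scalar: distance of x along direction a

    % Projections along the coordinate axes recover the original coordinates
    a1 = [1; 0];  a2 = [0; 1];
    [a1' * x, a2' * x]          % equals [3, 4]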
17
Example of projection from 2d to 1d
[Figure: 2-d data (axes x1, x2) with the direction of the weight vector
a indicated.]
18
Projections to more than 1 dimension
  • Multidimensional projections
  • e.g., x is 4-dimensional
  • a1ᵀ = [0.71 0.71 0 0]
  • a2ᵀ = [0 0 0.71 0.71]
  • Aᵀx → coordinates of x in the 2d space spanned by the
    columns of A
  • → a linear transformation from 4d to 2d space
  • where A = [a1 a2] (see the sketch below)
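A small MATLAB sketch of the 4d-to-2d projection above, applied to one
point and then to a batch of points stored as columns (the data are
random placeholders):

    % Projecting 4-d data down to 2-d with A = [a1 a2]
    a1 = [0.71 0.71 0 0]';
    a2 = [0 0 0.71 0.71]';
    A  = [a1 a2];               % p x d, here 4 x 2

    x = [1 2 3 4]';             % one 4-d data point
    z = A' * x;                 % its 2-d coordinates in span(a1, a2)

    X = randn(4, 10);           % 10 points as columns (p x n convention)
    Z = A' * X;                 % 2 x 10: every point mapped from 4d to 2d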

19
Principal Components Analysis (PCA)
X = p × n data matrix; columns = p-dimensional data vectors
Let a = weight vector, also of dimension p; assume aᵀa = 1 (unit norm)
aᵀX = projection of each column x onto vector a
    = vector of distances of the projected x vectors along a
PCA: find the vector a such that var(aᵀX) is maximized,
i.e., find the linear projection with maximal variance
More generally: AᵀX = d × n data matrix with the x vectors projected to
d-dimensional space, where size(A) = p × d
PCA: find d orthogonal columns of A such that the variance in the
d-dimensional projected space is maximized, d < p
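A minimal MATLAB sketch of this definition for d = 1, using the
eigenvectors of the sample covariance on synthetic correlated data (an
SVD of the centered matrix would give the same direction):

    p = 2;  n = 500;
    X = chol([3 1.5; 1.5 1])' * randn(p, n);   % synthetic correlated data, p x n

    Xc = X - mean(X, 2);                       % center each variable (row)
    [V, D] = eig(Xc * Xc' / (n - 1));          % eigenvectors of sample covariance
    [~, imax] = max(diag(D));
    a = V(:, imax);                            % first principal direction, a'*a = 1

    var(a' * Xc)                               % maximal projection variance
    var([1 0] * Xc)                            % any other direction gives less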
20
PCA Example
[Figure: 2-d data (axes x1, x2) with the direction of the 1st principal
component vector, the highest-variance projection.]
21
PCA Example
[Figure: the same 2-d data (axes x1, x2) showing the direction of the
1st principal component vector (highest-variance projection) and the
orthogonal direction of the 2nd principal component vector.]
22
Notes on PCA
  • Basis representation and approximation error
  • Scree diagrams
  • Computational complexity of computing PCA
  • Equivalent to solving a set of linear equations, matrix
    inversion, singular value decomposition, etc.
  • Scales in general as O(np² + p³)
  • Many numerical tricks are possible, e.g., for sparse X
    matrices, or for finding only the first k eigenvectors, etc.
  • In MATLAB, can use eig or svd (also note the sparse/truncated
    versions eigs and svds; see the sketch below)
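A short MATLAB sketch of the truncated route mentioned in the last
bullet, on an invented sparse matrix:

    % Only the first k singular vectors of a large sparse matrix
    k = 10;
    X = sprandn(5000, 2000, 0.01);   % hypothetical sparse data matrix
    [U, S, V] = svds(X, k);          % truncated SVD: k largest singular values
    % (Exact PCA would center X first, which destroys sparsity; working
    %  with the uncentered matrix, as here, is a common compromise.)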

23
Examples
  • Images
  • Text

24
20 face images
Courtesy of Matthew Turk, Alex Pentland
25
First 16 Eigenimages
26
First 4 eigenimages
27
Reconstruction of First Image with 8 eigenimages
[Figure: reconstructed image and original image shown side by side.]
28
Reconstruction of first image with 8 eigenimages
Weights: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
(see the sketch below)
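A MATLAB sketch of the reconstruction just described. The eigenimages
E, the mean face meanImg, and the image img are random placeholders
here; in the real example they come from PCA on the face data set, and
whether the mean face is added back depends on how the eigenimages were
computed:

    numPixels = 64 * 64;  k = 8;
    E       = orth(randn(numPixels, k));   % k orthonormal eigenimages (placeholder)
    meanImg = randn(numPixels, 1);         % mean face image (placeholder)
    img     = randn(numPixels, 1);         % vectorized original image (placeholder)

    w     = E' * (img - meanImg);          % weights = projections onto eigenimages
    recon = meanImg + E * w;               % reconstruction = weighted sum + mean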
29
Reconstruction of 7th image with eigenimages
[Figure: reconstructed image and original image shown side by side.]
30
Reconstruction of 7th image with 8 eigenimages
Weights: -13.7, 12.9, 1.6, 4.4, 3.0, 0.9, 1.6, -6.3
Weights for Image 1: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
31
Reconstructing Image 6 with 16 eigenimages
32
PCA for text
  • [U, S, V] = svds(X)
  • In text analysis, this is called Latent Semantic Indexing (LSI)
  • → MATLAB demo (see the sketch below)
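A minimal MATLAB sketch of LSI on an invented term-document count
matrix (rows = terms, columns = documents):

    X = sparse([2 0 1; 1 3 0; 0 1 2]);   % tiny term-document matrix (placeholder)
    k = 2;
    [U, S, V] = svds(X, k);              % k latent "semantic" directions
    docCoords  = S * V';                 % k-d representation of each document
    termCoords = U;                      % k-d representation of each term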

33
PCA Example
  • Data: rainfall over India
  • → MATLAB movie

34
Principal components of rainfall over India 0-3
35
Principal components of rainfall over India 4-7
36
Issues
  • Is PCA suitable for discrete/count data?
  • Example:
  • doc1 = "cat cat dog"
  • doc2 = "cat dog dog"
  • X = [2 1; 1 2]  (rows = terms cat, dog; columns = doc1, doc2)
  • [U, S, V] = svd(X) → MATLAB (see the sketch below)
  • Next week:
  • Probabilistic Latent Semantic Indexing
  • Nonnegative Matrix Factorization
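For reference, a minimal MATLAB run of the SVD on this tiny count
matrix:

    X = [2 1; 1 2];        % rows = terms (cat, dog); columns = doc1, doc2
    [U, S, V] = svd(X);
    U, S, V                % display the factors
    % The factors contain negative entries, which have no natural reading
    % as word counts; this is one motivation for NMF and probabilistic
    % alternatives.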