Title: ICS 278: Data Mining, Lecture 5: Low-Dimensional Representations of High-Dimensional Data
1. ICS 278 Data Mining, Lecture 5: Low-Dimensional Representations of High-Dimensional Data
2. Today's lecture
- Extend project proposal deadline to Monday, 8am
- Questions?
- Outline of today's lecture
  - "orphan" slides from earlier lectures
  - Dimension reduction methods
    - Motivation
    - Variable selection methods
    - Linear projection techniques
    - Non-linear embedding methods
3. Notation Reminder
- n objects, each with p measurements
- x(i) = data vector for the ith object
- Data matrix X (n rows, p columns)
  - the entry in the ith row, jth column is the jth measurement of the ith object
- columns -> variables
- rows -> data points
- Can define distances/similarities
  - between rows (data vectors i)
  - between columns (variables j)
4. Distance
- Makes sense in the case where the different measurements are commensurate, i.e., each variable is measured in the same units.
- If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful (see the sketch below).
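A minimal numpy sketch (not from the slides; the measurements and numbers are made up) of how incommensurate units can dominate Euclidean distance, and how standardizing each variable first can help:

```python
import numpy as np

# Three hypothetical objects measured as (height in mm, weight in kg).
data = np.array([[1800.0, 70.0],
                 [1750.0, 95.0],
                 [1650.0, 60.0]])

# Raw Euclidean distance: the height differences (in mm) dominate the result.
print(np.linalg.norm(data[0] - data[1]))

# Standardize each variable (zero mean, unit variance) before computing distances.
z = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.linalg.norm(z[0] - z[1]))
```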
5. Dependence among Variables
- Covariance and correlation measure linear dependence (between variables, not between objects)
- Assume we have two variables or attributes X and Y and n objects taking on values x(1), ..., x(n) and y(1), ..., y(n). The sample covariance of X and Y is
  cov(X, Y) = (1/n) Σ_i (x(i) - mean(X)) (y(i) - mean(Y))
- The covariance is a measure of how X and Y vary together
  - it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
6. Correlation coefficient
- Covariance depends on the ranges of X and Y
- Standardize by dividing by the standard deviations
- The linear correlation coefficient is defined as
  ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
  (see the sketch below)
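A short numpy sketch (toy data, not from the lecture) computing the sample covariance and the linear correlation coefficient defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)                  # toy variable that depends linearly on x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # sample covariance (1/n convention)
rho = cov_xy / (x.std() * y.std())                  # correlation coefficient, in [-1, 1]

print(cov_xy, rho)
print(np.corrcoef(x, y)[0, 1])                      # numpy's built-in estimate, for comparison
```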
7. Sample Correlation Matrix
[Figure: sample correlation matrix (color scale from -1 to 1) for data on characteristics of Boston suburbs; variables include average rooms and median house value]
8. Mahalanobis distance (between objects)
d_MH(x, y) = [ (x - y)ᵀ Σ⁻¹ (x - y) ]^(1/2)
  - (x - y): vector difference in p-dimensional space
  - Σ⁻¹: inverse covariance matrix
  - the expression evaluates to a scalar distance
- It automatically accounts for the scaling of the coordinate axes
- It corrects for correlation between the different features
- Cost
  - the covariance matrix can be hard to determine accurately
  - the memory and time requirements grow quadratically, O(p²), rather than linearly with the number of features (see the numpy sketch below)
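A small numpy sketch (toy correlated data, not from the slides) of computing the Mahalanobis distance between two objects using the inverse sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])   # toy correlated data, rows = objects

S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix (p x p)

def mahalanobis(x, y, S_inv):
    d = x - y                                    # vector difference in p-dimensional space
    return np.sqrt(d @ S_inv @ d)                # evaluates to a scalar distance

print(mahalanobis(X[0], X[1], S_inv))
print(np.linalg.norm(X[0] - X[1]))               # Euclidean distance, for comparison
```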
9. Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic -> all dimensions have equal variance -> MH distance reduces to Euclidean distance
10. Example 2 of Mahalanobis distance
Covariance matrix is diagonal but non-isotropic -> dimensions do not have equal variance -> MH distance reduces to weighted Euclidean distance with weights = inverse variance
11. Example 2 of Mahalanobis distance
The two outer blue points have the same MH distance to the center blue point
12. Distances between Binary Vectors
[Table: 2 × 2 counts comparing two binary vectors, e.g., the number of variables where item j = 1 and item i = 0; see the sketch below]
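A brief numpy sketch (hypothetical vectors) of distance/similarity measures built from those 2 × 2 counts; the simple matching and Jaccard coefficients are standard choices, though the slide itself may have listed others:

```python
import numpy as np

i = np.array([1, 0, 1, 1, 0, 0, 1, 0])
j = np.array([1, 1, 1, 0, 0, 0, 1, 0])

n11 = np.sum((i == 1) & (j == 1))   # both items 1
n10 = np.sum((i == 1) & (j == 0))   # item i = 1, item j = 0
n01 = np.sum((i == 0) & (j == 1))   # item j = 1, item i = 0
n00 = np.sum((i == 0) & (j == 0))   # both items 0

matching = (n11 + n00) / len(i)     # simple matching coefficient
jaccard = n11 / (n11 + n10 + n01)   # Jaccard coefficient: ignores joint absences
print(matching, jaccard)
```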
13. Other distance metrics
- Categorical variables
  - number of matches divided by number of dimensions
- Distances between strings of different lengths
  - e.g., "Patrick J. Smyth" and "Padhraic Smyth"
  - edit distance (see the sketch below)
- Distances between images and waveforms
  - shift-invariant, scale-invariant
  - i.e., d(x, y) = min over a, b of d(ax + b, y)
- More generally, kernel methods
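A short Python sketch of the edit (Levenshtein) distance mentioned above, using the standard dynamic-programming recurrence:

```python
def edit_distance(s, t):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for a in range(m + 1):
        D[a][0] = a                       # cost of deleting all of s[:a]
    for b in range(n + 1):
        D[0][b] = b                       # cost of inserting all of t[:b]
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            sub = 0 if s[a - 1] == t[b - 1] else 1
            D[a][b] = min(D[a - 1][b] + 1,        # delete a character of s
                          D[a][b - 1] + 1,        # insert a character of t
                          D[a - 1][b - 1] + sub)  # match or substitute
    return D[m][n]

print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```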
14. Transforming Data
- Duality between the form of the data and the model
- Useful to bring data onto a natural scale
  - some variables are very skewed, e.g., income
- Common transforms: square root, reciprocal, logarithm, raising to a power
  - often very useful when dealing with skewed real-world data
- Logit transform: maps values in (0, 1) to the real line (see the sketch below)
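A minimal numpy sketch (toy skewed data, not from the lecture) of these transforms:

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10.0, sigma=1.0, size=1000)   # toy right-skewed variable

log_income = np.log(income)          # often roughly symmetric after the log transform
sqrt_income = np.sqrt(income)        # milder transform for moderate skew

p = np.clip(rng.uniform(size=1000), 1e-6, 1 - 1e-6)       # toy values in (0, 1)
logit_p = np.log(p / (1 - p))        # logit: maps (0, 1) onto the whole real line
```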
15. Data Quality
- Individual measurements
  - Random noise in individual measurements
    - variance (precision)
    - bias
  - Random data entry errors
  - Noise in label assignment (e.g., class labels in medical data sets)
  - Systematic errors
    - e.g., all ages > 99 recorded as 99
    - more individuals aged 20, 30, 40, etc. than expected
  - Missing information
    - missing at random: questions on a questionnaire that people randomly forget to fill in
    - missing systematically: questions that people don't want to answer; patients who are too ill for a certain test
16. Data Quality
- Collections of measurements
  - ideal case: a random sample from the population of interest
  - real case: often a biased sample of some sort
  - key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution
- Examples of non-randomly sampled data
  - medical study where the subjects are all students
  - geographic dependencies
  - temporal dependencies
  - stratified samples, e.g., 50% healthy, 50% ill
  - hidden systematic effects
    - e.g., market basket data from the weekend of a large sale in the store
    - e.g., Web log data during finals week
17. Classifier technology and the illusion of progress
(abstract for the workshop on State-of-the-Art in Supervised Classification, May 2006)
Professor David J. Hand, Imperial College, London

Supervised classification methods are
widely used in data mining. Highly sophisticated
methods have been developed, using the full power
of recent advances in computation. However, these
advances have largely taken place within the
context of a classical paradigm, in which
construction of the classification rule is based
on a design sample of data randomly sampled
from unknown but well defined distributions of
the classes. In this paper, I argue that this
paradigm fails to take account of other sources
of uncertainty in the classification problem, and
that these other sources lead to uncertainties
which often swamp those arising from the
classical ones of estimation and prediction.
Several examples of such sources are given,
including imprecision in the definitions of the
classes, sample selectivity bias, population
drift, and use of inappropriate optimisation
criteria when fitting the model. Furthermore, it
is argued, there are both theoretical arguments
and practical evidence supporting the assertion
that the marginal gains of increasing classifier
complexity can often be minimal. In brief, the
advances in classification technology are
typically much less than is often claimed.
18. Dimension Reduction Methods (reading: 3.6 to 3.8 in the text)
19. Dimension Reduction Methods
- Dimension reduction
  - from p-dimensional x to d-dimensional x', d < p
- Techniques
  - Variable selection
    - use an algorithm to find the individual variables in x that are relevant to the problem and discard the rest
    - e.g., stepwise logistic regression
  - Linear projections
    - project the data to a lower-dimensional space
    - e.g., principal components
  - Non-linear embedding
    - use a non-linear mapping to embed the data in a lower-dimensional space
    - e.g., multidimensional scaling
20. Dimension Reduction: why is it useful?
- In general, dimension reduction incurs a loss of information about x
- So why do this?
  - if the dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable
  - e.g., clustering example
    - 100-dimensional data, but the cluster structure is only present in 2 of the dimensions; the others are just noise
    - if the other 98 dimensions are just noise (relative to the cluster structure), then the clusters will be much easier to discover if we just focus on the 2d space
  - dimension reduction can also provide interpretation/insight
    - e.g., for 2d visualization purposes
- Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal
21. Variable Selection Methods
- p variables; we would like to use a smaller subset in our model
  - e.g., in classification, do kNN in d-space rather than p-space
  - e.g., for logistic regression, use d inputs rather than p
- Problem
  - the number of subsets of p variables is O(2^p)
  - exhaustive search is impossible except for very small p
  - typically the search problem is NP-hard
- Common solution (see the sketch after this list)
  - local systematic search (e.g., add/delete variables one at a time) to locally maximize a score function (i.e., hill-climbing)
  - e.g., add a variable, build a new model, generate a new score, etc.
  - can often work well, but can get trapped in local maxima/minima
  - can also be computationally intensive (depends on the model)
- Note: some techniques, such as decision tree predictors, automatically perform dimension reduction as part of the learning algorithm
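A sketch of the greedy forward-selection variant of this local search. The `score` argument is a placeholder for any model-based criterion (e.g., cross-validated accuracy of a kNN or logistic regression model restricted to the chosen columns); the function name and interface are illustrative, not from the lecture:

```python
def forward_selection(p, score, max_vars=None):
    """Greedy hill-climbing over variable subsets: add one variable at a time."""
    selected, remaining = [], list(range(p))
    best_score = float("-inf")
    while remaining and (max_vars is None or len(selected) < max_vars):
        # Try adding each remaining variable and keep the single best addition.
        trials = [(score(selected + [v]), v) for v in remaining]
        trial_score, best_v = max(trials)
        if trial_score <= best_score:
            break                      # no single addition improves the score: local maximum
        best_score = trial_score
        selected.append(best_v)
        remaining.remove(best_v)
    return selected, best_score
```

A backward (delete-one-variable) version follows the same pattern; both can stop at a local maximum of the score rather than the globally best subset.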
22. Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p
Assume aᵀa = 1 (i.e., unit norm)
aᵀx = Σ_j a_j x_j
    = projection of x onto vector a; gives the distance of the projected x along a
e.g.,  aᵀ = [1 0]         -> projection along the 1st dimension
       aᵀ = [0 1]         -> projection along the 2nd dimension
       aᵀ = [0.71, 0.71]  -> projection along the diagonal
23. Example of projection from 2d to 1d
[Figure: 2d scatterplot with axes x1 and x2, showing the direction of the weight vector a]
24. Projections to more than 1 dimension
- Multidimensional projections
  - e.g., x is 4-dimensional
  - a1ᵀ = [0.71 0.71 0 0]
  - a2ᵀ = [0 0 0.71 0.71]
  - Aᵀx -> coordinates of x in the 2d space spanned by the columns of A
    -> a linear transformation from 4d space to 2d space
  - where A = [a1 a2] (see the sketch below)
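A numpy sketch of these projections, using the same a1 and a2 as above (the data values are made up):

```python
import numpy as np

a1 = np.array([0.71, 0.71, 0.0, 0.0])
a2 = np.array([0.0, 0.0, 0.71, 0.71])
A = np.column_stack([a1, a2])          # p x d matrix, here 4 x 2

x = np.array([1.0, 2.0, 3.0, 4.0])     # a single 4-d data vector
print(A.T @ x)                         # its coordinates in the 2-d projected space

X = np.random.default_rng(3).normal(size=(4, 100))   # p x n data matrix (columns = data vectors)
Z = A.T @ X                            # d x n matrix of projected coordinates
```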
25. Principal Components Analysis (PCA)
X = p × n data matrix; columns = p-dimensional data vectors
Let a = weight vector, also of dimension p
Assume aᵀa = 1 (i.e., unit norm)
aᵀX = projection of each column x onto vector a
    = vector of the distances of the projected x vectors along a
PCA: find the vector a such that var(aᵀX) is maximized
  i.e., find the linear projection with maximal variance
More generally, AᵀX = d × n data matrix with the x vectors projected to d-dimensional space, where size(A) = p × d
PCA: find d orthogonal columns of A such that the variance in the d-dimensional projected space is maximized, d < p
26. PCA Example
[Figure: 2d scatterplot (axes x1, x2) showing the direction of the 1st principal component vector (highest variance projection)]
27. PCA Example
[Figure: the same 2d scatterplot (axes x1, x2), showing the direction of the 1st principal component vector (highest variance projection) and the direction of the 2nd principal component vector]
28. How do we compute the principal components?
- See class notes
- See also page 78 in the text
29. Notes on PCA
- Basis representation and approximation error
- Scree diagrams
- Computational complexity of computing PCA
  - equivalent to solving a set of linear equations, matrix inversion, singular value decomposition, etc.
  - scales in general as O(np² + p³)
  - many numerical tricks are possible, e.g., for sparse X matrices, or for finding only the first k eigenvectors
  - in MATLAB one can use eig.m or svd.m (also note the sparse versions); a numpy sketch follows below
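A numpy sketch (toy data) of PCA via the singular value decomposition of the centered data matrix, including the explained-variance fractions one would plot in a scree diagram; note that here rows are objects (n × p), the transpose of the slide's p × n convention:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy n x p data with correlated columns

Xc = X - X.mean(axis=0)                      # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                              # rows = principal component directions (unit norm)
variances = s**2 / (X.shape[0] - 1)          # variance captured by each component
scree = variances / variances.sum()          # fractions to plot in a scree diagram

Z = Xc @ Vt[:2].T                            # project the data onto the first 2 components
print(scree)
```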
30. More notes on PCA
- Links between PCA and the multivariate Gaussian density
- Caveat: PCA can destroy information about differences between groups, for clustering or classification
- PCA for other data types
  - images, e.g., eigenfaces
  - text, e.g., latent semantic indexing (LSI)
31. Basis images (eigenimages) of faces
32. 20 face images
33. First 16 Eigenimages
34. First 4 eigenimages
35. Reconstruction of First Image with 8 eigenimages
[Figure: reconstructed image shown next to the original image]
36. Reconstruction of first image with 8 eigenimages
Weights: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
37. Reconstruction of 7th image with eigenimages
[Figure: reconstructed image shown next to the original image]
38. Reconstruction of 7th image with 8 eigenimages
Weights: -13.7, 12.9, 1.6, 4.4, 3.0, 0.9, 1.6, -6.3
Weights for Image 1: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left (see the sketch below)
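A minimal sketch of this reconstruction step; `mean_img`, `eigenimgs`, and `weights` are assumed to come from a PCA of the vectorized face images (the names are illustrative, not from the lecture, and whether a mean image is added back depends on how the data were centered):

```python
import numpy as np

def reconstruct(mean_img, eigenimgs, weights):
    """mean_img: h x w, eigenimgs: k x h x w, weights: length-k vector."""
    # Weighted sum of the k eigenimages, added to the mean image
    # (pass a zero array as mean_img if the images were not centered).
    return mean_img + np.tensordot(weights, eigenimgs, axes=1)

# e.g., the weights for image 1 from the slide:
# [-14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6]
```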
39. Reconstructing Image 6 with 16 eigenimages
40. Multidimensional Scaling (MDS)
- Say we have data in the form of an N × N matrix of dissimilarities
  - 0s on the diagonal
  - symmetric
- We could either be given data in this form, or create such a dissimilarity matrix from our data vectors
- Examples
  - perceptual dissimilarity of N objects in cognitive science experiments
  - string-edit distance between N protein sequences
- MDS: find k-dimensional coordinates for each of the N objects such that the Euclidean distances in the embedded space match the set of dissimilarities as closely as possible
41. Multidimensional Scaling (MDS)
- MDS score function ("stress"):
  S = Σ_{i,j} [ d(i, j) - δ(i, j) ]²
  where d(i, j) is the Euclidean distance between points i and j in the embedded k-dimensional space and δ(i, j) is the original dissimilarity
- N points embedded in k dimensions -> Nk locations or parameters
- To find the Nk locations: solve an optimization problem -> minimize the S function
- Often used for visualization, e.g., k = 2, 3
42. MDS Optimization
- Optimization problem
  - S is a function of Nk parameters
  - find the set of N k-dimensional positions that minimize S
  - note: 3 parameters are redundant: location (2) and rotation (1)
- If the original dissimilarities are Euclidean
  - -> linear algebra solution (equivalent to principal components)
- Non-Euclidean (more typical; see the sketch below)
  - local iterative hill-climbing, e.g., move each point to decrease S, repeat
  - non-trivial optimization; can have local minima, etc.
  - initialization: either random or heuristic (e.g., by PCA)
  - complexity is O(N² k) per iteration (iteration = move all points locally)
- See Faloutsos and Lin (1995) for FastMap, an O(Nk) approximation for large N
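A small sketch (not the lecture's algorithm) of metric MDS that minimizes the stress S from the previous slide with scipy's general-purpose optimizer rather than hand-coded point moves:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def mds(D, k=2, seed=0):
    """D: N x N symmetric dissimilarity matrix; returns N x k embedded coordinates."""
    N = D.shape[0]

    def stress(y_flat):
        Y = y_flat.reshape(N, k)
        d = squareform(pdist(Y))            # Euclidean distances in the embedded space
        return np.sum((D - d) ** 2)         # sum of squared discrepancies (the stress S)

    y0 = np.random.default_rng(seed).normal(size=N * k)   # random initialization
    res = minimize(stress, y0, method="L-BFGS-B")
    return res.x.reshape(N, k)
```

The embedding is only determined up to translation, rotation, and reflection, which matches the redundant-parameter note above.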
43. MDS example: input distance data
45. Result of MDS
46. MDS for protein sequences
[Figure: similarity matrix (note the cluster structure) and the resulting MDS embedding]
226 protein sequences of the Globin family (from Klock and Buhmann, 1997).
47. MDS from human judgements of similarity
48. MDS Example: data
49. MDS: 2d embedding of face images
51. Other embedding techniques
- Many other algorithms for non-linear embedding
- Some of the better-known examples
  - Self-organizing maps (Kohonen)
    - neural-network-inspired algorithm for 2d embedding
  - ISOMAP, local linear embedding
    - find low-dimensional coordinates that preserve local distances
    - ignore global distances
  - ...
52. Example of Local Linear Embedding (LLE)
(Roweis and Saul, Science, 2000)
Note how points that are far apart on the 3d manifold (e.g., red and blue) in manifold distance would be mapped as being close together by MDS or PCA, but are kept far apart by LLE. LLE emphasizes local relationships.
53. LLE Algorithm
- N points in dimension p; wish to reduce to dimension d, d < p
- Step 1
  - select the K nearest neighbors for each point in the training data
  - represent each point X as a weighted linear combination of its K neighbor points
  - find the best K weights for each of the X vectors (least-squares fitting)
- Step 2
  - fix the weights from step 1
  - for each p-dimensional vector X, find a d-dimensional vector Y that is closest to its reconstructed approximation from its neighbors' d-dimensional vectors and the fixed weights
  - reduces to another linear algebra/eigenvalue problem, O(N³) complexity (see the sketch below)
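A compact numpy/scipy sketch of the two steps (a simplified version; a production implementation such as scikit-learn's LocallyLinearEmbedding handles more edge cases):

```python
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, K=10, d=2, reg=1e-3):
    """X: N x p data matrix; returns N x d embedded coordinates."""
    N = X.shape[0]
    dist = cdist(X, X)
    neighbors = np.argsort(dist, axis=1)[:, 1:K + 1]      # K nearest neighbors (excluding self)

    # Step 1: least-squares reconstruction weights for each point from its neighbors.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                        # K x p, neighbors centered on x_i
        C = Z @ Z.T                                       # local Gram matrix (K x K)
        C += reg * np.trace(C) * np.eye(K)                # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(K))
        W[i, neighbors[i]] = w / w.sum()                  # weights constrained to sum to one

    # Step 2: fix the weights and find d-dim coordinates Y minimizing ||Y - W Y||^2.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1]                            # skip the constant bottom eigenvector
```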
55. LLE applied to text data
56. LLE applied to a set of face images
57. Local Linear Embedding example
58. ISOMAP (Tenenbaum, de Silva, and Langford, Science, 2000)
- Similar to LLE in concept: preserves local distances
- The computational strategy is different: it measures distances in the original space via geodesic paths (distance along the manifold)
- The algorithm involves finding shortest paths between points and then embedding (see the sketch below)
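A sketch of this idea (a simplified version, assuming the k-nearest-neighbor graph is connected): build the neighborhood graph, compute shortest-path (geodesic) distances, then embed them with classical MDS:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=10, d=2):
    """X: N x p data matrix; returns N x d embedded coordinates."""
    N = X.shape[0]
    dist = cdist(X, X)

    # Keep only edges from each point to its K nearest neighbors (inf = no edge).
    G = np.full((N, N), np.inf)
    nn = np.argsort(dist, axis=1)[:, 1:K + 1]
    for i in range(N):
        G[i, nn[i]] = dist[i, nn[i]]

    # Geodesic distances along the graph; assumes the graph is connected,
    # otherwise some entries remain infinite.
    geo = shortest_path(G, directed=False)

    # Classical MDS on the geodesic distance matrix.
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (geo ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]          # keep the d largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
```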
59. Examples of ISOMAP embeddings in 2d
60. ISOMAP morphing examples
61. Summary on Dimension Reduction
- Can be used for defining a new set of (lower-dimensional) variables for modeling, or for visualization/insight
- 3 general approaches
  - variable selection (only select relevant variables)
  - linear projections (e.g., PCA)
  - non-linear projections (e.g., MDS, LLE, ISOMAP)
- Can be used with text, images, etc., by representing such data as very high-dimensional vectors
- MATLAB implementations of all of these techniques are available on the Web
- These techniques can be useful, but like any high-powered tool they are not a solution to everything
  - real-world data sets often do not produce the types of elegant 2d embeddings that one often sees in research papers on these topics