Title: ICS 278: Data Mining, Lecture 5: Low-Dimensional Representations of High-Dimensional Data
1. ICS 278 Data Mining, Lecture 5: Low-Dimensional Representations of High-Dimensional Data
2. Today's lecture
- Extend project proposal deadline to Monday, 8am
- Questions?
- Outline of today's lecture
  - "orphan" slides from earlier lectures
  - Dimension reduction methods
    - Motivation
    - Variable selection methods
    - Linear projection techniques
    - Non-linear embedding methods
3. Notation Reminder
- n objects, each with p measurements
- x(i) = data vector for the ith object
- Data matrix X (n rows, p columns)
  - the entry in the ith row, jth column is the jth measurement of the ith object
- columns -> variables
- rows -> data points
- Can define distances/similarities
  - between rows (data vectors i)
  - between columns (variables j)
4. Distance
- Makes sense in the case where the different measurements are commensurate, i.e., each variable is measured in the same units.
- If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful (see the sketch below).
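A minimal numpy sketch (not from the slides; the measurements and numbers are made up) of how incommensurate units can dominate Euclidean distance, and how standardizing each variable first can help:

```python
import numpy as np

# Three hypothetical objects measured as (height in mm, weight in kg).
data = np.array([[1800.0, 70.0],
                 [1750.0, 95.0],
                 [1650.0, 60.0]])

# Raw Euclidean distance: the height differences (in mm) dominate the result.
print(np.linalg.norm(data[0] - data[1]))

# Standardize each variable (zero mean, unit variance) before computing distances.
z = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.linalg.norm(z[0] - z[1]))
```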
5. Dependence among Variables
- Covariance and correlation measure linear dependence (between variables, not between objects)
- Assume we have two variables or attributes X and Y and n objects taking on values x(1), ..., x(n) and y(1), ..., y(n). The sample covariance of X and Y is
  cov(X, Y) = (1/n) Σ_i (x(i) - mean(X)) (y(i) - mean(Y))
- The covariance is a measure of how X and Y vary together
  - it will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y
6. Correlation coefficient
- Covariance depends on the ranges of X and Y
- Standardize by dividing by the standard deviations
- The linear correlation coefficient is defined as
  ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
  (see the sketch below)
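A short numpy sketch (toy data, not from the lecture) computing the sample covariance and the linear correlation coefficient defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)                  # toy variable that depends linearly on x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # sample covariance (1/n convention)
rho = cov_xy / (x.std() * y.std())                  # correlation coefficient, in [-1, 1]

print(cov_xy, rho)
print(np.corrcoef(x, y)[0, 1])                      # numpy's built-in estimate, for comparison
```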
7. Sample Correlation Matrix
[Figure: sample correlation matrix (color scale from -1 to 1) for data on characteristics of Boston suburbs; variables include average rooms and median house value]
8. Mahalanobis distance (between objects)
d_MH(x, y) = [ (x - y)ᵀ Σ⁻¹ (x - y) ]^(1/2)
  - (x - y): vector difference in p-dimensional space
  - Σ⁻¹: inverse covariance matrix
  - the expression evaluates to a scalar distance
- It automatically accounts for the scaling of the coordinate axes
- It corrects for correlation between the different features
- Cost
  - the covariance matrix can be hard to determine accurately
  - the memory and time requirements grow quadratically, O(p²), rather than linearly with the number of features (see the numpy sketch below)
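A small numpy sketch (toy correlated data, not from the slides) of computing the Mahalanobis distance between two objects using the inverse sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])   # toy correlated data, rows = objects

S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix (p x p)

def mahalanobis(x, y, S_inv):
    d = x - y                                    # vector difference in p-dimensional space
    return np.sqrt(d @ S_inv @ d)                # evaluates to a scalar distance

print(mahalanobis(X[0], X[1], S_inv))
print(np.linalg.norm(X[0] - X[1]))               # Euclidean distance, for comparison
```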
9. Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic -> all dimensions have equal variance -> MH distance reduces to Euclidean distance
10. Example 2 of Mahalanobis distance
Covariance matrix is diagonal but non-isotropic -> dimensions do not have equal variance -> MH distance reduces to weighted Euclidean distance with weights = inverse variance
11. Example 2 of Mahalanobis distance
The two outer blue points have the same MH distance to the center blue point
12. Distances between Binary Vectors
[Table: 2 × 2 counts comparing two binary vectors, e.g., the number of variables where item j = 1 and item i = 0; see the sketch below]
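A brief numpy sketch (hypothetical vectors) of distance/similarity measures built from those 2 × 2 counts; the simple matching and Jaccard coefficients are standard choices, though the slide itself may have listed others:

```python
import numpy as np

i = np.array([1, 0, 1, 1, 0, 0, 1, 0])
j = np.array([1, 1, 1, 0, 0, 0, 1, 0])

n11 = np.sum((i == 1) & (j == 1))   # both items 1
n10 = np.sum((i == 1) & (j == 0))   # item i = 1, item j = 0
n01 = np.sum((i == 0) & (j == 1))   # item j = 1, item i = 0
n00 = np.sum((i == 0) & (j == 0))   # both items 0

matching = (n11 + n00) / len(i)     # simple matching coefficient
jaccard = n11 / (n11 + n10 + n01)   # Jaccard coefficient: ignores joint absences
print(matching, jaccard)
```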
13. Other distance metrics
- Categorical variables
  - number of matches divided by number of dimensions
- Distances between strings of different lengths
  - e.g., "Patrick J. Smyth" and "Padhraic Smyth"
  - edit distance (see the sketch below)
- Distances between images and waveforms
  - shift-invariant, scale-invariant
  - i.e., d(x, y) = min over a, b of d(ax + b, y)
- More generally, kernel methods
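A short Python sketch of the edit (Levenshtein) distance mentioned above, using the standard dynamic-programming recurrence:

```python
def edit_distance(s, t):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for a in range(m + 1):
        D[a][0] = a                       # cost of deleting all of s[:a]
    for b in range(n + 1):
        D[0][b] = b                       # cost of inserting all of t[:b]
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            sub = 0 if s[a - 1] == t[b - 1] else 1
            D[a][b] = min(D[a - 1][b] + 1,        # delete a character of s
                          D[a][b - 1] + 1,        # insert a character of t
                          D[a - 1][b - 1] + sub)  # match or substitute
    return D[m][n]

print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```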
14. Transforming Data
- Duality between the form of the data and the model
- Useful to bring data onto a natural scale
  - some variables are very skewed, e.g., income
- Common transforms: square root, reciprocal, logarithm, raising to a power
  - often very useful when dealing with skewed real-world data
- Logit transform: maps values in (0, 1) to the real line (see the sketch below)
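A minimal numpy sketch (toy skewed data, not from the lecture) of these transforms:

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10.0, sigma=1.0, size=1000)   # toy right-skewed variable

log_income = np.log(income)          # often roughly symmetric after the log transform
sqrt_income = np.sqrt(income)        # milder transform for moderate skew

p = np.clip(rng.uniform(size=1000), 1e-6, 1 - 1e-6)       # toy values in (0, 1)
logit_p = np.log(p / (1 - p))        # logit: maps (0, 1) onto the whole real line
```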
15. Data Quality
- Individual measurements
  - Random noise in individual measurements
    - variance (precision)
    - bias
  - Random data entry errors
  - Noise in label assignment (e.g., class labels in medical data sets)
  - Systematic errors
    - e.g., all ages > 99 recorded as 99
    - more individuals aged 20, 30, 40, etc. than expected
  - Missing information
    - missing at random: questions on a questionnaire that people randomly forget to fill in
    - missing systematically: questions that people don't want to answer; patients who are too ill for a certain test
16. Data Quality
- Collections of measurements
  - ideal case: a random sample from the population of interest
  - real case: often a biased sample of some sort
  - key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution
- Examples of non-randomly sampled data
  - medical study where the subjects are all students
  - geographic dependencies
  - temporal dependencies
  - stratified samples, e.g., 50% healthy, 50% ill
  - hidden systematic effects
    - e.g., market basket data from the weekend of a large sale in the store
    - e.g., Web log data during finals week
17. Classifier technology and the illusion of progress
(abstract for the workshop on State-of-the-Art in Supervised Classification, May 2006)
Professor David J. Hand, Imperial College, London

Supervised classification methods are
widely used in data mining. Highly sophisticated
methods have been developed, using the full power
of recent advances in computation. However, these
advances have largely taken place within the
context of a classical paradigm, in which
construction of the classification rule is based
on a design sample of data randomly sampled
from unknown but well defined distributions of
the classes. In this paper, I argue that this
paradigm fails to take account of other sources
of uncertainty in the classification problem, and
that these other sources lead to uncertainties
which often swamp those arising from the
classical ones of estimation and prediction.
Several examples of such sources are given,
including imprecision in the definitions of the
classes, sample selectivity bias, population
drift, and use of inappropriate optimisation
criteria when fitting the model. Furthermore, it
is argued, there are both theoretical arguments
and practical evidence supporting the assertion
that the marginal gains of increasing classifier
complexity can often be minimal. In brief, the
advances in classification technology are
typically much less than is often claimed.
18. Dimension Reduction Methods (reading: 3.6 to 3.8 in the text)
19. Dimension Reduction Methods
- Dimension reduction
  - from p-dimensional x to d-dimensional x', d < p
- Techniques
  - Variable selection
    - use an algorithm to find the individual variables in x that are relevant to the problem and discard the rest
    - e.g., stepwise logistic regression
  - Linear projections
    - project the data to a lower-dimensional space
    - e.g., principal components
  - Non-linear embedding
    - use a non-linear mapping to embed the data in a lower-dimensional space
    - e.g., multidimensional scaling
20. Dimension Reduction: why is it useful?
- In general, dimension reduction incurs a loss of information about x
- So why do this?
  - if the dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable
  - e.g., clustering example
    - 100-dimensional data, but the cluster structure is only present in 2 of the dimensions; the others are just noise
    - if the other 98 dimensions are just noise (relative to the cluster structure), then the clusters will be much easier to discover if we just focus on the 2d space
  - dimension reduction can also provide interpretation/insight
    - e.g., for 2d visualization purposes
- Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal
21. Variable Selection Methods
- p variables; we would like to use a smaller subset in our model
  - e.g., in classification, do kNN in d-space rather than p-space
  - e.g., for logistic regression, use d inputs rather than p
- Problem
  - the number of subsets of p variables is O(2^p)
  - exhaustive search is impossible except for very small p
  - typically the search problem is NP-hard
- Common solution (see the sketch after this list)
  - local systematic search (e.g., add/delete variables one at a time) to locally maximize a score function (i.e., hill-climbing)
  - e.g., add a variable, build a new model, generate a new score, etc.
  - can often work well, but can get trapped in local maxima/minima
  - can also be computationally intensive (depends on the model)
- Note: some techniques, such as decision tree predictors, automatically perform dimension reduction as part of the learning algorithm
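A sketch of the greedy forward-selection variant of this local search. The `score` argument is a placeholder for any model-based criterion (e.g., cross-validated accuracy of a kNN or logistic regression model restricted to the chosen columns); the function name and interface are illustrative, not from the lecture:

```python
def forward_selection(p, score, max_vars=None):
    """Greedy hill-climbing over variable subsets: add one variable at a time."""
    selected, remaining = [], list(range(p))
    best_score = float("-inf")
    while remaining and (max_vars is None or len(selected) < max_vars):
        # Try adding each remaining variable and keep the single best addition.
        trials = [(score(selected + [v]), v) for v in remaining]
        trial_score, best_v = max(trials)
        if trial_score <= best_score:
            break                      # no single addition improves the score: local maximum
        best_score = trial_score
        selected.append(best_v)
        remaining.remove(best_v)
    return selected, best_score
```

A backward (delete-one-variable) version follows the same pattern; both can stop at a local maximum of the score rather than the globally best subset.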
22. Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p
Assume aᵀa = 1 (i.e., unit norm)
aᵀx = Σ_j a_j x_j
    = projection of x onto vector a; gives the distance of the projected x along a
e.g.,  aᵀ = [1 0]         -> projection along the 1st dimension
       aᵀ = [0 1]         -> projection along the 2nd dimension
       aᵀ = [0.71, 0.71]  -> projection along the diagonal
23. Example of projection from 2d to 1d
[Figure: 2d scatterplot with axes x1 and x2, showing the direction of the weight vector a]
24. Projections to more than 1 dimension
- Multidimensional projections
  - e.g., x is 4-dimensional
  - a1ᵀ = [0.71 0.71 0 0]
  - a2ᵀ = [0 0 0.71 0.71]
  - Aᵀx -> coordinates of x in the 2d space spanned by the columns of A
    -> a linear transformation from 4d space to 2d space
  - where A = [a1 a2] (see the sketch below)
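A numpy sketch of these projections, using the same a1 and a2 as above (the data values are made up):

```python
import numpy as np

a1 = np.array([0.71, 0.71, 0.0, 0.0])
a2 = np.array([0.0, 0.0, 0.71, 0.71])
A = np.column_stack([a1, a2])          # p x d matrix, here 4 x 2

x = np.array([1.0, 2.0, 3.0, 4.0])     # a single 4-d data vector
print(A.T @ x)                         # its coordinates in the 2-d projected space

X = np.random.default_rng(3).normal(size=(4, 100))   # p x n data matrix (columns = data vectors)
Z = A.T @ X                            # d x n matrix of projected coordinates
```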
25. Principal Components Analysis (PCA)
X = p × n data matrix; columns = p-dimensional data vectors
Let a = weight vector, also of dimension p
Assume aᵀa = 1 (i.e., unit norm)
aᵀX = projection of each column x onto vector a
    = vector of the distances of the projected x vectors along a
PCA: find the vector a such that var(aᵀX) is maximized
  i.e., find the linear projection with maximal variance
More generally, AᵀX = d × n data matrix with the x vectors projected to d-dimensional space, where size(A) = p × d
PCA: find d orthogonal columns of A such that the variance in the d-dimensional projected space is maximized, d < p
26. PCA Example
[Figure: 2d scatterplot (axes x1, x2) showing the direction of the 1st principal component vector (highest variance projection)]
27. PCA Example
[Figure: the same 2d scatterplot (axes x1, x2), showing the direction of the 1st principal component vector (highest variance projection) and the direction of the 2nd principal component vector]
28. How do we compute the principal components?
- See class notes
- See also page 78 in the text
29. Notes on PCA
- Basis representation and approximation error
- Scree diagrams
- Computational complexity of computing PCA
  - equivalent to solving a set of linear equations, matrix inversion, singular value decomposition, etc.
  - scales in general as O(np² + p³)
  - many numerical tricks are possible, e.g., for sparse X matrices, or for finding only the first k eigenvectors
  - in MATLAB one can use eig.m or svd.m (also note the sparse versions); a numpy sketch follows below
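A numpy sketch (toy data) of PCA via the singular value decomposition of the centered data matrix, including the explained-variance fractions one would plot in a scree diagram; note that here rows are objects (n × p), the transpose of the slide's p × n convention:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy n x p data with correlated columns

Xc = X - X.mean(axis=0)                      # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                              # rows = principal component directions (unit norm)
variances = s**2 / (X.shape[0] - 1)          # variance captured by each component
scree = variances / variances.sum()          # fractions to plot in a scree diagram

Z = Xc @ Vt[:2].T                            # project the data onto the first 2 components
print(scree)
```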
30. More notes on PCA
- Links between PCA and the multivariate Gaussian density
- Caveat: PCA can destroy information about differences between groups, for clustering or classification
- PCA for other data types
  - images, e.g., eigenfaces
  - text, e.g., latent semantic indexing (LSI)
31. Basis images (eigenimages) of faces
32. 20 face images
33. First 16 Eigenimages
34. First 4 eigenimages
35. Reconstruction of First Image with 8 eigenimages
[Figure: reconstructed image shown next to the original image]
36. Reconstruction of first image with 8 eigenimages
Weights: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
37. Reconstruction of 7th image with eigenimages
[Figure: reconstructed image shown next to the original image]
38. Reconstruction of 7th image with 8 eigenimages
Weights: -13.7, 12.9, 1.6, 4.4, 3.0, 0.9, 1.6, -6.3
Weights for Image 1: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left (see the sketch below)
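A minimal sketch of this reconstruction step; `mean_img`, `eigenimgs`, and `weights` are assumed to come from a PCA of the vectorized face images (the names are illustrative, not from the lecture, and whether a mean image is added back depends on how the data were centered):

```python
import numpy as np

def reconstruct(mean_img, eigenimgs, weights):
    """mean_img: h x w, eigenimgs: k x h x w, weights: length-k vector."""
    # Weighted sum of the k eigenimages, added to the mean image
    # (pass a zero array as mean_img if the images were not centered).
    return mean_img + np.tensordot(weights, eigenimgs, axes=1)

# e.g., the weights for image 1 from the slide:
# [-14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6]
```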
39. Reconstructing Image 6 with 16 eigenimages
40. Multidimensional Scaling (MDS)
- Say we have data in the form of an N × N matrix of dissimilarities
  - 0s on the diagonal
  - symmetric
- We could either be given data in this form, or create such a dissimilarity matrix from our data vectors
- Examples
  - perceptual dissimilarity of N objects in cognitive science experiments
  - string-edit distance between N protein sequences
- MDS: find k-dimensional coordinates for each of the N objects such that the Euclidean distances in the embedded space match the set of dissimilarities as closely as possible
41. Multidimensional Scaling (MDS)
- MDS score function ("stress"):
  S = Σ_{i,j} [ d(i, j) - δ(i, j) ]²
  where d(i, j) is the Euclidean distance between points i and j in the embedded k-dimensional space and δ(i, j) is the original dissimilarity
- N points embedded in k dimensions -> Nk locations or parameters
- To find the Nk locations: solve an optimization problem -> minimize the S function
- Often used for visualization, e.g., k = 2, 3
42. MDS Optimization
- Optimization problem
  - S is a function of Nk parameters
  - find the set of N k-dimensional positions that minimize S
  - note: 3 parameters are redundant: location (2) and rotation (1)
- If the original dissimilarities are Euclidean
  - -> linear algebra solution (equivalent to principal components)
- Non-Euclidean (more typical; see the sketch below)
  - local iterative hill-climbing, e.g., move each point to decrease S, repeat
  - non-trivial optimization; can have local minima, etc.
  - initialization: either random or heuristic (e.g., by PCA)
  - complexity is O(N² k) per iteration (iteration = move all points locally)
- See Faloutsos and Lin (1995) for FastMap, an O(Nk) approximation for large N
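A small sketch (not the lecture's algorithm) of metric MDS that minimizes the stress S from the previous slide with scipy's general-purpose optimizer rather than hand-coded point moves:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def mds(D, k=2, seed=0):
    """D: N x N symmetric dissimilarity matrix; returns N x k embedded coordinates."""
    N = D.shape[0]

    def stress(y_flat):
        Y = y_flat.reshape(N, k)
        d = squareform(pdist(Y))            # Euclidean distances in the embedded space
        return np.sum((D - d) ** 2)         # sum of squared discrepancies (the stress S)

    y0 = np.random.default_rng(seed).normal(size=N * k)   # random initialization
    res = minimize(stress, y0, method="L-BFGS-B")
    return res.x.reshape(N, k)
```

The embedding is only determined up to translation, rotation, and reflection, which matches the redundant-parameter note above.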
43. MDS example: input distance data
45. Result of MDS
46. MDS for protein sequences
[Figure: similarity matrix (note the cluster structure) and the resulting MDS embedding]
226 protein sequences of the Globin family (from Klock and Buhmann, 1997).
47. MDS from human judgements of similarity
48. MDS Example: data
49. MDS: 2d embedding of face images
51. Other embedding techniques
- Many other algorithms for non-linear embedding
- Some of the better-known examples
  - Self-organizing maps (Kohonen)
    - neural-network-inspired algorithm for 2d embedding
  - ISOMAP, local linear embedding
    - find low-dimensional coordinates that preserve local distances
    - ignore global distances
  - ...
52. Example of Local Linear Embedding (LLE)
(Roweis and Saul, Science, 2000)
Note how points that are far apart on the 3d manifold (e.g., red and blue) in manifold distance would be mapped as being close together by MDS or PCA, but are kept far apart by LLE. LLE emphasizes local relationships.
53. LLE Algorithm
- N points in dimension p; wish to reduce to dimension d, d < p
- Step 1
  - select the K nearest neighbors for each point in the training data
  - represent each point X as a weighted linear combination of its K neighbor points
  - find the best K weights for each of the X vectors (least-squares fitting)
- Step 2
  - fix the weights from step 1
  - for each p-dimensional vector X, find a d-dimensional vector Y that is closest to its reconstructed approximation from its neighbors' d-dimensional vectors and the fixed weights
  - reduces to another linear algebra/eigenvalue problem, O(N³) complexity (see the sketch below)
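A compact numpy/scipy sketch of the two steps (a simplified version; a production implementation such as scikit-learn's LocallyLinearEmbedding handles more edge cases):

```python
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, K=10, d=2, reg=1e-3):
    """X: N x p data matrix; returns N x d embedded coordinates."""
    N = X.shape[0]
    dist = cdist(X, X)
    neighbors = np.argsort(dist, axis=1)[:, 1:K + 1]      # K nearest neighbors (excluding self)

    # Step 1: least-squares reconstruction weights for each point from its neighbors.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                        # K x p, neighbors centered on x_i
        C = Z @ Z.T                                       # local Gram matrix (K x K)
        C += reg * np.trace(C) * np.eye(K)                # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(K))
        W[i, neighbors[i]] = w / w.sum()                  # weights constrained to sum to one

    # Step 2: fix the weights and find d-dim coordinates Y minimizing ||Y - W Y||^2.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1]                            # skip the constant bottom eigenvector
```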
55. LLE applied to text data
56. LLE applied to a set of face images
57. Local Linear Embedding example
58. ISOMAP (Tenenbaum, de Silva, and Langford, Science, 2000)
- Similar to LLE in concept: preserves local distances
- The computational strategy is different: it measures distances in the original space via geodesic paths (distance along the manifold)
- The algorithm involves finding shortest paths between points and then embedding (see the sketch below)
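A sketch of this idea (a simplified version, assuming the k-nearest-neighbor graph is connected): build the neighborhood graph, compute shortest-path (geodesic) distances, then embed them with classical MDS:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=10, d=2):
    """X: N x p data matrix; returns N x d embedded coordinates."""
    N = X.shape[0]
    dist = cdist(X, X)

    # Keep only edges from each point to its K nearest neighbors (inf = no edge).
    G = np.full((N, N), np.inf)
    nn = np.argsort(dist, axis=1)[:, 1:K + 1]
    for i in range(N):
        G[i, nn[i]] = dist[i, nn[i]]

    # Geodesic distances along the graph; assumes the graph is connected,
    # otherwise some entries remain infinite.
    geo = shortest_path(G, directed=False)

    # Classical MDS on the geodesic distance matrix.
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (geo ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]          # keep the d largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
```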
59. Examples of ISOMAP embeddings in 2d
60. ISOMAP morphing examples
61. Summary on Dimension Reduction
- Can be used for defining a new set of (lower-dimensional) variables for modeling, or for visualization/insight
- 3 general approaches
  - variable selection (only select relevant variables)
  - linear projections (e.g., PCA)
  - non-linear projections (e.g., MDS, LLE, ISOMAP)
- Can be used with text, images, etc., by representing such data as very high-dimensional vectors
- MATLAB implementations of all of these techniques are available on the Web
- These techniques can be useful, but like any high-powered tool they are not a solution to everything
  - real-world data sets often do not produce the types of elegant 2d embeddings that one often sees in research papers on these topics