1
ICS 278: Data Mining
Lecture 5: Low-Dimensional Representations of High-Dimensional Data

2
Today's lecture
  • Extend project proposal deadline to Monday, 8am;
    questions?
  • Outline of today's lecture
  • Orphan slides from earlier lectures
  • Dimension reduction methods
  • Motivation
  • Variable selection methods
  • Linear projection techniques
  • Non-linear embedding methods

3
Notation Reminder
  • n objects, each with p measurements
  • x(i) = data vector for the ith object
  • X = the n x p data matrix
  • xj(i) = value in the ith row, jth column of X
  • columns -> variables
  • rows -> data points
  • Can define distances/similarities
  • between rows (data vectors i)
  • between columns (variables j)

4
Distance
  • Makes sense in the case where the different
    measurements are commensurate, i.e., each variable
    is measured in the same units.
  • If the measurements are in different units, say length
    and weight, Euclidean distance is not necessarily
    meaningful (see the sketch below).
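As a concrete illustration, here is a minimal sketch of how standardizing by per-variable standard deviations changes a Euclidean distance when units are not commensurate. Python/NumPy is used for this and all later sketches (the lecture itself points to MATLAB implementations); the numbers are made up for illustration.

    import numpy as np

    # Two data vectors with incommensurate units: [height in cm, weight in kg]
    x = np.array([180.0, 80.0])
    y = np.array([170.0, 60.0])

    # Raw Euclidean distance: dominated by whichever variable has the larger scale
    d_raw = np.sqrt(np.sum((x - y) ** 2))

    # One common fix: standardize each variable by its standard deviation,
    # estimated from a data matrix X (rows = objects, columns = variables)
    X = np.array([[180.0, 80.0],
                  [170.0, 60.0],
                  [160.0, 55.0]])
    s = X.std(axis=0)                        # per-variable standard deviation
    d_std = np.sqrt(np.sum(((x - y) / s) ** 2))

    print(d_raw, d_std)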

5
Dependence among Variables
  • Covariance and correlation measure linear
    dependence (distance between variables, not
    objects)
  • Assume we have two variables or attributes X and
    Y and n objects taking on values x(1), ..., x(n)
    and y(1), ..., y(n). The sample covariance of X
    and Y is
    Cov(X, Y) = (1/n) * sum over i of (x(i) - xbar)(y(i) - ybar)
  • The covariance is a measure of how X and Y vary
    together.
  • it will be large and positive if large values of
    X are associated with large values of Y, and
    small values of X with small values of Y

6
Correlation coefficient
  • Covariance depends on the ranges of X and Y
  • Standardize by dividing by the standard deviations
  • The linear correlation coefficient is defined as
    rho(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)
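A minimal NumPy sketch of both definitions, using made-up values for x(1), ..., x(n) and y(1), ..., y(n):

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])   # values x(1), ..., x(n) of variable X
    y = np.array([1.0, 3.0, 7.0, 9.0])   # values y(1), ..., y(n) of variable Y
    n = len(x)

    # Sample covariance: average product of deviations from the means
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n

    # Linear correlation coefficient: covariance standardized by the
    # standard deviations, so it always lies in [-1, 1]
    rho_xy = cov_xy / (x.std() * y.std())

    print(cov_xy, rho_xy)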

7
Sample Correlation Matrix
[Figure: correlation matrix for data on characteristics of Boston suburbs, color scale from -1 to +1; variables include average rooms and median house value]
8
Mahalanobis distance (between objects)
d_MH(x, y) = sqrt( (x - y)T S^-1 (x - y) )
  where (x - y) is the vector difference in p-dimensional space,
  S^-1 is the inverse covariance matrix,
  and the whole expression evaluates to a scalar distance
  • It automatically accounts for the scaling of the
    coordinate axes
  • It corrects for correlation between the different
    features
  • Cost
  • The covariance matrices can be hard to determine
    accurately
  • The memory and time requirements grow
    quadratically, O(p^2), rather than linearly with
    the number of features.
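A minimal NumPy sketch of the Mahalanobis distance defined above; the data matrix is a made-up toy example:

    import numpy as np

    # Data matrix: rows = objects, columns = p features (toy example)
    X = np.array([[2.0, 1.0],
                  [3.0, 2.5],
                  [4.0, 3.0],
                  [5.0, 5.5],
                  [6.0, 6.0]])

    S = np.cov(X, rowvar=False)        # p x p sample covariance matrix
    S_inv = np.linalg.inv(S)           # its inverse (may be ill-conditioned for large p)

    def mahalanobis(x, y, S_inv):
        """Mahalanobis distance between two p-dimensional vectors."""
        d = x - y                      # vector difference in p-dimensional space
        return np.sqrt(d @ S_inv @ d)  # evaluates to a scalar distance

    print(mahalanobis(X[0], X[4], S_inv))

    # Sanity check matching the next slides: if S is diagonal and isotropic
    # (S = c * I), this reduces to Euclidean distance rescaled by 1/sqrt(c).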

9
Example 1 of Mahalanobis distance
Covariance matrix is diagonal and isotropic ->
all dimensions have equal variance -> MH
distance reduces to Euclidean distance
10
Example 2 of Mahalanobis distance
Covariance matrix is diagonal but
non-isotropic -> dimensions do not have equal
variance -> MH distance reduces to weighted
Euclidean distance with weights equal to the
inverse variances
11
Example 2 of Mahalanobis distance
The two outer blue points will have the same MH distance
to the center blue point
12
Distances between Binary Vectors
For binary vectors i and j, count the number of variables in each
combination: n11 (both 1), n00 (both 0), n10 (item i = 1, item j = 0),
and n01 (item j = 1, item i = 0)
  • matching coefficient = (n11 + n00) / (n11 + n10 + n01 + n00),
    the fraction of variables on which the two items agree
    (see the sketch below)
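A minimal NumPy sketch of the matching coefficient for two made-up binary vectors:

    import numpy as np

    xi = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # binary vector for item i
    xj = np.array([1, 1, 1, 0, 0, 0, 1, 1])   # binary vector for item j

    n11 = np.sum((xi == 1) & (xj == 1))        # both 1
    n00 = np.sum((xi == 0) & (xj == 0))        # both 0
    n10 = np.sum((xi == 1) & (xj == 0))        # item i = 1, item j = 0
    n01 = np.sum((xi == 0) & (xj == 1))        # item j = 1, item i = 0

    # Matching coefficient: fraction of variables on which the items agree
    matching = (n11 + n00) / (n11 + n00 + n10 + n01)
    print(matching)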

13
Other distance metrics
  • Categorical variables
  • Number of matches divided by number of dimensions
  • Distances between strings of different lengths
  • e.g., Patrick J. Smyth and Padhraic Smyth
  • Edit distance (see the sketch below)
  • Distances between images and waveforms
  • Shift-invariant, scale-invariant
  • i.e., d(x, y) = min over a, b of d(ax + b, y)
  • More generally, kernel methods
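As an illustration of edit distance, here is a standard dynamic-programming sketch (Levenshtein distance with unit costs; the lecture does not specify the exact cost scheme):

    def edit_distance(s, t):
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn string s into string t."""
        m, n = len(s), len(t)
        # dp[i][j] = edit distance between s[:i] and t[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # delete from s
                               dp[i][j - 1] + 1,         # insert into s
                               dp[i - 1][j - 1] + cost)  # substitute or match
        return dp[m][n]

    print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))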

14
Transforming Data
  • Duality between form of the data and the model
  • Useful to bring data onto a natural scale
  • Some variables are very skewed, e.g., income
  • Common transforms: square root, reciprocal,
    logarithm, raising to a power
  • Often very useful when dealing with skewed
    real-world data
  • Logit transform: maps values in (0, 1) onto the
    real line (see the sketch below)
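A minimal NumPy sketch of the log and logit transforms on made-up values:

    import numpy as np

    # Skewed variable, e.g., income: the log transform compresses the long right tail
    income = np.array([12_000.0, 25_000.0, 40_000.0, 95_000.0, 2_000_000.0])
    log_income = np.log(income)

    # Logit transform: maps a proportion in (0, 1) onto the whole real line
    p = np.array([0.05, 0.5, 0.95])
    logit_p = np.log(p / (1 - p))

    print(log_income, logit_p)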

15
Data Quality
  • Individual measurements
  • Random noise in individual measurements
  • Variance (precision)
  • Bias
  • Random data entry errors
  • Noise in label assignment (e.g., class labels in
    medical data sets)
  • Systematic errors
  • E.g., all ages > 99 recorded as 99
  • More individuals aged 20, 30, 40, etc. than
    expected
  • Missing information
  • Missing at random
  • Questions on a questionnaire that people randomly
    forget to fill in
  • Missing systematically
  • Questions that people don't want to answer
  • Patients who are too ill for a certain test

16
Data Quality
  • Collections of measurements
  • Ideal case: random sample from the population of
    interest
  • Real case: often a biased sample of some sort
  • Key point patterns or models built on the
    training data may only be valid on future data
    that comes from the same distribution
  • Examples of non-randomly sampled data
  • Medical study where subjects are all students
  • Geographic dependencies
  • Temporal dependencies
  • Stratified samples
  • E.g., 50% healthy, 50% ill
  • Hidden systematic effects
  • E.g., market basket data from the weekend of a large
    sale in the store
  • E.g., Web log data during finals week

17
Classifier technology and the illusion of
progress (abstract for workshop on
State-of-the-Art in Supervised Classification, May
2006), Professor David J. Hand, Imperial College,
London. Supervised classification methods are
widely used in data mining. Highly sophisticated
methods have been developed, using the full power
of recent advances in computation. However, these
advances have largely taken place within the
context of a classical paradigm, in which
construction of the classification rule is based
on a design sample of data randomly sampled
from unknown but well defined distributions of
the classes. In this paper, I argue that this
paradigm fails to take account of other sources
of uncertainty in the classification problem, and
that these other sources lead to uncertainties
which often swamp those arising from the
classical ones of estimation and prediction.
Several examples of such sources are given,
including imprecision in the definitions of the
classes, sample selectivity bias, population
drift, and use of inappropriate optimisation
criteria when fitting the model. Furthermore, it
is argued, there are both theoretical arguments
and practical evidence supporting the assertion
that the marginal gains of increasing classifier
complexity can often be minimal. In brief, the
advances in classification technology are
typically much less than is often claimed.
18
Dimension Reduction Methods (reading: 3.6 to 3.8
in the text)
19
Dimension Reduction methods
  • Dimension reduction
  • From p-dimensional x to d-dimensional x', with
    d < p
  • Techniques
  • Variable selection
  • use an algorithm to find individual variables in
    x that are relevant to the problem and discard
    the rest
  • stepwise logistic regression
  • Linear projections
  • Project data to a lower-dimensional space
  • E.g., principal components
  • Non-linear embedding
  • Use a non-linear mapping to embed data in a
    lower-dimensional space
  • E.g., multidimensional scaling

20
Dimension Reduction: why is it useful?
  • In general, incurs loss of information about x
  • So why do this?
  • If dimensionality p is very large (e.g., 1000s),
    representing the data in a lower-dimensional
    space may make learning more reliable,
  • e.g., clustering example
  • 100 dimensional data
  • but cluster structure is only present in 2 of
    the dimensions, the others are just noise
  • if other 98 dimensions are just noise (relative
    to cluster structure), then clusters will be much
    easier to discover if we just focus on the 2d
    space
  • Dimension reduction can also provide
    interpretation/insight
  • e.g., for 2d visualization purposes
  • Caveat: the 2-step approach of dimension reduction
    followed by a learning algorithm is in general
    suboptimal

21
Variable Selection Methods
  • p variables, would like to use a smaller subset
    in our model
  • e.g., in classification, do kNN in d-space rather
    than p-space
  • e.g., for logistic regression, use d inputs
    rather than p
  • Problem
  • Number of subsets of p variables is O(2^p)
  • Exhaustive search is impossible except for very
    small p
  • Typically the search problem is NP-hard
  • Common solution
  • Local systematic search (e.g., add/delete
    variables 1 at a time) to locally maximize a
    score function (i.e., hill-climbing)
  • e.g., add a variable, build new model, generate
    new score, etc
  • Can often work well, but can get trapped in local
    maxima/minima
  • Can also be computationally-intensive (depends on
    model)
  • Note: some techniques, such as decision tree
    predictors, automatically perform dimension
    reduction as part of the learning algorithm.
    (A sketch of a greedy forward-selection search
    follows below.)
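A minimal sketch of such a local search, written as greedy forward selection that adds one variable at a time while a score keeps improving. Using scikit-learn's LogisticRegression with 5-fold cross-validated accuracy as the score function is an assumption for illustration, not a method prescribed in the lecture.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X, y, d):
        """Greedy hill-climbing: add one variable at a time, keeping the
        addition that most improves cross-validated accuracy."""
        selected, remaining = [], list(range(X.shape[1]))
        best_score = -np.inf
        while remaining and len(selected) < d:
            scores = []
            for j in remaining:
                cols = selected + [j]
                s = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, cols], y, cv=5).mean()
                scores.append((s, j))
            s, j = max(scores)
            if s <= best_score:        # no improvement: stop (local maximum)
                break
            best_score, selected = s, selected + [j]
            remaining.remove(j)
        return selected

    # Toy usage: 200 objects, 10 variables, only the first two are informative
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(forward_select(X, y, d=3))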

22
Linear Projections
x = p-dimensional vector of data measurements
Let a = weight vector, also of dimension p
Assume aTa = 1 (i.e., unit norm)
aTx = sum_j aj xj
    = projection of x onto vector a;
      gives the distance of the projected x along a
e.g., aT = [1 0]        -> projection along the 1st dimension
      aT = [0 1]        -> projection along the 2nd dimension
      aT = [0.71, 0.71] -> projection along the diagonal
23
Example of projection from 2d to 1d
[Figure: 2d data (axes x1, x2) showing the direction of the weight vector a onto which points are projected]
24
Projections to more than 1 dimension
  • Multidimensional projections
  • e.g., x is 4-dimensional
  • a1T = [0.71 0.71 0 0]
  • a2T = [0 0 0.71 0.71]
  • ATx -> coordinates of x in the 2d space
    spanned by the columns of A
  • -> a linear transformation from
    the 4d to the 2d space
  • where A = [a1 a2]
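A minimal NumPy sketch of this 4d-to-2d linear projection, using the example weight vectors above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])        # one 4-dimensional data vector

    a1 = np.array([0.71, 0.71, 0.00, 0.00])   # (approximately) unit-norm weight vectors
    a2 = np.array([0.00, 0.00, 0.71, 0.71])
    A = np.column_stack([a1, a2])             # p x d matrix, here 4 x 2

    z = A.T @ x                               # coordinates of x in the 2d space
    print(z)                                  # [0.71*1 + 0.71*2, 0.71*3 + 0.71*4]

    # For a whole p x n data matrix X (columns = data vectors),
    # A.T @ X gives the d x n matrix of projected coordinates.
    X = np.random.default_rng(0).normal(size=(4, 10))
    Z = A.T @ X                               # 2 x 10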

25
Principal Components Analysis (PCA)
X = p x n data matrix, columns = p-dimensional data vectors
Let a = weight vector, also of dimension p
Assume aTa = 1 (i.e., unit norm)
aTX = projection of each column x onto vector a,
      the vector of distances of the projected x vectors along a
PCA: find the vector a such that var(aTX) is maximized,
     i.e., find the linear projection with maximal variance
More generally: ATX = d x n data matrix with the x vectors
     projected to d-dimensional space, where size(A) = p x d
PCA: find d orthogonal columns of A such that the variance in the
     d-dimensional projected space is maximized, d < p
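A minimal NumPy sketch of one standard way to compute this, via eigendecomposition of the sample covariance matrix (see slide 28 and page 78 in the text for the derivation); the random data are for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 200))              # p x n data matrix, columns = data vectors
    X = X - X.mean(axis=1, keepdims=True)      # center each variable

    C = np.cov(X)                              # p x p sample covariance matrix
    evals, evecs = np.linalg.eigh(C)           # symmetric eigendecomposition (ascending)

    d = 2
    A = evecs[:, ::-1][:, :d]                  # columns a1..ad = top-d eigenvectors
    Z = A.T @ X                                # d x n matrix of projected coordinates

    # The first column of A is the direction a maximizing var(aTX);
    # the projected variances equal the top eigenvalues of C.
    print(evals[::-1][:d], Z.var(axis=1, ddof=1))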
26
PCA Example
[Figure: 2d scatter (axes x1, x2) showing the direction of the 1st principal component vector, the highest-variance projection]
27
PCA Example
[Figure: the same 2d scatter (axes x1, x2) showing the directions of the 1st and 2nd principal component vectors]
28
How do we compute the principal components?
  • See class notes
  • See also page 78 in the text

29
Notes on PCA
  • Basis representation and approximation error
  • Scree diagrams
  • Computational complexity of computing PCA
  • Equivalent to solving set of linear equations,
    matrix inversion, singular value decomposition,
    etc
    Scales in general as O(np^2 + p^3)
  • Many numerical tricks possible, e.g., for sparse
    X matrices, for finding only the first k
    eigenvectors, etc
  • In MATLAB can use eig.m or svd.m (also note
    sparse versions)

30
More notes on PCA
  • Links between PCA and multivariate Gaussian
    density
  • Caveat PCA can destroy information about
    differences between groups for clustering or
    classification
  • PCA for other data types
  • Images, e.g., eigenfaces
  • Text, e.g., latent semantic indexing (LSI)

31
Basis images (eigenimages) of faces
32
20 face images
33
First 16 Eigenimages
34
First 4 eigenimages
35
Reconstruction of First Image with 8 eigenimages
Reconstructed Image
Original Image
36
Reconstruction of first image with 8 eigenimages
Weights: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
37
Reconstruction of 7th image with eigenimages
Reconstructed Image
Original Image
38
Reconstruction of 7th image with 8 eigenimages
Weights: -13.7, 12.9, 1.6, 4.4, 3.0, 0.9, 1.6, -6.3
Weights for Image 1: -14.0, 9.4, -1.1, -3.5, -9.8, -3.5, -0.6, 0.6
Reconstructed image = weighted sum of the 8 eigenimages on the left
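A minimal NumPy sketch of this reconstruction step. The mean image, eigenimages, and input image below are random placeholders standing in for the quantities produced by PCA on the face data; only the "weighted sum of eigenimages" logic is the point.

    import numpy as np

    # Assume PCA on the face images has already produced:
    #   mean_img : p-vector, the mean image (p = number of pixels)
    #   U        : p x k matrix whose columns are the first k eigenimages
    rng = np.random.default_rng(0)
    p, k = 64 * 64, 8
    mean_img = rng.normal(size=p)                 # placeholder, for illustration only
    U = np.linalg.qr(rng.normal(size=(p, k)))[0]  # orthonormal columns stand in for eigenimages
    img = rng.normal(size=p)                      # placeholder "face" to reconstruct

    w = U.T @ (img - mean_img)            # k weights (e.g., -14.0, 9.4, ... in the slides)
    reconstruction = mean_img + U @ w     # weighted sum of the k eigenimages plus the mean

    print(np.linalg.norm(img - reconstruction))   # residual error with only k components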
39
Reconstructing Image 6 with 16 eigenimages
40
Multidimensional Scaling (MDS)
  • Say we have data in the form of an N x N matrix
    of dissimilarities
  • 0s on the diagonal
  • Symmetric
  • Could either be given data in this form, or
    create such a dissimilarity matrix from our data
    vectors
  • Examples
  • Perceptual dissimilarity of N objects in
    cognitive science experiments
  • String-edit distance between N protein sequences
  • MDS
  • Find k-dimensional coordinates for each of the N
    objects such that the Euclidean distances in the
    embedded space match the given dissimilarities
    as closely as possible

41
Multidimensional Scaling (MDS)
  • MDS score function ("stress"):
    S = sum over pairs (i, j) of ( d(i, j) - delta(i, j) )^2
    (possibly normalized), where d(i, j) is the Euclidean distance
    between points i and j in the embedded k-dimensional space and
    delta(i, j) is the original dissimilarity
  • N points embedded in k dimensions -> Nk locations
    or parameters
  • How to find the Nk locations?
  • Solve an optimization problem -> minimize the S function
  • Often used for visualization, e.g., k = 2 or 3
42
MDS Optimization
  • Optimization problem
  • S is a function of Nk parameters
  • find the set of N k-dimensional positions that
    minimize S
  • Note: 3 parameters are redundant: location (2)
    and rotation (1) (for a 2d embedding)
  • If the original dissimilarities are Euclidean
  • -> linear algebra solution (equivalent to
    principal components); see the sketch below
  • Non-Euclidean (more typical)
  • Local iterative hill-climbing, e.g., move each
    point to decrease S, repeat
  • Non-trivial optimization; can have local minima,
    etc.
  • Initialization: either random or heuristic
    (e.g., by PCA)
  • Complexity is O(N^2 k) per iteration (iteration =
    move all points locally)
  • See Faloutsos and Lin (1995) for FastMap, an O(Nk)
    approximation for large N
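A minimal NumPy sketch of the linear-algebra route mentioned above, under the assumption that the dissimilarities are Euclidean; this is the standard classical-scaling (double-centering) construction, not necessarily the exact variant used in the lecture.

    import numpy as np

    def classical_mds(D, k):
        """Classical scaling: embed N objects in k dimensions from an
        N x N matrix D of (assumed Euclidean) dissimilarities."""
        N = D.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
        B = -0.5 * J @ (D ** 2) @ J              # double-centered squared distances
        evals, evecs = np.linalg.eigh(B)
        idx = np.argsort(evals)[::-1][:k]        # top-k eigenvalues/eigenvectors
        L = np.sqrt(np.maximum(evals[idx], 0.0))
        return evecs[:, idx] * L                 # N x k matrix of coordinates

    # Toy usage: distances computed from known 2d points are recovered
    # (up to translation/rotation, the redundant parameters noted above).
    rng = np.random.default_rng(0)
    P = rng.normal(size=(10, 2))
    D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
    Y = classical_mds(D, k=2)
    print(Y.shape)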

43
MDS example: input distance data
44
(No Transcript)
45
Result of MDS
46
MDS for protein sequences
[Figure: similarity matrix (note cluster structure) and the resulting MDS embedding]
226 protein sequences of the Globin family (from
Klock and Buhmann, 1997).
47
MDS from human judgements of similarity
48
MDS Example data
49
MDS 2d embedding of face images
50
(No Transcript)
51
Other embedding techniques
  • Many other algorithms for non-linear embedding
  • Some of the better-known examples
  • Self-organizing maps (Kohonen)
  • Neural-network inspired algorithm for 2d
    embedding
  • ISOMAP, Local linear embedding
  • Find low-d coordinates that preserve local
    distances
  • Ignore global distances

52
Example of Local Linear Embedding (LLE)
(Roweis and Saul, Science, 2000)
Note how points that are far apart in manifold
distance on the 3d manifold (e.g., red and blue)
would be mapped close together by MDS or PCA,
but are kept far apart by LLE.
LLE emphasizes local relationships.
53
LLE Algorithm
  • N points in dimension p; wish to reduce to
    dimension d, d < p
  • Step 1
  • Select K nearest neighbors for each point in
    the training data
  • Represent each point X as a weighted linear
    combination of its K neighbor points
  • Find the best K weights for each of the X vectors
    (least-squares fitting)
  • Step 2
  • Fix the weights from Step 1
  • For each p-dim vector X, find a d-dimensional
    vector Y that is closest to its reconstructed
    approximation based on its d-dim neighbors and
    the fixed weights
  • Reduces to another linear algebra/eigenvalue
    problem, O(N^3) complexity (see the sketch below)
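A minimal sketch of LLE using scikit-learn's LocallyLinearEmbedding on synthetic swiss-roll data (an illustrative choice, not the data or implementation used in the original paper):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    # N points on a 3d "swiss roll" manifold, similar in spirit to the figure above
    X, color = make_swiss_roll(n_samples=1000, random_state=0)

    # K nearest neighbors per point, target dimension d = 2
    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
    Y = lle.fit_transform(X)             # N x d embedded coordinates
    print(Y.shape)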

54
(No Transcript)
55
LLE applied to text data
56
LLE applied to a set of face images
57
Local Linear Embedding example
58
ISOMAP (Tenenbaum, de Silva, and Langford, Science,
2000)
Similar to LLE in concept: preserves local
distances. The computational strategy is different:
it measures distance in the original space via
geodesic paths (distance on the manifold).
The algorithm involves finding shortest paths between
points and then embedding (see the sketch below).
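A minimal sketch of ISOMAP using scikit-learn's Isomap on the same kind of synthetic swiss-roll data (again an illustrative choice, not the authors' implementation):

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap

    X, color = make_swiss_roll(n_samples=1000, random_state=0)   # 3d manifold data

    # Geodesic (shortest-path) distances over the K-nearest-neighbor graph,
    # followed by a classical-MDS-style embedding into 2d
    iso = Isomap(n_neighbors=10, n_components=2)
    Y = iso.fit_transform(X)
    print(Y.shape)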
59
Examples of ISOMAP embeddings in 2d
60
ISOMAP morphing examples
61
Summary on Dimension Reduction
  • Can be used for defining a new set of
    (lower-dimensional) variables for modeling or for
    visualization/insight
  • 3 general approaches
  • Variable selection (only select relevant
    variables)
  • Linear projections (e.g., PCA)
  • Non-linear projections (e.g., MDS, LLE, ISOMAP)
  • Can be used with text, images, etc., by
    representing such data as very high-dimensional
    vectors
  • MATLAB implementations of all of these techniques
    are available on the Web
  • These techniques can be useful, but like any
    high-powered tool they are not a solution to
    everything
  • real-world data sets often do not produce the
    types of elegant 2d embeddings that one often
    sees in research papers on these topics.