Transcript and Presenter's Notes

Title: Neural Computation 0368-4149-01


1
Neural Computation 0368-4149-01
  • Prof. Nathan Intrator
  • Tuesday 16:00-19:00, Schreiber 007
  • Office hours: Wed 4-5
  • nin@tau.ac.il

2
Outline
  • Goals for neural learning - Unsupervised
  • Goals for statistical/computational learning
  • PCA
  • ICA
  • Exploratory Projection Pursuit
  • Search for non-Gaussian distributions
  • Practical implementations

3
Statistical Approach to Unsupervised Learning
  • Understanding the nature of data variability
  • Modeling the data (sometimes very flexible model)
  • Understanding the nature of the noise
  • Applying prior knowledge
  • Extracting features based on
  • Prior knowledge
  • Class prediction
  • Unsupervised learning

4
  • Principal Component Analysis.
  • Wlodzislaw Duch
  • SCE, NTU, Singapore
  • http://www.ntu.edu.sg/home/aswduch

5
Linear transformations example
  • 2D vectors X in a unit circle with mean (1,1);
    Y = AX, A is a 2x2 matrix

The shape is elongated, rotated and the mean is
shifted.
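A minimal numpy sketch of this example; the matrix A below is an arbitrary illustration, not necessarily the one used on the slide.

```python
import numpy as np

# Points on a unit circle centred at (1, 1), stored as columns of X (2 x n).
theta = np.linspace(0.0, 2.0 * np.pi, 200)
X = np.vstack([np.cos(theta), np.sin(theta)]) + 1.0

# An arbitrary 2x2 matrix (illustrative choice only).
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])

Y = A @ X                              # linear transformation Y = AX

print("mean of X:", X.mean(axis=1))    # approximately (1, 1)
print("mean of Y:", Y.mean(axis=1))    # shifted to approximately A @ (1, 1)
```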
6
Invariant distances
  • Euclidean distance is not invariant to general
    linear transformations

This is invariant only for orthonormal matrices,
A^T A = I, which make rigid rotations without
stretching or shrinking distances. Idea:
standardize the data in some way to create
invariant distances.
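A quick numerical check of this claim (a sketch; the matrices below are arbitrary illustrations):

```python
import numpy as np

x, y = np.array([1.0, 2.0]), np.array([-0.5, 0.3])

# A general linear map changes Euclidean distances ...
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])
print(np.linalg.norm(x - y), np.linalg.norm(A @ x - A @ y))   # values differ

# ... but an orthonormal matrix (Q^T Q = I, a rigid rotation) preserves them.
phi = 0.7
Q = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
print(np.linalg.norm(x - y), np.linalg.norm(Q @ x - Q @ y))   # values equal
```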
7
Data standardization
  • For each data vector X(j) = (X1(j), ..., Xd(j))^T,
    j = 1..n,
  • calculate the mean and std; n = number of vectors,
    d = their dimension.

Vector of mean feature values, mi = (1/n) Σj Xi(j):
averages over the rows.
8
Standard deviation
  • Calculate standard deviation

Variance = square of the standard deviation (std):
the average squared deviation from the mean value.
Transform X → Z, standardized data vectors:
Zi(j) = (Xi(j) − mi) / σi.
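A small sketch of this standardization step, assuming the n data vectors are stored as the columns of a d x n matrix X (illustrative random data):

```python
import numpy as np

def standardize(X):
    """Z-score the d x n data matrix X: zero mean, unit std per feature (row)."""
    m = X.mean(axis=1, keepdims=True)    # vector of mean feature values
    s = X.std(axis=1, keepdims=True)     # per-feature standard deviation
    return (X - m) / s                   # Z = (X - m) / s

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(3, 100))   # 3 features, 100 vectors
Z = standardize(X)
print(Z.mean(axis=1), Z.std(axis=1))     # approximately 0 and 1
```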
9
Std data
  • Std data: zero mean and unit variance.

Standardize the data after making the data
transformation. Effect: the data are invariant to
scaling only (diagonal transformations). Distances
are invariant and the data distribution is the same.
How can the data be made invariant to any linear
transformation?
10
Data standardization example
  • For our example Y = AX, assuming X has means = 1
    and variances = 1

Transformation of the vector of mean feature values
and of the variances: check it!
How to make this invariant?
11
Covariance matrix
  • Variance (spread around the mean value) and
    correlation between features.

CX = (1/(n−1)) X Xᵀ is d x d,
where X is the d x n matrix of vectors shifted to
their means. The covariance matrix is symmetric,
Cij = Cji, and positive definite.
Diagonal elements are the variances (squares of the
std), σi² = Cii.
A spherical distribution of data has CX = I (the unit
matrix). Elongated ellipsoids: large off-diagonal
elements, strong correlations between features.
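A sketch of this computation for column-vector data; note that np.cov (and the explicit formula below) uses the 1/(n−1) normalization, which may differ from the slide's convention.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))            # d x n data, features in rows

Xc = X - X.mean(axis=1, keepdims=True)   # shift vectors to their means
C = (Xc @ Xc.T) / (X.shape[1] - 1)       # d x d covariance matrix

print(np.allclose(C, C.T))               # symmetric: Cij = Cji
print(np.allclose(C, np.cov(X)))         # agrees with numpy's estimator
print(np.diag(C))                        # diagonal elements = feature variances
```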
12
Mahalanobis distance
  • Linear combinations of features lead to
    rotations and scaling of the data.

Mahalanobis distance,
D²(X, Y) = (X − Y)ᵀ C⁻¹ (X − Y),
is invariant to linear transformations.
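A sketch of this distance and of the invariance claim, with arbitrary illustrative data: the value of D²(x, y) is unchanged when x, y, and C are all mapped through the same invertible matrix A.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 1000))                 # d x n data
C = np.cov(X)                                  # covariance of the data

def mahalanobis(x, y, C):
    d = x - y
    return float(np.sqrt(d @ np.linalg.solve(C, d)))

x, y = X[:, 0], X[:, 1]
A = rng.normal(size=(3, 3))                    # random map, almost surely invertible
print(mahalanobis(x, y, C))                    # distance in the original space
print(mahalanobis(A @ x, A @ y, A @ C @ A.T))  # same value after the linear map
```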
13
Principal components
  • How to avoid correlated features?
  • Correlations ⇒ the covariance matrix is
    non-diagonal!
  • Solution: diagonalize it, then use the
    transformation that makes it diagonal to
    de-correlate the features.

In matrix form: X, Y are d x n; Z, CX, CY are d x d.
C is a symmetric, positive definite matrix: XᵀCX > 0
for X ≠ 0; its eigenvectors are orthonormal and its
eigenvalues are all non-negative. Z, the matrix of
orthonormal eigenvectors (because C is real and
symmetric), transforms X into Y = ZᵀX with a diagonal
CY, i.e. decorrelated features.
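A sketch of this diagonalization: eigendecompose CX and check that Y = ZᵀX has a diagonal covariance matrix (the correlated toy data below are generated only for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
M = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.7]])
X = M @ rng.normal(size=(3, 2000))     # correlated d x n data, zero mean

CX = np.cov(X)
lam, Z = np.linalg.eigh(CX)            # eigenvalues (ascending order) and
                                       # orthonormal eigenvectors as columns
Y = Z.T @ X                            # transformed, de-correlated vectors
CY = np.cov(Y)

print(np.round(CY, 3))                 # approximately diagonal
print(np.allclose(np.diag(CY), lam))   # the diagonal holds the eigenvalues
```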
14
Matrix form
  • Eigenproblem for the C matrix in matrix form:
    C Z = Z Λ, with Λ = diag(λ1, ..., λd).

15
Principal components
  • PCA: an old idea, C. Pearson (1901), H. Hotelling
    (1933)

Y: the principal components, i.e. the vectors X
transformed using the eigenvectors of CX. The
covariance matrix of the transformed vectors is
diagonal ⇒ ellipsoidal distribution of data.
Result: PCs are linear combinations of all features,
providing new uncorrelated features, with a diagonal
covariance matrix of eigenvalues.
Small λi ⇒ small variance ⇒ the data change little in
direction Yi. PCA minimizes the reconstruction error
of the C matrix: the Zi vectors for large λi are
sufficient, because vectors for small eigenvalues
contribute very little to the covariance matrix.
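A sketch of this reconstruction claim: rebuilding CX from only the eigenvectors with the largest eigenvalues already leaves a small error (toy data with a few dominant directions, for illustration only).

```python
import numpy as np

rng = np.random.default_rng(4)
scales = np.array([[3.0], [2.0], [1.0], [0.2], [0.1]])
X = scales * rng.normal(size=(5, 1000))          # 5 features, unequal variances
CX = np.cov(X)

lam, Z = np.linalg.eigh(CX)                      # ascending eigenvalues
order = np.argsort(lam)[::-1]                    # re-sort in descending order
lam, Z = lam[order], Z[:, order]

k = 2                                            # keep the k largest eigenvalues
C_approx = (Z[:, :k] * lam[:k]) @ Z[:, :k].T     # truncated Z Lambda Z^T
print(np.linalg.norm(CX - C_approx) / np.linalg.norm(CX))   # small relative error
```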
16
Two components for visualization
Diagonalization methods: see Numerical Recipes,
www.nr.com
  • New coordinate system: axes ordered according to
    variance = size of the eigenvalue.
  • The first k dimensions account for the fraction
    (λ1 + ... + λk) / (λ1 + ... + λd)
    of all the variance (note that the λi are
    variances); frequently 80-90% is sufficient for a
    rough description.
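A sketch of this cumulative-variance criterion: choose the smallest k whose eigenvalues cover a given fraction (say 90%) of the total variance. The eigenvalues below are illustrative numbers.

```python
import numpy as np

def n_components_for(fraction, eigenvalues):
    """Smallest k whose top-k eigenvalues account for `fraction` of the variance."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # descending
    cum = np.cumsum(lam) / lam.sum()               # cumulative variance fraction
    return int(np.searchsorted(cum, fraction) + 1)

lam = [4.2, 2.1, 0.9, 0.4, 0.25, 0.15]             # illustrative eigenvalues
print(n_components_for(0.90, lam))                 # -> 3 for these values
```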
17
PCA properties
  • PC Analysis (PCA) may be achieved by:
  • a transformation making the covariance matrix
    diagonal;
  • projecting the data on a line for which the sum
    of squared distances from the original points to
    their projections is minimal;
  • an orthogonal transformation to new variables
    that have stationary variances.

True covariance matrices are usually not known, only
estimated from data. This works well on
single-cluster data; more complex structure may
require local PCA, performed separately for each
cluster.
PCA is useful for finding new, more informative,
uncorrelated features; reducing dimensionality by
rejecting low-variance features; and reconstructing
covariance matrices from low-dimensional data.
18
PCA Wisconsin example
  • Wisconsin Breast Cancer data
  • Collected at the University of Wisconsin
    Hospitals, USA.
  • 699 cases, 458 (65.5%) benign (red), 241
    malignant (green).
  • 9 features, quantized 1, 2, ..., 10, describing
    cell properties, e.g.:
  • Clump Thickness, Uniformity of Cell Size, Shape,
    Marginal Adhesion, Single Epithelial Cell Size,
    Bare Nuclei,

Bland Chromatin, Normal Nucleoli, Mitoses. 2D
scatterograms do not show any structure no matter
which subspaces are taken!
19
Example cont.
  • PCA gives useful information already in 2D.

Taking the first PCA component of the standardized
data: if Y1 > 0.41 then benign, else malignant;
18 errors / 699 cases = 97.4% accuracy.
The transformed vectors are not standardized; their
stds are given on the original slide.
Eigenvalues converge slowly, but the classes are
separated well.
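A sketch of this decision rule in numpy. The 9 x 699 feature matrix X and the label vector y are assumed to be loaded elsewhere from the UCI Wisconsin Breast Cancer file (not reproduced here), and the sign of the first eigenvector, hence the direction of the threshold, may need flipping.

```python
import numpy as np

def first_pc_rule(X, threshold=0.41):
    """Classify by thresholding the first principal component of standardized data.
    X: 9 x 699 matrix of quantized cell features (assumed loaded elsewhere)."""
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    lam, V = np.linalg.eigh(np.cov(Z))
    pc1 = V[:, np.argmax(lam)]            # eigenvector of the largest eigenvalue
    Y1 = pc1 @ Z                          # first principal component scores
    return np.where(Y1 > threshold, "benign", "malignant")

# errors = np.sum(first_pc_rule(X) != y)  # the slide reports 18 errors / 699 cases
```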
20
PCA disadvantages
  • Useful for dimensionality reduction, but:
  • The largest variance determines which components
    are used, yet it does not guarantee an interesting
    viewpoint for clustering the data.
  • The meaning of the features is lost when linear
    combinations are formed.

Analysis of the coefficients in Z1 and other
important eigenvectors may show which original
features are given much weight. PCA may also be
done in an efficient way by performing a singular
value decomposition of the standardized data
matrix. PCA is also called the Karhunen-Loève
transformation. Many variants of PCA are
described in A. Webb, Statistical Pattern
Recognition, J. Wiley, 2002.
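A sketch of the SVD route mentioned above: for mean-centred, standardized data the squared singular values divided by n−1 equal the eigenvalues of the covariance matrix, and the left singular vectors span the same principal directions (illustrative random data).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 300))
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Route 1: eigendecomposition of the covariance matrix.
lam = np.sort(np.linalg.eigh(np.cov(Z))[0])[::-1]

# Route 2: singular value decomposition of the standardized data matrix.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
lam_svd = s**2 / (Z.shape[1] - 1)

print(np.allclose(lam, lam_svd))   # same variances from both routes
# The columns of U match the covariance eigenvectors up to sign.
```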
21
2 skewed distributions
  • PCA transformation for 2D data

The first component will be chosen along the line of
largest variance; both clusters will strongly
overlap and no interesting structure will be
visible. In fact, the projection onto the axis
orthogonal to the first PCA component has much more
discriminating power. Discriminant coordinates
should be used to reveal the class structure.
22

High Dimensional Data
  → Dimension Reduction / Feature Extraction
  → Visualisation, Classification, Analysis
23
  • Projection Pursuit
  • what: an automated procedure that seeks
    interesting low-dimensional projections of a
    high-dimensional cloud by numerically
    maximizing an objective function or projection
    index.

  • Huber, 1985

24
  • Projection Pursuit
  • why:
  • Curse of dimensionality:
  • less robustness
  • worse mean squared error
  • greater computational cost
  • slower convergence to limiting distributions
  • The required number of labelled samples increases
    with dimensionality.

25
What is an interesting projection?
  • In general: the projection that reveals more
    information about the structure.
  • In pattern recognition: a projection that
    maximises class separability in a
    low-dimensional subspace.

26
  • Projection Pursuit
  • Dimension Reduction:
  • find lower-dimensional projections of a
    high-dimensional point cloud to facilitate
    classification.
  • Exploratory Projection Pursuit:
  • reduce the dimension of the problem to facilitate
    visualization.

27
  • Projection Pursuit
  • How many dimensions to use?
  • for visualization
  • for classification/analysis
  • Which projection index to use?
  • measure of variation (Principal Components)
  • departure from normality (negative entropy)
  • class separability (distance, Bhattacharyya,
    Mahalanobis, ...)
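A sketch of one possible "departure from normality" index, using the absolute excess kurtosis of the projected data as a cheap stand-in for negative entropy; this is an illustrative choice, not necessarily the index used in the course.

```python
import numpy as np

def projection_index(w, X):
    """Departure-from-normality index of the 1D projection w^T X:
    absolute excess kurtosis, which is 0 for Gaussian data."""
    w = w / np.linalg.norm(w)            # unit-length projection direction
    p = w @ X                            # project the d x n cloud onto w
    p = (p - p.mean()) / p.std()
    return abs(np.mean(p**4) - 3.0)
```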

28
  • Projection Pursuit
  • Which optimization method to choose?
  • We are trying to find the global optimum among
    many local ones:
  • hill-climbing methods (simulated annealing)
  • regular optimization routines with random
    starting points.
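A sketch of the second option, restarting an ordinary local optimizer from random directions and keeping the best projection; the kurtosis-style index repeats the one sketched above, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def projection_index(w, X):
    w = w / np.linalg.norm(w)
    p = w @ X
    p = (p - p.mean()) / p.std()
    return abs(np.mean(p**4) - 3.0)      # departure from normality

def best_projection(X, n_starts=20, seed=0):
    """Maximize the index from several random starting points, keep the best."""
    rng = np.random.default_rng(seed)
    best_w, best_val = None, -np.inf
    for _ in range(n_starts):
        w0 = rng.normal(size=X.shape[0])                     # random start
        res = minimize(lambda w: -projection_index(w, X),    # minimize the negative
                       w0, method="Nelder-Mead")
        if -res.fun > best_val:
            best_val = -res.fun
            best_w = res.x / np.linalg.norm(res.x)
    return best_w, best_val
```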

29
  • Timetable for Dimensionality reduction
  • Begin: 16 April 1998
  • Report on the state of the art: 1 June 1998
  • Begin software implementation: 15 June 1998
  • Prototype software presentation: 1 November 1998