1
Clustering Methods, Part 6: Dimensionality
Ilja Sidoroff, Pasi Fränti
Speech and Image Processing Unit, Department of Computer Science
University of Joensuu, FINLAND
2
Dimensionality of data
  • Dimensionality of a data set: the minimum number
    of free variables needed to represent the data
    without information loss
  • A d-attribute data set has an intrinsic
    dimensionality (ID) of M if its elements lie
    entirely within an M-dimensional subspace of R^d
    (M < d)

3
Dimensionality of data
  • The use of more dimensions than necessary leads
    to problems:
  • greater storage requirements
  • slower algorithms
  • finding clusters and creating good classifiers is
    more difficult (curse of dimensionality)

4
Curse of dimensionality
  • When the dimensionality of the space increases,
    distance measures become less useful
  • all points are more or less equidistant
  • most of the volume of a sphere is concentrated in
    a thin layer near its surface (see next slide)

5
V(r): volume of a sphere with radius r; D: dimension of the sphere
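Since V(r) is proportional to r^D, the fraction of the sphere's volume that lies in a thin shell of thickness \varepsilon at the surface is

  \frac{V(r) - V(r - \varepsilon)}{V(r)} = 1 - \left( 1 - \frac{\varepsilon}{r} \right)^{D} \;\to\; 1 \quad \text{as } D \to \infty ,

which is the concentration effect referred to on the previous slide.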
6
Two approaches
  • Estimation of dimensionality
  • knowing the ID of a data set could help in tuning
    classification or clustering performance
  • Dimensionality reduction
  • projecting the data onto some subspace
  • e.g. 2D/3D visualisation of a multi-dimensional
    data set
  • may result in information loss if the subspace
    dimension is smaller than the ID

7
Goodness of the projection
  • Can be estimated by two measures:
  • Trustworthiness: data points that are not
    neighbours in the input space are not mapped as
    neighbours in the output space.
  • Continuity: data points that are close are not
    mapped far away in the output space [11].

8
Trustworthiness
  • N: number of feature vectors
  • r(i,j): the rank of data sample j in the
    ordering according to the distance from i in the
    original data space
  • Uk(i): the set of feature vectors that are in the
    size-k neighbourhood of sample i in the
    projection space but not in the original space
  • A(k): scales the measure between 0 and 1
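With these definitions, the trustworthiness measure can be written in its standard form (the normalization A(k) below is the usual choice that scales the measure to [0, 1]):

  M_T(k) = 1 - A(k) \sum_{i=1}^{N} \sum_{j \in U_k(i)} \bigl( r(i,j) - k \bigr) ,
  \qquad
  A(k) = \frac{2}{N k \, (2N - 3k - 1)} .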

9
Continuity
  • r'(i,j): the rank of data sample j in the
    ordering according to the distance from i in the
    projection space
  • Vk(i): the set of feature vectors that are in the
    size-k neighbourhood of sample i in the original
    space but not in the projection space
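The continuity measure is defined analogously, with the same normalization A(k):

  M_C(k) = 1 - A(k) \sum_{i=1}^{N} \sum_{j \in V_k(i)} \bigl( r'(i,j) - k \bigr) .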

10
Example data sets
  • Swiss roll: 20,000 3D points
  • 2D manifold in 3D space
  • http://isomap.stanford.edu

11
Example data sets
  • 64 × 64 pixel images of hands in different
    positions
  • Each image can be considered as a 4096-dimensional
    data element
  • Could also be interpreted in terms of finger
    extension and wrist rotation (2D)

12
Example data sets
http://isomap.stanford.edu
13
Synthetic data sets [11]
S-shaped manifold
Sphere
Six clusters
14
Principal component analysis (PCA)
  • Idea: find the directions of maximal variance and
    align the coordinate axes with them.
  • If the variance along a dimension is zero, that
    dimension is not needed.
  • Drawback: works well only with linear data [1]

15
PCA method (1/2)
  • Center the data so that its mean is zero
  • Calculate the covariance matrix of the data
  • Calculate the eigenvalues and eigenvectors of the
    covariance matrix
  • Arrange the eigenvectors according to the eigenvalues
  • For dimensionality reduction, choose the desired
    number of eigenvectors (2 or 3 for visualization);
    see the sketch below
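A minimal NumPy sketch of these steps (the function name, the layout of X as an N x d array with one sample per row, and the return values are illustrative choices):

import numpy as np

def pca_project(X, n_components=2):
    Xc = X - X.mean(axis=0)               # center the data (zero mean)
    C = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues/eigenvectors (ascending order)
    order = np.argsort(eigvals)[::-1]     # arrange by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    A = eigvecs[:, :n_components].T       # keep the desired eigenvectors
    return Xc @ A.T, eigvals              # projection y_i = A x_i, plus the spectrum

# The number of eigenvalues clearly above zero (e.g. above a small tolerance
# relative to the largest one) gives the PCA estimate of the intrinsic dimensionality.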

16
PCA Method
  • Intrinsic dimensionality = the number of non-zero
    eigenvalues
  • Dimensionality reduction by the projection
    y_i = A x_i
  • Here x_i is the input vector, y_i the output
    vector, and A is the matrix containing the
    eigenvectors corresponding to the largest
    eigenvalues.
  • For visualization, typically 2 or 3 eigenvectors
    are preserved.

17
Example of PCA
  • The distances between points are different in the
    projections.
  • Test set (c):
  • two clusters are projected onto one cluster
  • the S-shaped cluster is projected nicely

18
Another example of PCA [10]
  • Data set: points lying on a circle (x² + y² = 1);
    the PCA estimate of the ID is 2
  • PCA yields two non-null eigenvalues even though
    the points lie on a one-dimensional manifold
  • u, v: the principal components

19
Limitations of PCA
  • Since the eigenvectors are orthogonal, PCA works
    well only with linear data
  • Tends to overestimate the ID
  • Kernel PCA uses the so-called kernel trick to apply
    PCA also to non-linear data:
  • map the data non-linearly into a higher-dimensional
    space and perform the PCA analysis in that space
    (see the sketch below)
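A brief sketch of this idea using scikit-learn's KernelPCA; the RBF kernel, its gamma value, and the placeholder data are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                      # placeholder data set
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Y = kpca.fit_transform(X)                          # non-linear 2D embedding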

20
Multidimensional scaling method (MDS)
  • Project the data into a new space while trying to
    preserve the distances between data points
  • Define a stress E (the difference of pairwise
    distances in the original and projection spaces)
  • E is minimized using some optimization algorithm
  • With certain stress functions (e.g. Kruskal's), a
    perfect projection exists when E is 0
  • The ID of the data is the smallest projection
    dimension for which a perfect projection exists

21
Metric MDS
  • The simplest stress function [2], the raw stress

d(xi, xj): distance in the original space
d(yi, yj): distance in the projection space
yi, yj: representations of xi, xj in the output space
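With these definitions, the raw stress takes the standard form

  E_{\text{raw}} = \sum_{i<j} \bigl( d(x_i, x_j) - d(y_i, y_j) \bigr)^{2} .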
22
Sammon's Mapping
  • Sammon's mapping gives small distances a larger
    weight [5]
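In its standard form, Sammon's stress divides each squared error by the corresponding original distance (and normalizes by the sum of all original distances):

  E_{\text{Sammon}} = \frac{1}{\sum_{i<j} d(x_i, x_j)}
  \sum_{i<j} \frac{\bigl( d(x_i, x_j) - d(y_i, y_j) \bigr)^{2}}{d(x_i, x_j)} .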

23
Kruskal's stress
  • Ranking the point distances accounts for the
    decrease of distances in lower-dimensional
    projections
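Kruskal's stress-1 can be written in its standard form, where \hat{d}_{ij} denotes the disparities obtained by monotone regression on the ranked original distances:

  S = \sqrt{ \frac{ \sum_{i<j} \bigl( d(y_i, y_j) - \hat{d}_{ij} \bigr)^{2} }{ \sum_{i<j} d(y_i, y_j)^{2} } } .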

24
MDS example
  • Separates clusters better than PCA
  • Local structures are not always preserved
    (leftmost test set)

25
Other MDS approaches
  • ISOMAP [12]
  • Curvilinear component analysis (CCA) [13]

26
Local methods
  • The previous methods are global in the sense that
    all the input data is considered at once.
  • Local methods consider only some neighbourhood of
    the data points => may be computationally less
    demanding
  • They try to estimate the topological dimension of
    the data manifold

27
Fukunaga-Olsen algorithm [6]
  • Assume that the data can be divided into small
    regions, i.e. clustered
  • Each cluster (Voronoi set) of the data then lies
    on an approximately linear surface => the PCA
    method can be applied to each cluster separately
  • The eigenvalues are normalized by dividing them
    by the largest eigenvalue

28
Fukunaga-Olsen algorithm
  • The ID is defined as the number of normalized
    eigenvalues that are larger than a threshold T
  • Defining a good threshold is a problem in itself
    (a rough sketch of the whole procedure is given below)
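A rough sketch of the procedure; using k-means for the initial partition and averaging the per-cluster estimates are illustrative choices, not prescribed by the original algorithm:

import numpy as np
from sklearn.cluster import KMeans

def fukunaga_olsen_id(X, n_clusters=10, threshold=0.05):
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    estimates = []
    for c in range(n_clusters):
        Xc = X[labels == c]
        if len(Xc) <= X.shape[1]:
            continue                                   # too few points for a stable local PCA
        C = np.cov(Xc - Xc.mean(axis=0), rowvar=False)
        eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
        normalized = eigvals / eigvals[0]              # normalize by the largest eigenvalue
        estimates.append(int(np.sum(normalized > threshold)))
    return int(round(np.mean(estimates)))              # combine the per-cluster estimates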

29
Near neighbour algorithm
  • Trunk's method [7]:
  • An initial value for an integer parameter k is
    chosen (usually k = 1).
  • The k nearest neighbours of each data vector are
    identified.
  • For each data vector i, the subspace spanned by
    the vectors from i to each of its k nearest
    neighbours is constructed.

30
Near neighbour algorithm
  • The angle between the (k+1)-th near neighbour and
    its projection onto the subspace is calculated for
    each data vector
  • If the average of these angles is below a
    threshold, the ID is k; otherwise k is increased
    and the process repeated

(figure: the angle between the (k+1)-th near neighbour and the subspace)
31
Pseudocode
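A rough sketch of the procedure described on the previous two slides; the angle threshold, the maximum k, and the function name are illustrative assumptions:

import numpy as np

def trunk_id_estimate(X, angle_threshold_deg=30.0, k_max=20):
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    nn = np.argsort(D, axis=1)[:, 1:]                          # neighbours, nearest first (self excluded)
    for k in range(1, k_max + 1):
        angles = []
        for i in range(N):
            V = X[nn[i, :k]] - X[i]                 # vectors from i to its k nearest neighbours
            Q, _ = np.linalg.qr(V.T)                # orthonormal basis of the spanned subspace
            w = X[nn[i, k]] - X[i]                  # vector to the (k+1)-th near neighbour
            w_proj = Q @ (Q.T @ w)                  # projection of w onto the subspace
            cos_a = np.linalg.norm(w_proj) / (np.linalg.norm(w) + 1e-12)
            angles.append(np.degrees(np.arccos(np.clip(cos_a, 0.0, 1.0))))
        if np.mean(angles) < angle_threshold_deg:   # small average angle => subspace captures the data
            return k
    return k_max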
32
Near neighbour algorithm
  • It is not clear how to select a suitable value for
    the threshold
  • Improvements to Trunk's method:
  • Pettis et al. [8]
  • Verveer-Duin [9]

33
Fractal methods
  • Global methods, but with a different definition of
    dimensionality
  • Basic idea:
  • count the observations inside a ball of radius r
    (f(r))
  • analyse the growth rate of f(r)
  • if f grows as r^k, the dimensionality of the data
    can be considered to be k

34
Fractal methods
  • The dimensionality can be fractional, e.g. 1.5
  • Hence fractal methods do not provide projections
    into a lower-dimensional space (what would R^1.5
    be, anyway?)
  • The fractal dimensionality estimate can be used in
    time-series analysis etc. [10]

35
Fractal methods
  • Different definitions of fractal dimension [10]:
  • Hausdorff dimension
  • Box-counting dimension
  • Correlation dimension
  • In order to get an accurate estimate of the
    dimension D, the data set cardinality must be at
    least 10^(D/2)

36
Hausdorff dimension
  • The data set is covered by cells s_i with variable
    diameters r_i, all r_i < r
  • In other words, we look for the collection of
    covering sets s_i with diameter less than or equal
    to r which minimizes the sum
  • this gives the d-dimensional Hausdorff measure
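Written out, with the r_i taken as the cell diameters, this reads

  \Gamma_H^{d}(r) = \inf_{\{ s_i \},\, r_i \le r} \sum_i r_i^{d} ,
  \qquad
  \Gamma_H^{d} = \lim_{r \to 0} \Gamma_H^{d}(r) .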

37
Hausdorff dimension
  • For every data set, \Gamma_H^d is infinite if d is
    less than some critical value D_H, and 0 if d is
    greater than D_H
  • The critical value D_H is the Hausdorff dimension
    of the data set

38
Box-Counting dimension
  • The Hausdorff dimension is not easy to calculate
  • The box-counting dimension D_B is an upper bound
    of the Hausdorff dimension, and usually does not
    differ from it

v(r): the number of boxes of size r needed to cover the data set
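With v(r) defined this way, the box-counting dimension is

  D_B = \lim_{r \to 0} \frac{\ln v(r)}{\ln (1/r)} .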
39
Box-Counting dimension
  • Although the box-counting dimension is easier to
    calculate than the Hausdorff dimension, the
    algorithmic complexity grows exponentially with
    the set dimensionality => it can be used only for
    low-dimensional data sets
  • The correlation dimension is a computationally
    more feasible fractal dimension measure
  • The correlation dimension is a lower bound of the
    box-counting dimension

40
Correlation dimension
  • Let x1, x2, x3, ..., xN be the data points
  • The correlation integral can be defined as follows
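In the standard (Grassberger-Procaccia) form,

  C(r) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} I\bigl( \lVert x_i - x_j \rVert \le r \bigr) .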

I(x) is the indicator function: I(x) = 1 if x is
true, and I(x) = 0 otherwise.
41
Correlation dimension
  • The correlation dimension D_C is defined as the
    limit of log C(r) / log r as r tends to 0. In
    practice it is estimated as the slope of log C(r)
    versus log r over a range of small radii (see the
    sketch below).
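A minimal sketch of this slope estimate; the caller-supplied radii are an assumption, and every C(r) must be positive for the log-log fit:

import numpy as np

def correlation_dimension(X, radii):
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    iu = np.triu_indices(N, k=1)                                # all pairs i < j
    C = np.array([np.mean(D[iu] <= r) for r in radii])          # correlation integral C(r)
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)          # slope of log C(r) vs log r
    return slope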

42
Literature
  1. M. Kirby, Geometric Data Analysis: An Empirical
    Approach to Dimensionality Reduction and the
    Study of Patterns, John Wiley and Sons, 2001.
  2. J. B. Kruskal, Multidimensional scaling by
    optimizing goodness of fit to a nonmetric
    hypothesis, Psychometrika 29 (1964) 1–27.
  3. R. N. Shepard, The analysis of proximities:
    Multidimensional scaling with an unknown distance
    function, Psychometrika 27 (1962) 125–140.
  4. R. S. Bennett, The intrinsic dimensionality of
    signal collections, IEEE Transactions on
    Information Theory 15 (1969) 517–525.
  5. J. W. Sammon, A nonlinear mapping for data
    structure analysis, IEEE Transactions on Computers
    C-18 (1969) 401–409.
  6. K. Fukunaga, D. R. Olsen, An algorithm for finding
    intrinsic dimensionality of data, IEEE
    Transactions on Computers C-20 (2) (1971) 176–183.
  7. G. V. Trunk, Statistical estimation of the
    intrinsic dimensionality of a noisy signal
    collection, IEEE Transactions on Computers 25
    (1976) 165–171.

43
Literature
  8. K. Pettis, T. Bailey, A. Jain, R. Dubes, An
    intrinsic dimensionality estimator from
    near-neighbor information, IEEE Transactions on
    Pattern Analysis and Machine Intelligence 1 (1)
    (1979) 25–37.
  9. P. J. Verveer, R. Duin, An evaluation of
    intrinsic dimensionality estimators, IEEE
    Transactions on Pattern Analysis and Machine
    Intelligence 17 (1) (1995) 81–86.
  10. F. Camastra, Data dimensionality estimation
    methods: a survey, Pattern Recognition 36 (2003)
    2945–2954.
  11. J. Venna, Dimensionality reduction for visual
    exploration of similarity structures (2007), PhD
    thesis manuscript (submitted).
  12. J. B. Tenenbaum, V. de Silva, J. C. Langford, A
    global geometric framework for nonlinear
    dimensionality reduction, Science 290 (12) (2000)
    2319–2323.
  13. P. Demartines, J. Herault, Curvilinear component
    analysis: A self-organizing neural network for
    nonlinear mapping in cluster analysis, IEEE
    Transactions on Neural Networks 8 (1) (1997)
    148–154.