Ch. 10. Unsupervised Learning and Clustering

Transcript and Presenter's Notes

1
Ch. 10. Unsupervised Learning and Clustering
2
Introduction
  • Supervised vs. unsupervised
  • Why unsupervised classification?
  • Collecting and labeling a large set of sample
    patterns can be surprisingly costly
  • One might wish to proceed in the reverse
    direction: train with unlabeled data first, and
    only then use supervision to label the groupings
  • Data mining applications: the contents of a large
    database are not known beforehand

3
Why unsupervised?
  • In many applications, the characteristics of the
    patterns can change slowly with time
  • To find features that will be useful for
    categorization
  • It may be valuable to perform exploratory data
    analysis, and thereby gain some insight into the
    nature of the data

4
Methods for unsupervised classification
  • While some of the resulting clustering procedures
    have no known significant theoretical properties,
    they are still among the more useful tools for
    pattern recognition problems
  • K-means, ISODATA, hierarchical clustering, etc.

5
Mixture densities and identifiability
  • We assume that we know the complete probability
    structure for the problem, with the sole exception
    of the values of some parameters

6
We assume that
  • We know the complete probability structure for
    the problem with the sole exception of the values
    of some parameters
  • The samples come from a known number c of classes
  • The prior probabilities P(ωj) for each class are
    known, j = 1, …, c
  • The forms of the class-conditional probability
    densities p(x | ωj, θj) are known, for j = 1, …, c
  • The values of the c parameter vectors θ1, …, θc
    are unknown
  • The category labels are unknown

7
Mixture densities and identifiability
  • Our basic goal is to use samples drawn from the
    mixture density to estimate the unknown parameter
    vector θ

8
Mixture densities and identifiability
  • Our basic goal is to use samples drawn from the
    mixture density to estimate the unknown parameter
    vector θ
  • Before seeking explicit solutions to this
    problem, let us ask whether or not it is possible
    in principle to recover θ from the mixture
  • If there is only one value of θ that produces the
    observed values for p(x | θ), then a solution is at
    least possible in principle
  • If several different values of θ can produce the
    same values for p(x | θ), then there is no hope of
    obtaining a unique solution

9
Mixture densities and identifiability
  • A density p(x | θ) is said to be identifiable if
    θ ≠ θ′ implies that there exists an x such that
    p(x | θ) ≠ p(x | θ′)
  • The study of unsupervised learning is greatly
    simplified if we restrict ourselves to
    identifiable mixtures
  • Fortunately, most mixtures of commonly
    encountered density functions are identifiable

10
Mixture densities and identifiability
  • A simple example: a mixture of two Bernoulli
    components with parameters θ1 and θ2, so that
    P(x = 1 | θ) = (θ1 + θ2)/2
  • Suppose that we know for our data that
    P(x = 1 | θ) = 0.6, and hence that P(x = 0 | θ) = 0.4
  • Then we know the function P(x | θ), but we cannot
    determine θ, and hence cannot extract the
    component distributions: the mixture is completely
    unidentifiable

11
Mixture densities and identifiability
  • For the continuous case, the problems are less
    severe, although certain minor difficulties can
    arise due to the possibility of special cases
  • The density cannot be uniquely identifiable if
    P(ω1) = P(ω2), since then θ1 and θ2 are
    interchangeable without affecting p(x | θ)

12
Maximum-Likelihood Estimates
  • Suppose that we are given a set D = {x1, …, xn} of n
    unlabeled samples drawn independently from the
    mixture density (written out below), where the
    parameter vector θ is fixed but unknown
  • The likelihood of the observed samples is the
    joint density p(D | θ), shown below
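(The mixture density and the likelihood referred to above, reconstructed in
LaTeX; these standard forms follow from the definitions on the earlier slides:)

    p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)

    p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)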

13
Maximum-Likelihood Estimates
  • The maximum-likelihood estimate θ̂ is the value of
    θ that maximizes p(D | θ)

14
Maximum-Likelihood Estimates
  • The gradient must vanish at the value of ?i that
    maximizes l, i.e. ?i must satisfy the conditions
  • The maximum-likelihood estimate of the
    probability of a category is the average over the
    entire data set of the estimate derived from each
    sample, i.e.,
  • Bayes theorem
  • And

(These can be shown)
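(A reconstruction of the conditions referred to above, which later slides cite
as eqs. 11-13; these are the standard maximum-likelihood results for mixtures:)

    \hat P(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)

    \sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\,
        \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat\theta_i) = 0

    \hat P(\omega_i \mid x_k, \hat\theta)
        = \frac{p(x_k \mid \omega_i, \hat\theta_i)\, \hat P(\omega_i)}
               {\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat\theta_j)\, \hat P(\omega_j)}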
15
Application to Normal Mixture
  • It is enlightening to see how these general
    results apply to the case where the component
    densities are multivariate normal
  • Case 1: the simplest; it will be considered in
    detail because of its pedagogical value
  • Case 2: more realistic
  • Case 3: the problem we face on encountering a
    completely unknown set of data; it cannot be
    solved by maximum-likelihood methods because the
    number of classes is also unknown (Sec. 10.10)

(Case 1: Σ known, μ unknown)
16
Application to Normal Mixture
  • Case 1: Unknown Mean Vectors
  • The only unknown quantities are the mean vectors
    μi
  • The estimate μ̂i is a weighted average of the
    samples; the weight for the kth sample is an
    estimate of how likely it is that xk belongs to
    the ith class

17
Application to Normal Mixture
  • Case 1: Unknown Mean Vectors (cont.)
  • The estimate μ̂i is a weighted average of the
    samples; the weight for the kth sample is an
    estimate of how likely it is that xk belongs to
    the ith class
  • Unfortunately, the equation does not give an
    explicit estimate of μi
  • If we substitute the posterior probability
    P̂(ωi | xk, μ̂) into it, we obtain simultaneous
    nonlinear equations, which do not have a unique
    solution, and we must test the solutions we get to
    find the one that actually maximizes the likelihood

18
Application to Normal Mixture
  • If we have some way of obtaining fairly good
    initial estimates for the unknown means, the
    following iterative scheme (reconstructed below)
    can improve the estimates
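(A reconstruction of the iterative scheme referred to above, assuming the
standard posterior-weighted mean update for Case 1:)

    \hat\mu_i(j+1) = \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\mu(j))\, x_k}
                          {\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\mu(j))}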

19
Application to Normal Mixture
  • Case 1: Unknown Mean Vectors
  • Source density: a mixture of two normals with true
    means μ1 = -2, μ2 = 2
  • 25 samples were drawn from this mixture

20
Application to Normal Mixture
  • The 25 samples are given
  • True means: μ1 = -2, μ2 = 2
  • The maximum of the log-likelihood occurs at
    μ̂1 = -2.130, μ̂2 = 1.668
  • Another comparable peak of l occurs near μ̂1 = 2,
    μ̂2 = -1.257, with the roles of the two components
    interchanged

21
Application to Normal Mixture
  • Case 2: All Parameters Unknown
  • If μi, Σi, and P(ωi) are all unknown and no
    constraints are placed on the covariance matrices,
    the maximum-likelihood principle yields useless
    singular solutions
  • A simple example: a two-component mixture in which
    one component has unknown mean μ and variance σ²
    (reconstructed after the next slide)
  • The likelihood function for n samples drawn from
    this density is simply the product of the n
    mixture densities
  • Suppose we make the choice μ = x1
  • For the rest of the samples the likelihood stays
    bounded away from zero (see the reconstruction
    after the next slide)

22
Application to Normal Mixture
  • Case 2: All Parameters Unknown (cont.)
  • By letting σ approach zero we can make the
    likelihood arbitrarily large, so the
    maximum-likelihood solution is singular
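(A reconstruction of the example referred to above, assuming the standard
two-component normal mixture with one component of unknown mean and variance
and one fully known component:)

    p(x \mid \mu, \sigma) = \frac{1}{2\sqrt{2\pi}\,\sigma}
        \exp\!\Big[-\frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2\Big]
      + \frac{1}{2\sqrt{2\pi}} \exp\!\Big[-\frac{x^2}{2}\Big]

With the choice \mu = x_1, the first term contributes 1/(2\sqrt{2\pi}\sigma)
at x_1, which grows without bound as \sigma \to 0, while for the remaining
samples p(x_k \mid \mu, \sigma) \ge \frac{1}{2\sqrt{2\pi}} e^{-x_k^2/2} > 0,
so the product p(x_1, \ldots, x_n \mid \mu, \sigma) can be made arbitrarily large.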

23
Application to Normal Mixture
  • Case 2: All Parameters Unknown
  • A meaningful solution can be obtained if we look
    at the largest of the finite local maxima of the
    likelihood function (using eqs. 11-13)

24
Application to Normal Mixture
  • Case 2: All Parameters Unknown (cont.)
  • By substituting the previous results into eq. 12,
    we obtain expressions for the local
    maximum-likelihood estimates of the parameters
    (reconstructed below)
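(A reconstruction of those expressions, assuming the standard local
maximum-likelihood estimates for a normal mixture:)

    \hat P(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)

    \hat\mu_i = \frac{\sum_{k} \hat P(\omega_i \mid x_k, \hat\theta)\, x_k}
                     {\sum_{k} \hat P(\omega_i \mid x_k, \hat\theta)}

    \hat\Sigma_i = \frac{\sum_{k} \hat P(\omega_i \mid x_k, \hat\theta)\,
                         (x_k - \hat\mu_i)(x_k - \hat\mu_i)^t}
                        {\sum_{k} \hat P(\omega_i \mid x_k, \hat\theta)}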

25
k-means (or c-means) clustering algorithm
  • Approximating the previous statistical method
    simplifies computation and accelerates convergence
  • It is clear that P̂(ωi | xk, θ̂) is large when the
    squared Mahalanobis distance
    (xk - μ̂i)^T Σ̂i^{-1} (xk - μ̂i) is small
  • Suppose we simplify the covariance structure and
    approximate P̂ as 1 or 0, according to the distance
    between xk and the nearest mean μ̂m
  • → Eq. 25 (the update equation for the mean) then
    leads to the k-means procedure

26
k-means (or c-means) clustering algorithm
  • Suppose a diagonal covariance matrix, and that P̂
    is approximated as 1 or 0, according to the
    distance between xk and the nearest mean μ̂m (see
    the sketch below)
  • → The update equation for the mean then leads to
    the k-means procedure
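(The hard 0/1 approximation described above, written out in its standard form;
Di denotes the set of ni samples currently nearest to μ̂i:)

    \hat P(\omega_i \mid x_k, \hat\mu) \approx
        \begin{cases} 1 & \text{if } i = \arg\min_j \|x_k - \hat\mu_j\| \\
                      0 & \text{otherwise} \end{cases}

    \hat\mu_i = \frac{1}{n_i} \sum_{x_k \in \mathcal{D}_i} x_k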

27
K-means clustering algorithm
  • begin initialize n, c, μ1, μ2, …, μc
  •   do classify the n samples according to the
        nearest μi
  •     recompute μi
  •   until no change in μi
  •   return μ1, μ2, …, μc
  • end
  • An approximate procedure for obtaining
    maximum-likelihood estimates of the means
  • Or, an iterative optimization procedure for
    minimizing a squared-error criterion function
    (a runnable sketch follows)
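(A minimal NumPy sketch of the k-means procedure above; the initialization from
the first c samples and the max_iter cap are assumptions, not part of the slides.)

    import numpy as np

    def k_means(X, c, max_iter=100):
        """Cluster the rows of X into c groups by nearest-mean assignment."""
        means = X[:c].astype(float).copy()   # assumed initialization: first c samples
        for _ in range(max_iter):
            # classify the n samples according to the nearest mean
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each mean from the samples assigned to it
            new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else means[i] for i in range(c)])
            if np.allclose(new_means, means):  # until no change in the means
                break
            means = new_means
        return means, labels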

28
K-means clustering algorithm
29
K-means clustering algorithm
30
K-means clustering algorithm
31
Application to normal mixture
32
Cluster validity measure
33
Data description and clustering
  • Unsupervised classification (clustering) →
    learning the structure of multidimensional
    patterns from a set of unlabeled samples
  • These samples may form clouds of points in a
    d-dimensional space
  • Suppose that we knew that these points came from
    a single normal distribution
  • The most we could learn from the data is
    contained in the sufficient statistics: the sample
    mean and the sample covariance matrix
  • The sample mean: the center of gravity of the
    cloud
  • The covariance matrix: the amount the data
    scatters along various directions around the mean
  • However, if a normal distribution cannot be
    assumed, these statistics can give a very
    misleading description of the data

34
Data description and clustering
  • These four data sets are identical up to
    second-order statistics
  • However, if a normal distribution cannot be
    assumed, these statistics can give a very
    misleading description of the data

35
Data description and clustering
  • If we assume that the samples come from a
    mixture of c normal distributions, we can
    approximate a greater variety of situations
  • Clustering procedures
  • Yields a data description in terms of clusters
    or groups of data points that possess strong
    internal similarities
  • We should use a proper criterion function
    (similarity measure)

36
Data description and clustering
  • Similarity Measures
  • Once the clustering problem is described as one
    of finding natural groupings in a set of data, we
    are obliged to define what we mean by a natural
    grouping.
  • How should one measure the similarity between
    samples?
  • How should one evaluate a partitioning of a set
    of samples into clusters?

37
Data description and clustering
  • The most obvious measure of the similarity
    between two samples is the distance between them
  • This requires a suitable metric and the
    computation of the matrix of distances between all
    pairs of samples
  • One would expect the distance between samples in
    the same cluster to be much smaller (<<) than the
    distance between samples in different clusters

38
Data description and clustering
  • d0: the threshold distance below which two samples
    are considered to belong to the same cluster (a
    sketch follows)
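(A small NumPy sketch, not part of the slides: compute the pairwise distance
matrix and group samples that are chained together by distances below d0.)

    import numpy as np

    def threshold_clusters(X, d0):
        """Label samples by connected components of the 'distance < d0' graph."""
        n = len(X)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
        labels = np.full(n, -1, dtype=int)
        current = 0
        for i in range(n):
            if labels[i] >= 0:
                continue
            stack = [i]
            labels[i] = current
            while stack:                      # flood-fill one cluster
                j = stack.pop()
                for k in np.where((D[j] < d0) & (labels < 0))[0]:
                    labels[k] = current
                    stack.append(k)
            current += 1
        return labels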

39
Similarity measures Euclidean distance
  • The results of clustering depend on the choice of
    Euclidean distance as the measure of dissimilarity
  • Clusters defined by Euclidean distance will be
    invariant to translations or rotations in feature
    space
  • But they will not be invariant to linear
    transformations in general
  • To achieve invariance, the data can be normalized
    before clustering (a sketch follows)
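(A minimal sketch of the normalization mentioned above: translate and scale each
feature to zero mean and unit variance. As the next slide notes, this routine
normalization is not always desirable.)

    import numpy as np

    def standardize(X):
        """Shift each feature to zero mean and scale it to unit variance."""
        return (X - X.mean(axis=0)) / X.std(axis=0)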

40
Data description and clustering
  • Similarity Measures (cont.)
  • Note that this kind of normalization is not
    necessarily desirable
  • Example: translating and scaling the axes so that
    each feature has zero mean and unit variance
  • Routine normalization may be less than helpful in
    the cases of greatest interest (it can reduce the
    separation between clusters), as shown below

41
Criterion functions for clustering
  • The criterion function (to be optimized)
  • We should define a criterion function that
    measures the clustering quality of any partition
    of the data
  • We then find the partition that extremizes
    (maximizes or minimizes) the criterion function
  • The Sum-of-Squared-Error Criterion (written out
    below)
  • The simplest and most widely used criterion
    function for clustering
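(The sum-of-squared-error criterion written out; Di denotes the set of ni
samples assigned to cluster i and mi its mean:)

    J_e = \sum_{i=1}^{c} \sum_{x \in \mathcal{D}_i} \|x - m_i\|^2,
    \qquad m_i = \frac{1}{n_i} \sum_{x \in \mathcal{D}_i} x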

42
Criterion functions for clustering
  • Je is determined by how the samples are grouped
    into clusters and by the number of clusters
    (minimum-variance partitions)
  • Problems
  • The presence of outliers ("wild shots")
  • When the numbers of samples in different clusters
    differ greatly
  • If additional considerations render the results
    of minimizing Je unsatisfactory, these
    considerations should be used in formulating a
    better criterion function