Ch. 10. Unsupervised Learning and Clustering


  • Supervised vs. unsupervised
  • Why unsupervised classification?
  • Collecting and labeling a large set of sample
    patterns can be surprisingly costly
  • One might wish to proceed in the reverse
    direction: train with large amounts of unlabeled
    data, and only then use supervision to label the
    groupings found
  • Data mining applications: the contents of a large
    database are not known beforehand

Why unsupervised?
  • In many applications, the characteristics of the
    patterns can change slowly with time
  • To find features that will be useful for
    categorization
  • It may be valuable to perform exploratory data
    analysis, and thereby gain some insight into the
    nature of the data

Methods for unsupervised classification
  • While some of the resulting clustering procedures
    have no known significant theoretical properties,
    they are still among the more useful tools for
    pattern recognition problems
  • Examples: k-means, ISODATA, hierarchical
    clustering, etc.

Mixture densities and identifiability
  • We assume that we know the complete probability
    structure for the problem, with the sole
    exception of the values of some parameters:

  • The samples come from a known number $c$ of classes
  • The prior probabilities $P(\omega_j)$ for each
    class are known, $j = 1, \ldots, c$
  • The forms of the class-conditional probability
    densities $p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)$
    are known, $j = 1, \ldots, c$
  • The values of the $c$ parameter vectors
    $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c$
    are unknown
  • The category labels are unknown

Mixture densities and identifiability
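  • The samples are drawn from the mixture density
    $$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j), \qquad \boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c)^t$$
  • Our basic goal is to use samples drawn from the
    mixture density to estimate the unknown parameter
    vector $\boldsymbol{\theta}$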

Mixture densities and identifiability
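  • A density $p(\mathbf{x} \mid \boldsymbol{\theta})$ is said to be
    identifiable if $\boldsymbol{\theta} \neq \boldsymbol{\theta}'$
    implies that there exists an $\mathbf{x}$ such that
    $p(\mathbf{x} \mid \boldsymbol{\theta}) \neq p(\mathbf{x} \mid \boldsymbol{\theta}')$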
  • The study of unsupervised learning is greatly
    simplified if we restrict ourselves to
    identifiable mixtures
  • Fortunately, most mixtures of commonly
    encountered density functions are identifiable

Mixture densities and identifiability
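  • A simple example: $x$ is binary and
    $P(x \mid \boldsymbol{\theta})$ is the mixture
    $$P(x \mid \boldsymbol{\theta}) = \tfrac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \tfrac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x}$$
  • Suppose that we know for our data that
    $P(x = 1 \mid \boldsymbol{\theta}) = 0.6$ and hence that
    $P(x = 0 \mid \boldsymbol{\theta}) = 0.4$
  • Then we know the function $P(x \mid \boldsymbol{\theta})$,
    but we cannot determine $\boldsymbol{\theta}$, and hence
    cannot extract the component distributions: the
    most we can say is that $\theta_1 + \theta_2 = 1.2$,
    so the mixture is completely unidentifiable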
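
The unidentifiability in this example can be checked numerically. The following is a minimal sketch, assuming the equal-weight two-component Bernoulli mixture written out in the comments; the function name `mixture_pmf` and the candidate parameter pairs are illustrative choices, not taken from the slides.

```python
def mixture_pmf(x, theta1, theta2):
    """P(x | theta) for an equal-weight mixture of two Bernoulli components.

    Assumed form: 0.5 * theta1^x (1-theta1)^(1-x) + 0.5 * theta2^x (1-theta2)^(1-x).
    """
    component1 = theta1 ** x * (1 - theta1) ** (1 - x)
    component2 = theta2 ** x * (1 - theta2) ** (1 - x)
    return 0.5 * component1 + 0.5 * component2


# Hypothetical parameter pairs; each satisfies theta1 + theta2 = 1.2.
candidates = [(0.6, 0.6), (0.5, 0.7), (0.3, 0.9)]

for theta1, theta2 in candidates:
    # Every pair yields the same observable distribution over x.
    print(theta1, theta2, mixture_pmf(1, theta1, theta2), mixture_pmf(0, theta1, theta2))
```

Every pair produces P(x = 1) of approximately 0.6 and P(x = 0) of approximately 0.4, so the data alone cannot distinguish between them.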

Mixture densities and identifiability
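  • For the continuous case, the problems are less
    severe, although certain minor difficulties can
    arise due to the possibility of special cases
  • Example: the mixture of two univariate normal
    densities
    $$p(x \mid \boldsymbol{\theta}) = \frac{P(\omega_1)}{\sqrt{2\pi}}\exp\left[-\tfrac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}}\exp\left[-\tfrac{1}{2}(x-\theta_2)^2\right]$$
    cannot be uniquely identified if
    $P(\omega_1) = P(\omega_2)$, since then $\theta_1$ and
    $\theta_2$ are interchangeable without affecting
    $p(x \mid \boldsymbol{\theta})$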

Maximum-Likelihood Estimates
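  • Suppose that we are given a set
    $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of $n$ unlabeled
    samples drawn independently from the mixture
    density
    $$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$
    where $\boldsymbol{\theta}$ is fixed but unknown
  • The likelihood of the observed samples is
    $$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$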

Maximum-Likelihood Estimates
  • The maximum-likelihood estimate
    $\hat{\boldsymbol{\theta}}$ is the value of
    $\boldsymbol{\theta}$ that maximizes
    $p(D \mid \boldsymbol{\theta})$

Maximum-Likelihood Estimates
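  • The gradient must vanish at the value of
    $\boldsymbol{\theta}_i$ that maximizes the
    log-likelihood
    $l = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta})$,
    i.e., $\hat{\boldsymbol{\theta}}_i$ must satisfy the
    conditions
    $$\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0, \qquad i = 1, \ldots, c$$
  • If the prior probabilities are also unknown, the
    maximum-likelihood estimate of the probability
    of a category is the average over the entire
    data set of the estimate derived from each
    sample, i.e.,
    $$\hat{P}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) \qquad (11)$$
  • Bayes theorem:
    $$\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)} \qquad (13)$$
  • And
    $$\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0 \qquad (12)$$
  • (These results can be shown)
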
Application to Normal Mixture
  • It is enlightening to see how these general
    results apply to the case where the component
    densities are multivariate normal
  • Case 1: the simplest, and it will be considered
    in detail because of its pedagogical value
  • Case 2: more realistic
  • Case 3: the problem we face on encountering a
    completely unknown set of data; because the
    number of classes is unknown, it cannot be
    solved by maximum-likelihood methods (Sec. 10.10)

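    Case   $\boldsymbol{\mu}_i$   $\boldsymbol{\Sigma}_i$   $P(\omega_i)$   $c$
    1      ?         x            x               x
    2      ?         ?            ?               x
    3      ?         ?            ?               ?
    (x = known, ? = unknown)
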
Application to Normal Mixture
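  • Case 1: Unknown Mean Vectors
  • The only unknown quantities are the mean vectors
    $\boldsymbol{\mu}_i$, and the maximum-likelihood
    estimate must satisfy
    $$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})\, \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})}$$
  • This is a weighted average of the samples: the
    weight for the $k$th sample is an estimate of how
    likely it is that $\mathbf{x}_k$ belongs to the
    $i$th class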

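  • Case 1: Unknown Mean Vectors (cont.)
  • Unfortunately, the equation does not give an
    explicit estimate of $\hat{\boldsymbol{\mu}}_i$
  • If we substitute
    $$P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}) = \frac{p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\mu}}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \hat{\boldsymbol{\mu}}_j)\, P(\omega_j)}$$
    with normal component densities
    $p(\mathbf{x} \mid \omega_i, \hat{\boldsymbol{\mu}}_i) \sim N(\hat{\boldsymbol{\mu}}_i, \boldsymbol{\Sigma}_i)$,
  • we obtain simultaneous nonlinear equations,
    which do not have a unique solution, and we must
    test the solutions we get to find the one that
    actually maximizes the likelihood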

Application to Normal Mixture
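  • If we have some way of obtaining fairly good
    initial estimates $\hat{\boldsymbol{\mu}}_i(0)$ for
    the unknown means, the following iterative
    scheme could improve the estimates:
    $$\hat{\boldsymbol{\mu}}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))\, \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))}$$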
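
The iterative re-estimation of the means can be sketched in a few lines. This is a minimal illustration, assuming one-dimensional data, two components with unit variance, and known priors of 1/3 and 2/3; the synthetic samples, the starting point (-1, 1), and all function names are assumptions made for the example only.

```python
import math
import random


def normal_pdf(x, mean):
    """Univariate normal density with unit variance (an assumption of this sketch)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)


def update_means(samples, means, priors):
    """One pass of the update: each new mean is a posterior-weighted sample average."""
    posteriors = []
    for x in samples:
        # Bayes' rule: P(omega_i | x, current means)
        joint = [p * normal_pdf(x, m) for p, m in zip(priors, means)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    new_means = []
    for i in range(len(means)):
        weight_sum = sum(post[i] for post in posteriors)
        weighted_x = sum(post[i] * x for post, x in zip(posteriors, samples))
        new_means.append(weighted_x / weight_sum)
    return new_means


def log_likelihood(samples, means, priors):
    """Log-likelihood of the samples under the mixture with the given means."""
    return sum(
        math.log(sum(p * normal_pdf(x, m) for p, m in zip(priors, means)))
        for x in samples
    )


def estimate_means(samples, initial_means, priors, iterations=50):
    means = list(initial_means)
    for _ in range(iterations):
        means = update_means(samples, means, priors)
    return means


# Synthetic data (illustrative only): components centred at -2 and 2.
random.seed(0)
priors = [1 / 3, 2 / 3]
samples = [
    random.gauss(-2.0 if random.random() < priors[0] else 2.0, 1.0)
    for _ in range(200)
]
estimated = estimate_means(samples, [-1.0, 1.0], priors)
print(estimated, log_likelihood(samples, estimated, priors))
```

Because the equations have more than one solution, a different starting point may lead the same loop to a different local maximum, so the resulting log-likelihood values should be compared.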

Application to Normal Mixture
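  • Case 1: Unknown Mean Vectors
  • Source density with true means $\mu_1 = -2$,
    $\mu_2 = 2$
  • 25 samples were drawn from
    $$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}}\exp\left[-\tfrac{1}{2}(x-\mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}}\exp\left[-\tfrac{1}{2}(x-\mu_2)^2\right]$$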

Application to Normal Mixture
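  • 25 samples are given
  • True means: $\mu_1 = -2$, $\mu_2 = 2$
  • The maximum value of the log-likelihood occurs at
    $\hat{\mu}_1 = -2.130$, $\hat{\mu}_2 = 1.668$
  • Another peak with a comparable value of $l$
    occurs near $\hat{\mu}_1 = 2$, where the roles of
    the two means are roughly interchanged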

Application to Normal Mixture
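  • Case 2: All Parameters Unknown
  • If $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$, and
    $P(\omega_j)$ are all unknown and no constraints
    are placed on the covariance matrix, the
    maximum-likelihood principle yields useless
    singular solutions
  • A simple example: the two-component mixture
    $$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}}\exp\left[-\frac{1}{2}x^2\right]$$
  • The likelihood function for $n$ samples drawn from
    this density is simply the product of the $n$
    densities $p(x_k \mid \mu, \sigma^2)$
  • Suppose we make the choice $\mu = x_1$, so that
    $$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}}\exp\left[-\frac{1}{2}x_1^2\right]$$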
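  • For the rest of the samples,
    $$p(x_k \mid \mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}}\exp\left[-\frac{1}{2}x_k^2\right]$$
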
Application to Normal Mixture
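  • Case 2: All Parameters Unknown (cont.)
  • With $\mu = x_1$ the likelihood is bounded below:
    $$p(x_1, \ldots, x_n \mid \mu, \sigma^2) \geq \left\{\frac{1}{\sigma} + \exp\left[-\frac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right]$$
  • By letting $\sigma$ approach zero we can make the
    likelihood arbitrarily large, and the
    maximum-likelihood solution is singular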
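
The singular solution is easy to observe numerically. The sketch below assumes the two-component density stated in the code comment (one component with free mean and standard deviation, one fixed standard normal, equal weights); the five sample values are made up purely for illustration.

```python
import math


def mixture_density(x, mu, sigma):
    """Assumed density: 0.5 * N(mu, sigma^2) + 0.5 * N(0, 1)."""
    narrow = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (2 * math.sqrt(2 * math.pi) * sigma)
    fixed = math.exp(-0.5 * x ** 2) / (2 * math.sqrt(2 * math.pi))
    return narrow + fixed


def log_likelihood(samples, mu, sigma):
    return sum(math.log(mixture_density(x, mu, sigma)) for x in samples)


# Hypothetical samples (not from the slides).
samples = [0.5, -1.2, 0.3, 2.1, -0.7]
mu = samples[0]  # place the free mean exactly on the first sample

# Shrinking sigma makes the log-likelihood grow without bound.
sigmas = (1e-2, 1e-4, 1e-6)
values = [log_likelihood(samples, mu, sigma) for sigma in sigmas]
for sigma, value in zip(sigmas, values):
    print(sigma, value)
```

Each hundredfold reduction in sigma adds roughly ln(100) to the log-likelihood, which is the behavior the bound predicts.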

Application to Normal Mixture
  • Case 2: All Parameters Unknown
  • A meaningful solution can be obtained if we look
    at the largest of the finite local maxima of the
    likelihood function (use Eqs. 11-13)

Application to Normal Mixture
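  • Case 2: All Parameters Unknown (cont.)
  • By substituting the previous results into Eq. 12,
    we obtain expressions for the local
    maximum-likelihood estimates of the parameters:
    $$\hat{P}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})$$
    $$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})}$$
    $$\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})}$$
  • Here $\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})$
    is given by Bayes theorem with normal component
    densities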
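
When all parameters are unknown, the local maximum-likelihood estimates (posterior-weighted priors, means, and covariances) can be applied repeatedly as a re-estimation loop. The following is a rough sketch for the univariate case, where each covariance matrix reduces to a variance; the synthetic data, the starting values, and the function names are assumptions for illustration, and no guard against the singular solutions is included.

```python
import math
import random


def normal_pdf(x, mean, variance):
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)


def reestimate(samples, priors, means, variances):
    """One pass: posteriors from the current parameters, then new priors, means, variances."""
    n, c = len(samples), len(priors)
    posteriors = []
    for x in samples:
        joint = [priors[i] * normal_pdf(x, means[i], variances[i]) for i in range(c)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    new_priors, new_means, new_variances = [], [], []
    for i in range(c):
        weight = sum(post[i] for post in posteriors)
        mean = sum(post[i] * x for post, x in zip(posteriors, samples)) / weight
        variance = sum(post[i] * (x - mean) ** 2 for post, x in zip(posteriors, samples)) / weight
        new_priors.append(weight / n)   # average posterior over the data set
        new_means.append(mean)          # posterior-weighted sample mean
        new_variances.append(variance)  # posterior-weighted scatter about the new mean
    return new_priors, new_means, new_variances


# Synthetic data (illustrative only): 1/3 of the samples near -2, 2/3 near 2.
random.seed(1)
samples = [
    random.gauss(-2.0 if random.random() < 1 / 3 else 2.0, 1.0)
    for _ in range(300)
]
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(40):
    priors, means, variances = reestimate(samples, priors, means, variances)
print(priors, means, variances)
```

Starting from reasonable initial values keeps the loop near a finite local maximum; a component that collapses onto a single sample would drive its variance toward zero.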

k-means (or c-means) clustering algorithm
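  • An approximation of the previous statistical
    methods that simplifies computation and
    accelerates convergence
  • It is clear that
    $\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})$
    is large when the squared Mahalanobis distance
    $(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)$
    is small, since
    $$\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{|\hat{\boldsymbol{\Sigma}}_i|^{-1/2} \exp\left[-\tfrac{1}{2}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\boldsymbol{\Sigma}}_j|^{-1/2} \exp\left[-\tfrac{1}{2}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)^t \hat{\boldsymbol{\Sigma}}_j^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)\right] \hat{P}(\omega_j)}$$
  • Suppose a diagonal covariance matrix, and
    $\hat{P}$ is approximated as 1 or 0 (according to
    the distance between $\mathbf{x}_k$ and the nearest
    mean $\hat{\boldsymbol{\mu}}_m$)
  • ⇒ Eq. 25 (update eq. for the mean) leads to the
    k-means procedure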

k-means (or c-means) clustering algorithm
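  • Suppose a diagonal covariance matrix, and
    $\hat{P}$ is approximated as 1 or 0 according to
    the distance between $\mathbf{x}_k$ and the nearest
    mean $\hat{\boldsymbol{\mu}}_m$:
    $$\hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$
  • ⇒ The update equation for the mean leads to
    k-means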

K-means clustering algorithm
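  • Begin: initialize $n$, $c$,
    $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$
  • Do: classify the $n$ samples according to the
    nearest $\boldsymbol{\mu}_i$
  • Recompute $\boldsymbol{\mu}_i$
  • Until no change in $\boldsymbol{\mu}_i$
  • Return
    $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$
  • An approximate procedure for obtaining
    maximum-likelihood estimates of the means
  • Or, an iterative optimization procedure for the
    minimization of a squared-error criterion function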
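
The procedure above can be sketched as follows. This is a minimal illustration rather than a reference implementation: samples are plain tuples, distances are squared Euclidean, and the handling of an empty cluster (it keeps its previous mean) is an assumption, since the slides do not address it. The toy data and initial means are also made up.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(samples, initial_means, max_iterations=100):
    """Alternate nearest-mean classification and mean recomputation until the means stop changing."""
    means = [tuple(m) for m in initial_means]
    clusters = [[] for _ in means]
    for _ in range(max_iterations):
        # Classify each sample according to its nearest mean.
        clusters = [[] for _ in means]
        for x in samples:
            nearest = min(range(len(means)), key=lambda i: squared_distance(x, means[i]))
            clusters[nearest].append(x)
        # Recompute each mean; an empty cluster keeps its old mean (assumption).
        new_means = []
        for mean, cluster in zip(means, clusters):
            if cluster:
                new_means.append(
                    tuple(sum(x[d] for x in cluster) / len(cluster) for d in range(len(mean)))
                )
            else:
                new_means.append(mean)
        if new_means == means:  # no change in the means: stop
            break
        means = new_means
    return means, clusters


# Toy data: two well-separated groups of four points each.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
means, clusters = k_means(samples, initial_means=[(0, 0), (9, 9)])
print(means)  # [(0.5, 0.5), (8.5, 8.5)]
```

The result depends on the initial means; with a poor initialization the loop can settle in a partition that is only locally optimal.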

Cluster validity measure
Data description and clustering
  • Unsupervised classification (clustering):
    learning the structure of multidimensional
    patterns from a set of unlabeled samples
  • These samples may form clouds of points in a
    d-dimensional space
  • Suppose that we knew that these points came from
    a single normal distribution
  • The most we could learn from the data is contained
    in the sufficient statistics: the sample mean
    and the sample covariance matrix
  • The sample mean: the center of gravity of the
    cloud
  • The covariance matrix: the amount the data
    scatters along various directions around the mean
  • However, if a normal distribution cannot be
    assumed, these statistics can give a very
    misleading description of the data

Data description and clustering
  • These four data sets have identical statistics
    up to second order (the same mean and covariance)
Data description and clustering
  • If we assume that the samples come from a
    mixture of c normal distributions, we can
    approximate a greater variety of situations
  • Clustering procedures yield a data description
    in terms of clusters or groups of data points
    that possess strong internal similarities
  • We should use a proper criterion function
    (similarity measure)

Data description and clustering
  • Similarity Measures
  • Once the clustering problem is described as one
    of finding natural groupings in a set of data, we
    are obliged to define what we mean by a natural
    grouping
  • How should one measure the similarity between
    samples?
  • How should one evaluate a partitioning of a set
    of samples into clusters?

Data description and clustering
  • The most obvious measure of the similarity
    between two samples is the distance between them
  • This requires choosing a suitable metric and
    computing the matrix of distances between all
    pairs of samples
  • One would expect that the distance between
    samples in the same cluster $\ll$ the distance
    between samples in different clusters

Data description and clustering
  • $d_0$: threshold distance; two samples are placed
    in the same cluster if the distance between them
    is less than $d_0$
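
A threshold distance can be turned into a simple clustering rule. The sketch below assumes that two samples are linked whenever their Euclidean distance is below the threshold and that clusters are the connected groups of linked samples; the function name `threshold_clusters` and the five points are illustrative, not taken from the slides.

```python
import math


def threshold_clusters(samples, d0):
    """Label samples so that any two closer than d0 share a label (transitively)."""
    n = len(samples)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        # Grow a new cluster outward from this unlabeled sample.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and math.dist(samples[i], samples[j]) < d0:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
print(threshold_clusters(points, 0.5))   # [0, 1, 2, 3, 4]  every point alone
print(threshold_clusters(points, 1.5))   # [0, 0, 0, 1, 1]  two clusters
print(threshold_clusters(points, 20.0))  # [0, 0, 0, 0, 0]  one cluster
```

The three calls show how strongly the outcome depends on the threshold: too small and every sample is its own cluster, too large and everything merges.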

Similarity measures: Euclidean distance
  • The results of clustering depend on the choice of
    Euclidean distance as a measure of dissimilarity
  • Clusters defined by Euclidean distance will be
    invariant to translations or rotations in feature
    space
  • But they will not be invariant to linear
    transformations in general
  • To achieve invariance, normalize the data
    before clustering
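
A small numerical check makes the invariance properties concrete. In this sketch (the three points, the rotation angle, and the helper names are arbitrary illustrative choices), a rotation leaves the nearest neighbor of a point unchanged, while stretching a single feature changes it.

```python
import math


def nearest_neighbor(index, samples):
    """Index of the sample closest (in Euclidean distance) to samples[index]."""
    others = [j for j in range(len(samples)) if j != index]
    return min(others, key=lambda j: math.dist(samples[index], samples[j]))


def rotate(point, angle):
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


samples = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
rotated = [rotate(p, 0.7) for p in samples]
stretched = [(2.0 * x, y) for x, y in samples]  # rescale the first feature only

print(nearest_neighbor(0, samples))    # 1
print(nearest_neighbor(0, rotated))    # 1  (rotation preserves distances)
print(nearest_neighbor(0, stretched))  # 2  (scaling one axis changes the grouping)
```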

Data description and clustering
  • Similarity Measures (cont.)
  • Note that this kind of normalization is not
    necessarily desirable
  • Ex: the matter of translating and scaling the
    axes so that each feature has zero mean and unit
    variance
  • Routine normalization may be less than helpful in
    the cases of greatest interest, because it can
    reduce the separation between clusters
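
The effect of routine normalization on cluster separation can be illustrated with a toy example. The sketch below standardizes each feature to zero mean and unit variance (population variance is assumed, as is the absence of constant features) and compares a simple separation ratio before and after; the ratio, the helper names, and the eight points are assumptions chosen for illustration.

```python
import math


def standardize(samples):
    """Shift and scale each feature to zero mean and unit (population) variance."""
    n, dim = len(samples), len(samples[0])
    means = [sum(x[d] for x in samples) / n for d in range(dim)]
    stds = [math.sqrt(sum((x[d] - means[d]) ** 2 for x in samples) / n) for d in range(dim)]
    return [tuple((x[d] - means[d]) / stds[d] for d in range(dim)) for x in samples]


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def separation_ratio(cluster_a, cluster_b):
    """Distance between centroids divided by the mean distance of points to their own centroid."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    within = [math.dist(p, ca) for p in cluster_a] + [math.dist(p, cb) for p in cluster_b]
    return math.dist(ca, cb) / (sum(within) / len(within))


# Two groups separated along the first feature only.
cluster_a = [(0, -1), (0, 1), (1, -1), (1, 1)]
cluster_b = [(10, -1), (10, 1), (11, -1), (11, 1)]

scaled = standardize(cluster_a + cluster_b)
ratio_before = separation_ratio(cluster_a, cluster_b)
ratio_after = separation_ratio(scaled[:4], scaled[4:])
print(ratio_before, ratio_after)
```

For this data the ratio drops from about 8.9 to about 2.0: the feature that carries the separation has the largest spread, so scaling it to unit variance pulls the groups together.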

Criterion functions for clustering
  • The criterion function (to be optimized)
  • We should define a criterion function that
    measures the clustering quality of any partition
    of the data
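  • Clustering then becomes the problem of finding
    the partition that extremizes (maximizes or
    minimizes) the criterion function
  • The Sum-of-Squared-Error Criterion
  • The simplest and most widely used criterion
    function for clustering:
    $$J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} \|\mathbf{x} - \mathbf{m}_i\|^2, \qquad \mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x}$$
    where $D_i$ is the $i$th cluster and $n_i$ is the
    number of samples in it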
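
The sum-of-squared-error criterion is straightforward to compute for any given partition: for each cluster, sum the squared Euclidean distances of its samples to the cluster mean, then add the results over clusters. The sketch below assumes non-empty clusters of tuples; the two partitions of the toy data are illustrative.

```python
def cluster_mean(cluster):
    n = len(cluster)
    return tuple(sum(x[d] for x in cluster) / n for d in range(len(cluster[0])))


def sum_squared_error(clusters):
    """Sum over clusters of squared distances from each sample to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        m = cluster_mean(cluster)
        total += sum(sum((xd - md) ** 2 for xd, md in zip(x, m)) for x in cluster)
    return total


low = [(0, 0), (0, 1), (1, 0), (1, 1)]
high = [(8, 8), (8, 9), (9, 8), (9, 9)]

good_partition = [low, high]                               # keeps each group together
bad_partition = [low[:2] + high[:2], low[2:] + high[2:]]   # mixes the two groups

print(sum_squared_error(good_partition))  # 4.0
print(sum_squared_error(bad_partition))   # 258.0
```

The partition that keeps each compact group intact has a far smaller criterion value, which is what minimizing the criterion rewards.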

Criterion functions for clustering
  • $J_e$ depends on how the samples are grouped
    into clusters and on the number of clusters
    (minimum variance partitions)
  • Problems
  • The presence of outliers or wild shots (or ...)
  • When there are large differences in the number
    of samples in different clusters
  • If additional considerations render the results
    of minimizing $J_e$ unsatisfactory, these
    considerations should be used in formulating a
    better criterion function
