Title: Ch. 10. Unsupervised Learning and Clustering
Introduction
- Supervised vs. unsupervised learning
- Why unsupervised classification?
  - Collecting and labeling a large set of sample patterns can be surprisingly costly
  - One might wish to proceed in the reverse direction: train on a large amount of unlabeled data first, then use supervision to label the groupings found
  - In data mining applications, the contents of a large database are not known beforehand
Why unsupervised?
- In many applications, the characteristics of the patterns can change slowly with time
- To find features that will be useful for categorization
- It may be valuable to perform exploratory data analysis and thereby gain some insight into the nature of the data
Methods for unsupervised classification
- While some of the resulting clustering procedures have no known significant theoretical properties, they are still among the more useful tools for pattern recognition problems
- k-means, ISODATA, hierarchical clustering, etc.
Mixture densities and identifiability
- We assume that:
  - The samples come from a known number c of classes
  - The prior probabilities P(ωj) for each class are known, j = 1, ..., c
  - The forms of the class-conditional probability densities p(x | ωj, θj) are known, for j = 1, ..., c
  - The values of the c parameter vectors θ1, ..., θc are unknown
  - The category labels are unknown
- In other words, we know the complete probability structure for the problem with the sole exception of the values of some parameters
Mixture densities and identifiability
- Our basic goal is to use samples drawn from the mixture density to estimate the unknown parameter vector θ
- Before seeking explicit solutions to this problem, let us ask whether or not it is possible in principle to recover θ from the mixture
  - If there is only one value of θ that can produce the observed values for p(x | θ), then a solution is at least possible in principle
  - If several different values of θ can produce the same values for p(x | θ), then there is no hope of obtaining a unique solution
Mixture densities and identifiability
- A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that p(x | θ) ≠ p(x | θ′)
- The study of unsupervised learning is greatly simplified if we restrict ourselves to identifiable mixtures
- Fortunately, most mixtures of commonly encountered density functions are identifiable
Mixture densities and identifiability
- A simple (discrete) example, written out below
- Suppose that we know for our data that P(x = 1 | θ) = 0.6, and hence that P(x = 0 | θ) = 0.4
- Then we know the function P(x | θ), but we cannot determine θ, and hence cannot extract the component distributions: the mixture is completely unidentifiable
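Assuming the textbook's standard discrete example, a two-component Bernoulli mixture with equal priors, the point is the following:

P(x \mid \theta) = \tfrac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \tfrac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x},
\qquad
P(x = 1 \mid \theta) = \tfrac{1}{2}(\theta_1 + \theta_2).

Knowing P(x = 1 | θ) = 0.6 only fixes θ1 + θ2 = 1.2; any pair of values summing to 1.2 produces the same mixture, so θ1 and θ2 cannot be recovered individually.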
Mixture densities and identifiability
- For the continuous case the problems are less severe, although certain minor difficulties can arise due to the possibility of special cases
- For example, a mixture cannot be uniquely identifiable if P(ω1) = P(ω2), since then θ1 and θ2 are interchangeable without affecting p(x | θ)
Maximum-Likelihood Estimates
- Suppose that we are given a set D = {x1, ..., xn} of n unlabeled samples drawn independently from the mixture density, where the parameter vector θ is fixed but unknown
- The likelihood of the observed samples is the joint density p(D | θ), written out below
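In this notation, the mixture density, the likelihood, and the log-likelihood are:

p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j),
\qquad
p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta),
\qquad
l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta).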
Maximum-Likelihood Estimates
- The maximum-likelihood estimate θ̂ is the value of θ that maximizes p(D | θ)
Maximum-Likelihood Estimates
- The gradient of the log-likelihood l must vanish at the value of θi that maximizes l, i.e., θ̂i must satisfy certain necessary conditions
- The maximum-likelihood estimate of the prior probability of a category is the average, over the entire data set, of the posterior estimates derived from each sample
- The posterior estimates themselves follow from Bayes' theorem
- (These results can be shown; they are written out below)
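Following the textbook's treatment, with P̂(ωi | xk, θ̂) denoting the posterior estimate, the conditions are:

\hat P(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta),
\qquad
\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat\theta_i) = 0,

\hat P(\omega_i \mid x_k, \hat\theta) = \frac{\hat P(\omega_i)\, p(x_k \mid \omega_i, \hat\theta_i)}{\sum_{j=1}^{c} \hat P(\omega_j)\, p(x_k \mid \omega_j, \hat\theta_j)}.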
Application to Normal Mixtures
- It is enlightening to see how these general results apply to the case where the component densities are multivariate normal, p(x | ωi, θi) ~ N(μi, Σi)
- Case 1 (only the means μi unknown): the simplest case; it will be considered in detail because of its pedagogical value
- Case 2 (μi, Σi, and P(ωi) unknown, c known): more realistic
- Case 3 (the number of classes c also unknown): the problem we face on encountering a completely unknown set of data; it cannot be solved by maximum-likelihood methods (Sec. 10.10)
Application to Normal Mixtures
- Case 1: Unknown Mean Vectors
- The only unknown quantities are the mean vectors μi
- The maximum-likelihood estimate μ̂i is a weighted average of the samples; the weight for the kth sample is an estimate of how likely it is that xk belongs to the ith class (see the equation below)
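Concretely, with P̂(ωi | xk, μ̂) computed from Bayes' theorem using the normal component densities, the estimate has the form:

\hat\mu_i = \frac{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\mu)\, x_k}{\sum_{k=1}^{n} \hat P(\omega_i \mid x_k, \hat\mu)}.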
Application to Normal Mixtures
- Case 1: Unknown Mean Vectors (cont.)
- Unfortunately, this equation does not give an explicit estimate of μi, since the weights themselves depend on μ̂
- If we substitute the posterior probabilities (computed from the normal component densities) into the equation, we obtain simultaneous nonlinear equations
- These equations do not have a unique solution, and we must test the solutions we get to find the one that actually maximizes the likelihood
Application to Normal Mixtures
- If we have some way of obtaining a fairly good initial estimate μ̂i(0) for the unknown means, the iterative scheme below can improve the estimates
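A sketch of the iteration, a fixed-point update closely related to the EM algorithm, written in the notation above:

\hat\mu_i(j+1) = \frac{\sum_{k=1}^{n} \hat P\bigl(\omega_i \mid x_k, \hat\mu(j)\bigr)\, x_k}{\sum_{k=1}^{n} \hat P\bigl(\omega_i \mid x_k, \hat\mu(j)\bigr)}.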
Application to Normal Mixtures
- Case 1: Unknown Mean Vectors, an example
- Source density: a two-component, one-dimensional normal mixture with true means μ1 = -2, μ2 = 2
- 25 samples were drawn from this mixture
Application to Normal Mixtures
- The 25 samples are given, with true means μ1 = -2, μ2 = 2
- The maximum of the log-likelihood occurs at μ̂1 = -2.130, μ̂2 = 1.668
- There is another, comparable local maximum of l near μ1 = 2, μ2 = -1.257, corresponding roughly to interchanging the roles of the two components (a sketch of this experiment follows)
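The flavor of this experiment can be reproduced with a short script. This is a minimal sketch, not the book's data: it assumes unit variances, known mixing weights of 1/3 and 2/3, and a fixed random seed, and applies the mean-update iteration shown above.

    import numpy as np

    # Minimal sketch (assumed setup, not the book's exact data): two 1-D normal
    # components with unit variance, known mixing weights 1/3 and 2/3, and only
    # the two means unknown.
    rng = np.random.default_rng(0)
    priors = np.array([1.0 / 3.0, 2.0 / 3.0])
    true_means = np.array([-2.0, 2.0])

    # Draw 25 unlabeled samples from the mixture
    labels = rng.choice(2, size=25, p=priors)
    x = rng.normal(true_means[labels], 1.0)

    def posteriors(x, mu, priors):
        """P_hat(omega_i | x_k, mu_hat) for unit-variance normal components."""
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
        dens *= priors
        return dens / dens.sum(axis=1, keepdims=True)

    mu = np.array([-1.0, 1.0])          # "fairly good" initial estimates
    for _ in range(100):                # fixed-point iteration for the means
        post = posteriors(x, mu, priors)
        mu = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)

    print("estimated means:", mu)       # should land near the true means

With a poor initialization (for example, the two initial means swapped), the same iteration can settle at the second, smaller local maximum noted above.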
Application to Normal Mixtures
- Case 2: All Parameters Unknown
- If μi, Σi, and P(ωi) are all unknown, and if no constraints are placed on the covariance matrices, the maximum-likelihood principle yields useless singular solutions
- A simple example: a one-dimensional, two-component mixture in which one component has unknown mean μ and variance σ², and the other component is held fixed
- The likelihood function for n samples drawn from this density is simply the product of the n mixture densities
- Suppose we make the choice μ = x1; then the density at x1 grows without bound as σ shrinks, while for the rest of the samples the mixture density stays bounded away from zero because of the fixed component
Application to Normal Mixtures
- Case 2: All Parameters Unknown (cont.)
- By letting σ approach zero we can make the likelihood arbitrarily large, so the (global) maximum-likelihood solution is singular (the example is written out below)
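Spelling the example out, assuming as in the textbook a standard normal second component and equal mixing weights:

p(x \mid \mu, \sigma) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\Bigl(\frac{x-\mu}{\sigma}\Bigr)^{2}\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}x^{2}\right].

With \mu = x_1,

p(x_1, \ldots, x_n \mid \mu = x_1, \sigma) \;\ge\; \frac{1}{2\sqrt{2\pi}\,\sigma} \prod_{k=2}^{n} \frac{1}{2\sqrt{2\pi}}\, e^{-x_k^{2}/2} \;\longrightarrow\; \infty \quad \text{as } \sigma \to 0,

so no finite choice of the parameters maximizes the likelihood.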
Application to Normal Mixtures
- Case 2: All Parameters Unknown
- A meaningful solution can be obtained if we look instead at the largest of the finite local maxima of the likelihood function (using Eqs. 11-13)
Application to Normal Mixtures
- Case 2: All Parameters Unknown (cont.)
- By substituting the normal component densities into Eq. 12, we obtain expressions for the local maximum-likelihood estimates of the parameters μ̂i, Σ̂i, and P̂(ωi), shown below
k-means (or c-means) clustering algorithm
- An approximation to the preceding statistical method that simplifies computation and accelerates convergence
- It is clear that P̂(ωi | xk, θ̂) is large when the squared Mahalanobis distance (xk − μ̂i)ᵗ Σ̂i⁻¹ (xk − μ̂i) is small
- Suppose the covariance matrices are diagonal, and approximate P̂(ωi | xk, θ̂) as 1 or 0 according to whether or not μ̂i is the mean nearest to xk
- Then Eq. 25 (the update equation for the mean) leads to the k-means procedure
K-means clustering algorithm
- Begin: initialize n, c, μ1, μ2, ..., μc
-   Do: classify the n samples according to the nearest μi
-       recompute μi
-   Until no change in μi
- Return μ1, μ2, ..., μc
- An approximate procedure for obtaining maximum-likelihood estimates for the means
- Equivalently, an iterative optimization procedure for the minimization of a squared-error criterion function (see the sketch below)
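A minimal runnable sketch of this procedure in NumPy; the initialization with randomly chosen samples and the empty-cluster handling are implementation choices, not part of the pseudocode above.

    import numpy as np

    def k_means(x, c, max_iter=100, seed=None):
        """Plain k-means following the pseudocode above.

        x : (n, d) array of samples; c : number of clusters.
        Returns the c cluster means and the final sample labels.
        """
        rng = np.random.default_rng(seed)
        means = x[rng.choice(len(x), size=c, replace=False)]  # initialize means
        for _ in range(max_iter):
            # Classify the n samples according to the nearest mean
            dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each mean (keep the old one if a cluster is empty)
            new_means = np.array([x[labels == i].mean(axis=0)
                                  if np.any(labels == i) else means[i]
                                  for i in range(c)])
            if np.allclose(new_means, means):  # until no change in the means
                break
            means = new_means
        return means, labels

Each pass first reassigns samples and then recomputes the means, so the squared-error criterion never increases from one iteration to the next.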
K-means clustering algorithm (figures)
Application to normal mixture (figure)
Cluster validity measure (figure)
Data description and clustering
- Unsupervised classification (clustering): learning the structure of multidimensional patterns from a set of unlabeled samples
- These samples may form clouds of points in a d-dimensional space
- Suppose that we knew that these points came from a single normal distribution
  - The most we could learn from the data would be contained in the sufficient statistics: the sample mean and the sample covariance matrix
  - The sample mean: the center of gravity of the cloud
  - The sample covariance matrix: how much the data scatter along various directions around the mean
- However, if a normal distribution cannot be assumed, these statistics can give a very misleading description of the data
Data description and clustering
- These four data sets (see figure) are identical up to second-order statistics, yet differ markedly in structure
- Again: if a normal distribution cannot be assumed, these statistics can give a very misleading description of the data
Data description and clustering
- If we assume that the samples come from a mixture of c normal distributions, we can approximate a greater variety of situations
- Clustering procedures yield a data description in terms of clusters, or groups of data points, that possess strong internal similarities
- This requires a proper criterion function (similarity measure)
Data description and clustering
- Similarity Measures
- Once the clustering problem is described as one of finding natural groupings in a set of data, we are obliged to define what we mean by a natural grouping:
  - How should one measure the similarity between samples?
  - How should one evaluate a partitioning of a set of samples into clusters?
Data description and clustering
- The most obvious measure of the similarity between two samples is the distance between them
- This requires a suitable metric and the computation of the matrix of distances between all pairs of samples
- One would expect the distance between samples in the same cluster to be much smaller than the distance between samples in different clusters
Data description and clustering (figure)
Similarity measures: Euclidean distance
- The results of clustering depend on the choice of Euclidean distance as the measure of dissimilarity
- Clusters defined by Euclidean distance will be invariant to translations or rotations in feature space
- But they will not be invariant to linear transformations in general
- To achieve invariance, the data can be normalized before clustering, as in the sketch below
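A minimal sketch of the usual normalization, translating and scaling each feature to zero mean and unit variance (the next slide cautions that this is not always desirable):

    import numpy as np

    def standardize(x):
        """Translate and scale each feature of x (an (n, d) array) to zero mean
        and unit variance before clustering, so that no single feature's scale
        dominates the Euclidean distances."""
        mu = x.mean(axis=0)
        sigma = x.std(axis=0)
        sigma[sigma == 0] = 1.0   # leave constant features unchanged
        return (x - mu) / sigma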
Data description and clustering
- Similarity Measures (cont.)
- Note that this kind of normalization is not necessarily desirable
- Example: translating and scaling the axes so that each feature has zero mean and unit variance
- Routine normalization may be less than helpful in the cases of greatest interest, since it can reduce the separation between clusters (as the figure illustrates)
Criterion functions for clustering
- The criterion function (to be optimized)
  - We should define a criterion function that measures the clustering quality of any partition of the data
  - We then seek the partition that extremizes (maximizes or minimizes) the criterion function
- The sum-of-squared-error criterion: the simplest and most widely used criterion function for clustering (defined below)
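In the usual notation, with Di the set of samples assigned to the ith cluster, ni their number, and mi their mean, the criterion is:

J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^{2},
\qquad
m_i = \frac{1}{n_i} \sum_{x \in D_i} x.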
Criterion functions for clustering
- Je is determined by how the samples are grouped into clusters and by the number of clusters; minimizing it yields minimum-variance partitions
- Problems:
  - The presence of outliers ("wild shots")
  - Large differences in the number of samples in different clusters
- If additional considerations render the results of minimizing Je unsatisfactory, these considerations should be used in formulating a better criterion function