Title: Slide 1
1 Machine Learning
Expectation Maximization (EM) Algorithm; Clustering (unsupervised learning)
2 Expectation Maximization Algorithm
Task: Given a set of observations $(x_n)_{n=1,\dots,N}$, learn a mixture distribution
$$p(x \mid \theta) = \sum_{j=1}^{k} P(j)\, p_j(x \mid \theta_j).$$
Here $\theta$ is the set of all parameters of the mixture distribution, and $\theta_j$ are the parameters of the $j$-th component of the mixture.
[Figure: two mixture components, labelled $p_1(x \mid \theta_1)$ and $p_2(x \mid \theta_2)$]
Remark: Learning mixture distributions is useful, e.g., when one wants to classify. If the a priori class probabilities $P(j)$ and the parameters $\theta_j$, i.e. the components $p_j(x \mid \theta_j)$, are known, then one can classify optimally according to the Bayes classifier. We already encountered this problem in the first lecture (learning the joint length distribution of sea bass versus salmon); in the present situation, however, the class labels are unknown.
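To make the remark concrete, here is a minimal sketch of such a Bayes decision rule with known components. Only the rule "maximize $P(j)\,p_j(x \mid \theta_j)$" comes from the slide; the Gaussian component family and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D example in the spirit of the "sea bass vs. salmon" lecture:
# two classes with known priors P(j) and known component densities p_j(x | theta_j).
priors = np.array([0.6, 0.4])                 # P(1), P(2)  (made-up values)
components = [norm(loc=30.0, scale=4.0),      # p_1(x | theta_1)
              norm(loc=45.0, scale=5.0)]      # p_2(x | theta_2)

def bayes_classify(x):
    """Return the class j maximizing P(j) * p_j(x | theta_j)."""
    scores = np.array([P * comp.pdf(x) for P, comp in zip(priors, components)])
    return int(np.argmax(scores))

print(bayes_classify(33.0))  # -> 0 (first class)
print(bayes_classify(48.0))  # -> 1 (second class)
```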
3 Expectation Maximization Algorithm
The log-likelihood is
$$\mathcal{L}(\theta) = \sum_{n=1}^{N} \ln p(x_n \mid \theta) = \sum_{n=1}^{N} \ln \sum_{j=1}^{k} P(j)\, p_j(x_n \mid \theta_j).$$
Introduce new random variables $Z = (z_n)_{n=1,\dots,N}$ that indicate the class membership of observation $x_n$. If the $z_n$ are known, the likelihood becomes
$$\ln p(X, Z \mid \theta) = \sum_{n=1}^{N} \ln\bigl( P(z_n)\, p_{z_n}(x_n \mid \theta_{z_n}) \bigr).$$
To evaluate this expression we need the parameters $\theta$ and the probabilities $P(z_n)$.
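As a small illustration, the following sketch evaluates the incomplete-data log-likelihood $\sum_n \ln \sum_j P(j)\, p_j(x_n \mid \theta_j)$ for a one-dimensional Gaussian mixture; the Gaussian component family and all names are assumptions made only for this example.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, weights, means, stds):
    """Sum over n of log( sum over j of P(j) * p_j(x_n | theta_j) )
    for a 1-D Gaussian mixture (illustrative choice of component family)."""
    # densities[n, j] = p_j(x_n | theta_j)
    densities = np.stack([norm.pdf(x, loc=m, scale=s)
                          for m, s in zip(means, stds)], axis=1)
    return np.sum(np.log(densities @ weights))

x = np.array([1.2, 0.7, 3.9, 4.1])
print(log_likelihood(x, weights=np.array([0.5, 0.5]),
                     means=[1.0, 4.0], stds=[0.5, 0.5]))
```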
4 Expectation Maximization Algorithm
5 Expectation Step
Start with the old parameters $\theta_j^{\text{old}}$ and $P^{\text{old}}(j)$ and compute the a posteriori probabilities for the class memberships,
$$P(j \mid x_n) = \frac{P^{\text{old}}(j)\, p_j(x_n \mid \theta_j^{\text{old}})}{\sum_{l} P^{\text{old}}(l)\, p_l(x_n \mid \theta_l^{\text{old}})}.$$
The expectation of the complete-data log-likelihood with respect to $Z$ is then the weighted sum over the classes, with these posterior probabilities as weights.
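A minimal sketch of this E-step for a one-dimensional Gaussian mixture; the Gaussian family is an illustrative assumption, while the formula for the responsibilities is the Bayes rule stated above.

```python
import numpy as np
from scipy.stats import norm

def e_step(x, weights_old, means_old, stds_old):
    """A posteriori class probabilities ("responsibilities"):
    P(j | x_n) = P_old(j) p_j(x_n | theta_j_old) / sum_l P_old(l) p_l(x_n | theta_l_old)."""
    dens = np.stack([norm.pdf(x, loc=m, scale=s)
                     for m, s in zip(means_old, stds_old)], axis=1)   # shape (N, k)
    unnorm = dens * weights_old                   # P_old(j) * p_j(x_n | theta_j_old)
    return unnorm / unnorm.sum(axis=1, keepdims=True)                 # rows sum to 1
```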
6 Expectation Step
Substituting the posterior probabilities yields the expected complete-data log-likelihood
$$Q(\theta^{\text{new}}, \theta^{\text{old}}) = \sum_{n=1}^{N} \sum_{j=1}^{k} P(j \mid x_n)\, \ln\bigl( P^{\text{new}}(j)\, p_j(x_n \mid \theta_j^{\text{new}}) \bigr).$$
This expression is to be maximized with respect to $\theta_j^{\text{new}}$ and $P^{\text{new}}(j)$. Because of the form of the expression (the logarithm splits it into a term depending only on the $P^{\text{new}}(j)$ and a term depending only on the $\theta_j^{\text{new}}$), the two maximizations can be carried out separately.
7 Maximization Step
8 Maximization Step
9 Maximization Step
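The M-step formulas of slides 7-9 are not contained in the extracted text. For orientation, the following sketch shows the standard closed-form updates for a one-dimensional Gaussian mixture, which is what these formulas typically amount to; treat the Gaussian assumption and all names as illustrative, not as a copy of the slides.

```python
import numpy as np

def m_step(x, resp):
    """Closed-form M-step for a 1-D Gaussian mixture: re-estimate
    P_new(j), mu_new_j and sigma_new_j from the responsibilities resp[n, j]."""
    N_j = resp.sum(axis=0)                                   # effective points per component
    weights_new = N_j / len(x)                               # P_new(j)
    means_new = (resp * x[:, None]).sum(axis=0) / N_j        # responsibility-weighted means
    var_new = (resp * (x[:, None] - means_new) ** 2).sum(axis=0) / N_j
    return weights_new, means_new, np.sqrt(var_new)
```

Together with the `e_step` sketch above, one EM iteration alternates `resp = e_step(x, weights, means, stds)` and `weights, means, stds = m_step(x, resp)` until the log-likelihood stops increasing.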
10 Clustering
Partitioning methods: These usually require the specification of the number of clusters; a mechanism for apportioning objects to clusters must then be determined.
- Advantage: provides clusters that (approximately) satisfy an optimality criterion.
- Disadvantage: needs an initial K; long computation time.
Hierarchical methods: These methods provide a hierarchy of clusters, from the coarsest level, where all objects lie in a single cluster, to the finest level, where each observation forms its own cluster.
- Advantage: fast computation (agglomerative clustering).
- Disadvantage: rigid; cannot correct later for erroneous decisions made earlier.
11 K-means Clustering
12 K-means Clustering
13 Partitioning Methods
If we measure the quality of an estimator by the expected prediction error under the zero-one loss function (which is standard for classification), the optimal classifier is the Bayes classifier (see the chapter on classification); with equal class priors it takes the form
$$C(x) = \arg\max_j P_j(x).$$
Of course, we do not know the probability distributions $P_j$ for the classes $C_j$, but we can make some sensible assumptions about them. Let the predictor space $\mathbb{R}^d$ be equipped with some distance measure $d(a,b)$, and let $P_j$ be a unimodal distribution that is symmetric around its mode $\mu_j$, i.e. there is a non-increasing function $p_j: [0,\infty) \to \mathbb{R}$ such that $P_j(x) = p_j(d(x, \mu_j))$. If we assume in addition that all $P_j$ have the same shape, i.e. $p_1 = \dots = p_k = p$, then
$$P_j(x) = p(d(x, \mu_j)) \quad \text{for all } j,$$
and consequently
$$\arg\max_j P_j(x) = \arg\min_j d(x, \mu_j).$$
In other words, our considerations lead to a very simple classification rule: given $x$, find the class $C_j$ whose mode $\mu_j$ is nearest to $x$, and classify $x$ as $C_j$. Note that this rule remains unchanged for different choices of the function $p$ (which determines the shape of the distributions $P_j$)!
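The nearest-mode rule can be written down directly; a minimal sketch, with the Euclidean distance as an illustrative choice of $d$ and made-up modes:

```python
import numpy as np

def nearest_mode_classify(x, modes):
    """Assign x to the class C_j whose mode mu_j is nearest to x
    (Euclidean distance as an illustrative choice of d)."""
    modes = np.asarray(modes)                            # shape (k, d)
    dists = np.linalg.norm(modes - np.asarray(x), axis=1)
    return int(np.argmin(dists))

modes = [[0.0, 0.0], [5.0, 5.0], [0.0, 8.0]]             # hypothetical modes mu_1..mu_3
print(nearest_mode_classify([4.2, 4.8], modes))          # -> 1
```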
14 Partitioning Methods
Still, we cannot classify $x$, since we do not know the modes $\mu_j$. Under our model assumptions, the set of parameters $\mu = (\mu_1, \dots, \mu_k)$ completely determines the probability distribution $P$ as well as the Bayes classifier $C$, and we can write down the likelihood of observing the data $D = \{(x_j, C(x_j)) : j = 1, \dots, n\}$ given $\mu$. We would like to find the maximum likelihood estimator for $\mu$.
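The likelihood formulas on this slide were not preserved in the extracted text. Under the assumptions above (independent observations, common shape $p$, equal class priors), they presumably take roughly the following form; this is a reconstruction, not a verbatim copy of the slide:

```latex
P(D \mid \mu) \;=\; \prod_{j=1}^{n} p\bigl(d(x_j,\, \mu_{C(x_j)})\bigr),
\qquad
\hat{\mu} \;=\; \arg\max_{\mu}\; P(D \mid \mu).
```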
15 Partitioning Methods
(update labels)
(update centres)
It can be shown that the sequence $\mu^{(T)}$, $T = 0, 1, 2, \dots$ converges (exercise!) and the process stops. Since the corresponding sequence $P(D \mid \mu^{(T)})$, $T = 0, 1, \dots$ is monotonically increasing and bounded by 1, it necessarily converges to a local maximum of $P(D \mid \mu)$. BUT: this strategy does not guarantee that a global maximum is found!
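A minimal sketch of this alternating procedure; the Euclidean distance and the arithmetic mean as the centre are illustrative choices (anticipating the next slides), and all names are made up for the example.

```python
import numpy as np

def alternating_estimation(X, k, n_iter=100, seed=0):
    """Alternate the two steps above until the labels stop changing:
    (update labels)  assign each point to its nearest current centre mu_j,
    (update centres) recompute each mu_j from the points currently labelled C_j."""
    X = np.asarray(X, dtype=float)                 # shape (N, d)
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # (update labels)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # labels unchanged: the process has stopped
        labels = new_labels
        # (update centres)
        for j in range(k):
            if np.any(labels == j):                # keep the old centre if a cluster goes empty
                centres[j] = X[labels == j].mean(axis=0)
    return labels, centres
```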
16 K-means Clustering, Example
What happens if we take $p(x) = c \exp\bigl(-x^2 / (2\sigma^2)\bigr)$ for some $\sigma$ and an appropriate normalizing constant $c$?
17 K-means Clustering, Example
Remember that in one of the early lectures, the term on the right-hand side was used to define the centre of all the points $x_n$ that are labelled $C_j$. For $d$ the Euclidean distance, we proved that this centre equals the arithmetic mean of the points involved (we proved this for one-dimensional data points $x_n$, but it holds in higher dimensions as well). So, letting $d$ be the Euclidean distance, the procedure for the estimation of $\mu$ becomes the so-called k-means algorithm.
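In practice one rarely codes this loop by hand; a typical call with scikit-learn (one possible library choice, not prescribed by the slides) looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three clouds of 2-D points (illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 4])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # estimated centres (arithmetic means of the clusters)
print(km.labels_[:10])       # cluster assignment of the first ten points
```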
18 K-means Clustering, Example
Gene expression data
- Gene expression data on p genes (variables) for n mRNA samples (observations).
- Data matrix: samples i = 1, ..., n; genes j = 1, ..., p; the (i, j) entry is the gene expression level of gene j in mRNA sample i.
Task: Find interesting clusters of genes, i.e. genes with similar behaviour across samples.
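A sketch of how such an expression matrix might be clustered with k-means. The random matrix, the choice of five clusters and the library are assumptions made only for the example; the point is that clustering genes across samples means clustering the columns, i.e. the transposed matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical expression matrix: rows = n mRNA samples, columns = p genes;
# X[i, j] = expression level of gene j in sample i (random numbers stand in for real data).
n_samples, n_genes = 20, 200
X = np.random.default_rng(1).normal(size=(n_samples, n_genes))

# Cluster the columns (each gene is a point in IR^n) to group genes
# with similar behaviour across samples.
gene_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X.T)
print(np.bincount(gene_labels))   # number of genes in each of the 5 clusters
```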
19 K-means Clustering, Example
(taken from Silicon Genetics)
20 K-means Clustering, Example
21 K-means Clustering
[Figure: "Raw data" vs. "Features were first standardized"]
Giving all attributes equal influence (standardization) can obscure well-separated groups.
22 K-means Clustering
Advantages of using k-means:
- With a large number of variables, k-means may be computationally faster than hierarchical clustering (if k is small).
- k-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages of using k-means:
- It is difficult to compare the quality of the clusters (e.g. different initial partitions or values of k affect the outcome).
- The fixed number of clusters makes it difficult to predict what k should be.
- It does not work well with non-globular clusters.
- Different initial partitions can result in different final clusters. It is helpful to rerun the program with the same as well as different values of K and compare the results (see the sketch below).
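The last point, rerunning with different values of K and comparing, might be sketched as follows, using the within-cluster sum of squares (`inertia_` in scikit-learn) as a rough comparison criterion; the criterion and the toy data are assumptions, not prescribed by the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [4, 0], [2, 3])])    # toy data with three visible groups

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # within-cluster sum of squares; look for an "elbow"
```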
23 Hierarchical Clustering, Dendrograms
Hierarchical methods produce a tree structure, a so-called dendrogram, as their result. No number of clusters is fixed in advance.
There are two principal approaches to generating the clusters: divisive and agglomerative methods.
24 Hierarchical Clustering, Dendrograms
25 Hierarchical Clustering, Agglomerative Methods
Let d(G, H) be a function that maps any two subsets G, H of the set of all data points to a non-negative real value. Think of d as a distance measure for sets of data points. The algorithm for agglomerative clustering is then: start with S(0), the partition into n singleton clusters, and in step j+1 merge the two clusters G, H in S(j) with minimal d(G, H) to obtain S(j+1).
It is easy to see that S(j) is a partition of the data points that consists of n − j sets. Hence, if we want to obtain a clustering with k classes, the partition S(n−k) provides a classification of the data points into k mutually disjoint classes.
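A naive (unoptimized) implementation of this scheme; the single-linkage choice for d(G, H) and all names are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def agglomerative(X, k, set_dist=None):
    """Start with n singleton clusters and repeatedly merge the pair G, H
    with the smallest d(G, H) until only k clusters remain (S(n-k) on the slide)."""
    X = np.asarray(X, dtype=float)
    if set_dist is None:
        # single linkage: minimal pairwise distance between the two sets
        set_dist = lambda G, H: min(np.linalg.norm(X[g] - X[h]) for g in G for h in H)
    clusters = [[i] for i in range(len(X))]        # S(0): every point is its own cluster
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: set_dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair of clusters
        del clusters[b]
    return clusters

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]
print(agglomerative(X, k=3))   # -> [[0, 1, 2], [3, 4], [5]]
```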
26 Hierarchical Clustering, Agglomerative Methods
Single linkage
- The distance between two clusters is the minimal distance between two objects, one from each cluster.
- Single linkage only requires that a single dissimilarity be small for two groups G and H to be considered close together, irrespective of the other observation dissimilarities between the groups.
- It will therefore have a tendency to combine, at relatively low thresholds, observations linked by a series of close intermediate observations ("chaining").
- Disadvantage: The clusters produced by single linkage can violate the compactness property, namely that all observations within each cluster tend to be similar to one another, based on the supplied observation dissimilarities.
27 Hierarchical Clustering, Agglomerative Methods
Complete linkage
- The distance between two clusters is the maximum of the distances between two objects, one from each cluster.
- Two groups G and H are considered close only if all of the observations in their union are relatively similar.
- It tends to produce compact clusters with small diameters, but can produce clusters that violate the closeness property.
28 Hierarchical Clustering, Agglomerative Methods
Average linkage
- The distance between two clusters is the average of the pairwise distances between members of the two clusters.
- It represents a compromise between the two extremes of single and complete linkage.
- It produces relatively compact clusters that are relatively far apart.
- Disadvantage: its results depend on the numerical scale on which the observation dissimilarities are measured.
Centroid linkage
- The distance between two clusters is the distance between their centroids.
(A code sketch of these four set distances follows below.)
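The four set distances described above can be written down directly; a sketch assuming Euclidean distances between individual points.

```python
import numpy as np

def pairwise(G, H):
    """All Euclidean distances between points of cluster G and points of cluster H."""
    G, H = np.asarray(G, dtype=float), np.asarray(H, dtype=float)
    return np.linalg.norm(G[:, None, :] - H[None, :, :], axis=2)

def single_linkage(G, H):   return pairwise(G, H).min()    # closest pair
def complete_linkage(G, H): return pairwise(G, H).max()    # farthest pair
def average_linkage(G, H):  return pairwise(G, H).mean()   # average over all pairs

def centroid_linkage(G, H):                                 # distance between the centroids
    return np.linalg.norm(np.mean(G, axis=0) - np.mean(H, axis=0))

G = [[0.0, 0.0], [1.0, 0.0]]
H = [[4.0, 0.0], [6.0, 0.0]]
print(single_linkage(G, H), complete_linkage(G, H),
      average_linkage(G, H), centroid_linkage(G, H))        # 3.0 6.0 4.5 4.5
```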
29 Hierarchical Clustering, Agglomerative Methods
30 Example: Two-way Hierarchical Clustering
- Clustering of samples across genes: find groups of similar samples.
- Clustering of genes across samples: find groups of similar genes.
(from Eisen, Spellman, Botstein et al.: yeast compendium data)
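Two-way clustering in this sense simply applies a hierarchical clustering once to the rows and once to the columns of the expression matrix. A sketch with SciPy; the library, the average-linkage choice and the random data are assumptions made only for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical expression matrix: rows = mRNA samples, columns = genes
# (random numbers stand in for real measurements).
X = np.random.default_rng(3).normal(size=(12, 100))

sample_order = leaves_list(linkage(X,   method='average'))  # cluster samples across genes
gene_order   = leaves_list(linkage(X.T, method='average'))  # cluster genes across samples

# Reordering rows and columns by the two dendrogram leaf orders
# gives the familiar two-way heat-map layout.
X_reordered = X[np.ix_(sample_order, gene_order)]
```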
31 Clustering Packages in R