Title: Three examples of the EM algorithm
1. Three examples of the EM algorithm
- Week 12, Lecture 1
- Statistics 246, Spring 2002
2. The estimation of linkage from the offspring of selfed heterozygotes
R. A. Fisher and Bhai Balmukand, Journal of Genetics 20: 79-92 (1928). See also Collected Papers of R. A. Fisher, Volume II, Paper 71, pp. 285-298.
3. The problem
- In modern terminology, we have two linked bi-allelic loci, A and B say, with alleles A and a, and B and b, respectively, where A is dominant over a and B is dominant over b.
- A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB and ab.
- Since the loci are linked, the types AB and ab will appear with a frequency different from that of Ab and aB, say 1-r and r, respectively, in males, and 1-r' and r', respectively, in females.
- Here we suppose that the parental origin of these heterozygotes is from the mating AABB × aabb, so that r and r' are the male and female recombination rates between the two loci.
- The problem is to estimate r and r', if possible, from the offspring of selfed double heterozygotes.
4. Offspring genotypic and phenotypic frequencies
- Since gametes AB, Ab, aB and ab are produced in proportions (1-r)/2, r/2, r/2 and (1-r)/2, respectively, by the male parent, and (1-r')/2, r'/2, r'/2 and (1-r')/2, respectively, by the female parent, zygotes with genotypes AABB, AaBB, etc., are produced with frequencies (1-r)(1-r')/4, (1-r)r'/4, etc.
- Exercise: Complete the Punnett square of offspring genotypes and their associated frequencies.
- The problem here is this: although there are 16 distinct offspring genotypes, taking parental origin into account, the dominance relations imply that we only observe 4 distinct phenotypes, which we denote by AB, Ab, aB and ab. Here A (resp. B) denotes the dominant, while a (resp. b) denotes the recessive phenotype determined by the alleles at locus A (resp. B).
5. Offspring genotypic and phenotypic probabilities, cont.
- Thus individuals with genotypes AABB, AaBB, AABb or AaBb, which account for 9/16 of the gametic combinations (check!), all exhibit the phenotype AB, i.e. the dominant alternative in both characters, while those with genotypes AAbb or Aabb (3/16) exhibit the phenotype Ab, those with genotypes aaBB and aaBb (3/16) exhibit the phenotype aB, and finally the double recessives aabb (1/16) exhibit the phenotype ab.
- It is a slightly surprising fact that the probabilities of the four phenotypic classes are definable in terms of the single parameter θ = (1-r)(1-r'), as follows: ab has probability θ/4 (easy to see), aB and Ab both have probability (1-θ)/4, while AB has probability 1 minus the sum of the preceding, which is (2+θ)/4.
- Exercise: Calculate these phenotypic probabilities.
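- As a quick check of these claims, here is a minimal Python sketch (ours, not from Fisher and Balmukand) that enumerates the 16 cells of the Punnett square for arbitrary illustrative values of r and r', and compares the resulting phenotype probabilities with (2+θ)/4, (1-θ)/4, (1-θ)/4 and θ/4.

    from itertools import product

    def phenotype_probs(r, rp):
        """Enumerate the Punnett square for male rate r and female rate rp."""
        male = {"AB": (1 - r) / 2, "Ab": r / 2, "aB": r / 2, "ab": (1 - r) / 2}
        female = {"AB": (1 - rp) / 2, "Ab": rp / 2, "aB": rp / 2, "ab": (1 - rp) / 2}
        probs = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
        for (gm, pm), (gf, pf) in product(male.items(), female.items()):
            # a locus shows the dominant phenotype unless both gametes carry the recessive allele
            pheno = ("A" if "A" in gm + gf else "a") + ("B" if "B" in gm + gf else "b")
            probs[pheno] += pm * pf
        return probs

    r, rp = 0.1, 0.2                      # illustrative values only
    theta = (1 - r) * (1 - rp)
    print(phenotype_probs(r, rp))
    print({"AB": (2 + theta) / 4, "Ab": (1 - theta) / 4,
           "aB": (1 - theta) / 4, "ab": theta / 4})   # the two dictionaries agree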
6. Estimation of θ
- Now suppose we have a random sample of n offspring from the selfing of our double heterozygote. Thus the 4 phenotypic classes will be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial:
  Mult(n; (2+θ)/4, (1-θ)/4, (1-θ)/4, θ/4).
- Note that here neither r nor r' will be separately estimable from these data, but only the product (1-r)(1-r'). Note also that since we know that r ≤ 1/2 and r' ≤ 1/2, it follows that θ ≥ 1/4.
- How do we estimate θ? Fisher and Balmukand discuss a variety of methods that were in the literature at the time they wrote, and compare them with maximum likelihood, which is the method of choice in problems like this. We describe a variant on their approach to illustrate the EM algorithm.
7. The incomplete data formulation
- Let us denote (cf. p. 26 of Week 11b) the counts of the 4 phenotypic classes by y1, y2, y3 and y4, these having probabilities (2+θ)/4, (1-θ)/4, (1-θ)/4 and θ/4, respectively. Now the probability of observing the genotype AABB is θ/4, just as it is for aabb, and although this genotype is phenotypically indistinguishable from the 8 others with phenotype AB, it is convenient to imagine that we can distinguish them. So let us denote their count by x2, and let x1 denote the count of the remainder of that class, so that x1 + x2 = y1. Note that x1 has marginal probability 1/2, and x2 has marginal probability θ/4. In the jargon of the EM algorithm, x1 and x2 are missing data, as we only observe their sum y1. Next, as on p. 26 of Week 11b, we let y2 = x3, y3 = x4 and y4 = x5. We now illustrate the approach of the EM algorithm, referring to material in Week 9b and Week 11b for generalities.
8. The EM algorithm for this problem
- The complete data log likelihood at θ is (Exercise: Check), up to an additive constant,
  (x2 + x5) log θ + (x3 + x4) log(1-θ).
- The expected value of the complete data log likelihood given the observed data, taken at θ' (E-step: think of θ' as θ-initial), is
  (Eθ'(x2 | y1) + y4) log θ + (y2 + y3) log(1-θ).
- Now Eθ'(x2 | y1) is just ky1, where k = θ'/(2+θ'). (Exercise: Check.)
- The maximum over θ of the expected value of the complete data log likelihood given the observed data taken at θ' (M-step) occurs at
  θ'' = (ky1 + y4)/(ky1 + y2 + y3 + y4). (Exercise: Check.)
  Here we think of θ'' as θ-next.
- It should now be clear how the E-step (calculation of k) and the M-step (calculation of θ'') can be iterated.
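- To make the iteration concrete, here is a minimal Python sketch of the E- and M-steps just described; the function name and the phenotype counts below are ours and purely illustrative.

    def em_theta(y, theta=0.5, n_iter=20):
        """EM iteration for theta = (1-r)(1-r'), given phenotype counts y = (y1, y2, y3, y4)."""
        y1, y2, y3, y4 = y
        for _ in range(n_iter):
            k = theta / (2.0 + theta)                        # E-step: E(x2 | y1) = k * y1
            theta = (k * y1 + y4) / (k * y1 + y2 + y3 + y4)  # M-step: theta-next
        return theta

    print(em_theta((125, 18, 20, 34)))   # made-up counts; converges to about 0.627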
9. Comments on this example
- This completes our discussion of this example. It appeared in the famous EM paper (Dempster, Laird and Rubin, JRSSB 1977) without any explanation of its genetic origins. Of course it is an illustration of the EM only, for the likelihood equation generated by the observed data alone is a quadratic, and so easy to solve (see Fisher & Balmukand). Thus it is not necessary to use the EM in this case (some would say in any case, but that is for another time). We have omitted all of the fascinating detail provided in Fisher and Balmukand, and similarly in Dempster et al. Read these papers: both are classics, with much of interest to you. Rather than talk about details concerning the EM (most importantly, starting and stopping it, the issue of a global maximum, and SEs for parameter estimates), I turn to another important EM example: mixtures.
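- For comparison, a sketch of the direct approach: under the multinomial model above, the observed-data log likelihood is y1 log(2+θ) + (y2+y3) log(1-θ) + y4 log θ + const, and setting its derivative to zero gives the quadratic n θ^2 - (y1 - 2(y2+y3) - y4) θ - 2 y4 = 0, with n = y1+y2+y3+y4 (this derivation is ours; check it as an exercise). Solving it directly reproduces the EM limit.

    import math

    def theta_direct(y):
        """Positive root of the quadratic likelihood equation for theta."""
        y1, y2, y3, y4 = y
        n = y1 + y2 + y3 + y4
        b = -(y1 - 2 * (y2 + y3) - y4)
        c = -2.0 * y4
        return (-b + math.sqrt(b * b - 4 * n * c)) / (2 * n)   # the positive root

    print(theta_direct((125, 18, 20, 34)))   # about 0.627, matching em_theta above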
10. Fitting a mixture model by EM to discover motifs in biopolymers
T. L. Bailey and C. Elkan, UCSD Technical Report CS94-351; ISMB 1994.
- Here we outline some more EM theory, this being relevant to motif discovery. We follow the above report, as we will be discussing the program MEME written by these authors in a later lecture. This part is called MM.
- A finite mixture model supposes that data X = (X1,…,Xn) arise from two or more groups whose distributions have known form but different, unknown parameters θ = (θ1,…,θg), where g is the number of groups, and mixing parameters λ = (λ1,…,λg), where the λj are non-negative and sum to 1.
- It is convenient to introduce indicator vectors Z = (Z1,…,Zn), where Zi = (Zi1,…,Zig), and Zij = 1 if Xi is from group j, and 0 otherwise. Thus Zi gives the group membership of the ith sample. It follows that pr(Zij = 1 | θ, λ) = λj. For any given i, all the Zij are 0 apart from one.
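- A small simulation sketch may help fix the notation; the choice of two univariate normal components below is ours, purely for illustration (the motif application later uses multinomials over letters).

    import numpy as np

    rng = np.random.default_rng(0)
    n, g = 200, 2
    lam = np.array([0.3, 0.7])       # mixing parameters: non-negative, summing to 1
    means = np.array([0.0, 3.0])     # hypothetical group parameters theta_1, theta_2

    Z = rng.multinomial(1, lam, size=n)             # Z_i = (Z_i1, ..., Z_ig), exactly one entry is 1
    groups = Z.argmax(axis=1)
    X = rng.normal(loc=means[groups], scale=1.0)    # X_i drawn from its group's distribution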
11. Complete data log likelihood
- Under the assumption that the pairs (Zi, Xi) are mutually independent, their joint density may be written (Exercise: Carry out this calculation in detail.)
  pr(Z, X | θ, λ) = Πi Πj [λj pr(Xi | θj)]^Zij.
- The complete-data log likelihood is thus
  log L(θ, λ | Z, X) = Σi Σj Zij log[λj pr(Xi | θj)].
- The EM algorithm iteratively computes the expectation of this quantity given the observed data X and initial estimates λ' and θ' of λ and θ (the E-step), and then maximizes the result in the free variables λ and θ, leading to new estimates λ'' and θ'' (the M-step). Our interest here is in the particular calculations necessary to carry out these two steps.
12. Mixture models: the E-step
- Since the log likelihood is the sum over i and j of terms multiplying Zij, and these are independent across i, we need only consider the expectation of one such term, given Xi. Using the initial parameter values λ' and θ', and the fact that the Zij are binary, we get
  E(Zij | X, λ', θ') = λ'j pr(Xi | θ'j) / Σk λ'k pr(Xi | θ'k) = Z'ij, say.
- Exercise: Obtain this result.
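- In code, the E-step is just a normalization of weighted component densities; the sketch below continues the normal-component illustration above (unit variance assumed), so the densities are ours, not MEME's.

    import numpy as np

    def e_step(X, lam, means):
        """Return the n x g matrix of responsibilities Z'_ij = E(Z_ij | X, lam', theta')."""
        dens = np.column_stack([np.exp(-0.5 * (X - m) ** 2) / np.sqrt(2 * np.pi)
                                for m in means])         # pr(X_i | theta'_j), unit-variance normals
        num = lam * dens                                  # lam'_j * pr(X_i | theta'_j)
        return num / num.sum(axis=1, keepdims=True)       # divide by the sum over k

    X = np.array([-0.2, 0.1, 2.8, 3.5])                   # made-up data
    Zp = e_step(X, lam=np.array([0.5, 0.5]), means=np.array([0.0, 3.0]))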
13. Mixture models: the M-step
- Here our task is to maximize the result of an E-step:
  Σi Σj Z'ij log λj + Σi Σj Z'ij log pr(Xi | θj).
- The maximization over λ is clearly independent of the rest and is readily seen to be achieved (Exercise: check this) with
  λ''j = Σi Z'ij / n.
- Maximizing over θ requires that we specify the model in more detail. The case of interest to us is where g = 2, and the distributions for class 1 (the motif) and class 2 (the background) are given by position-specific multinomial distributions and a general multinomial distribution, respectively.
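- The λ-update is a one-liner; Zp below stands for the matrix of responsibilities Z'_ij from the E-step sketch above (the numbers are made up).

    import numpy as np

    def m_step_lambda(Zp):
        """lambda''_j = (sum over i of Z'_ij) / n, one value per group."""
        return Zp.sum(axis=0) / Zp.shape[0]

    Zp = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
    print(m_step_lambda(Zp))     # array([0.5, 0.5])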
14. Mixture models: the M-step, cont.
- Our initial observations are supposed to consist of N sequences over an L-letter alphabet. Unlike what we did in the last lecture, these sequences are now broken up into all n overlapping subsequences of length w, and these are our Xi. We need w+1 sets of probabilities, namely fjk and f0k, where j = 1,…,w (w being the length of the motif) and k runs over the symbol alphabet.
- With these parameters, we can write
  pr(Xi | θ1) = Πj Πk fjk^I(k,Xij), and pr(Xi | θ2) = Πj Πk f0k^I(k,Xij),
  where Xij is the letter in the jth position of sample i, and I(k,a) = 1 if a = ak, and 0 otherwise. With this notation, for k = 1,…,L, write
  c0k = Σi Σj Z'i2 I(k,Xij), and cjk = Σi Z'i1 I(k,Xij).
- Here c0k is the expected number of times letter ak appears in the background, and cjk the expected number of times ak appears at position j of occurrences of the motif in the data.
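- A sketch of these two densities and of the expected counts, for a 4-letter alphabet; the array shapes and names are our own choices, not those of the report.

    import numpy as np

    ALPHABET = "ACGT"                      # assumed alphabet, so L = 4
    L_, w = len(ALPHABET), 3               # illustrative motif width w

    def pr_motif(x, f):
        """pr(X_i | theta_1); f is a w x L matrix of position-specific probabilities f_jk."""
        return np.prod([f[j, ALPHABET.index(a)] for j, a in enumerate(x)])

    def pr_background(x, f0):
        """pr(X_i | theta_2); f0 is a length-L vector of background probabilities f_0k."""
        return np.prod([f0[ALPHABET.index(a)] for a in x])

    def expected_counts(X, Zp):
        """c0[k] = sum_i sum_j Z'_i2 I(k, X_ij); c[j, k] = sum_i Z'_i1 I(k, X_ij)."""
        c0, c = np.zeros(L_), np.zeros((w, L_))
        for x, (z1, z2) in zip(X, Zp):
            for j, a in enumerate(x):
                k = ALPHABET.index(a)
                c0[k] += z2                # background responsibility
                c[j, k] += z1              # motif responsibility, position j
        return c0, c

    c0, c = expected_counts(["ACG", "TTA"], np.array([[0.7, 0.3], [0.2, 0.8]]))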
15. Mixture models: the M-step completed
- With these preliminaries, it is straightforward to maximize the expected complete-data log likelihood, given the observations X, evaluated at the initial parameters λ' and θ'. Exercise: Fill in the missing details below. We obtain
  fjk = cjk / (cj1 + … + cjL), for j = 0,1,…,w and k = 1,…,L.
- In practice, care must be taken to avoid zero frequencies, so either one uses explicit Dirichlet prior distributions, or one adds small constants βk, with Σk βk = β, giving
  fjk = (cjk + βk) / (cj1 + … + cjL + β), for j = 0,1,…,w and k = 1,…,L.
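- A sketch of this final update, adding a single small pseudocount per letter in place of an explicit Dirichlet prior (the constant 0.01 is an arbitrary illustrative choice); c0 and c are the expected counts from the previous sketch.

    import numpy as np

    def update_frequencies(c0, c, beta=0.01):
        """Return f_0k (length L) and f_jk (w x L), normalized with pseudocounts."""
        counts = np.vstack([c0, c]) + beta              # row j = 0 is the background, rows 1..w the motif
        f = counts / counts.sum(axis=1, keepdims=True)  # f_jk = (c_jk + beta) / (sum_k' c_jk' + L*beta)
        return f[0], f[1:]

    f0_new, f_new = update_frequencies(np.array([5.0, 3.0, 1.0, 0.0]), np.ones((3, 4)))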
16. Comment on the EM algorithm
- A common misconception concerning the EM algorithm is that we are estimating or predicting the missing data, plugging that estimate or prediction into the complete data log likelihood, "completing" the data you might say, and then maximizing this in the free parameters, as though we had complete data.
- This is emphatically NOT what is going on.
- A more accurate description might be this: we are using the inferred missing data to weight different parts of the complete data log likelihood differently, in such a way that the pieces combine into the maximum likelihood estimates.
17. An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences
C. E. Lawrence and A. A. Reilly, PROTEINS: Structure, Function and Genetics 7: 41-51 (1990).
- This paper was the precursor of the Gibbs-Metropolis algorithm we described in Week 11b. We now briefly describe the algorithm along the lines of the discussion just given, but in the notation used in Week 11b. On p. .. of that lecture, the full data log likelihood was given, without the term for the marginal distribution of A being visible. We suppose that the ak are independent, and uniformly distributed over 1,…,nk-W+1. Because this term does not depend on either θ0 or θ, it is absorbed into the likelihood's proportionality constant. Thus the complete data log likelihood is
  log L(θ0, θ | R, A) = h(R_Ac)·log θ0 + Σj h(R_A+j-1)·log θj,
  where we use the notation h·log θ0 = Σi hi log θ0,i (cf. p. 12, Week 11b).
18. The E-step in this case
- The expected value of the complete data log likelihood given the observed data and initial parameters θ0' and θ' is just
  ΣA pr(A | R, θ0', θ') log pr(R, A | θ0, θ),    (*)
  where the sum is over all A = (a1,…,aK), and so our task is to calculate the first term. Now we are treating all the rows (sequences) as mutually independent, so pr(A | R, θ0', θ') factorizes in k, and we need only deal with a typical row, the kth, say. Letting ak denote the random variable corresponding to the start of the motif in the kth row, then
  pr(ak = i) = 1/(nk-W+1),  i = 1,…, nk-W+1.
- We can readily calculate pr(ak = i | θ0', θ', R) by Bayes' theorem, and these values get multiplied together and inserted in (*) above. Exercise: Carry out this calculation.
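- A sketch of this Bayes calculation for a single sequence, assuming the flat prior over start positions given above; here theta0 is a background letter distribution and theta a W x L motif matrix, with names and shapes chosen by us.

    import numpy as np

    ALPHABET = "ACGT"

    def start_posterior(seq, theta0, theta):
        """Return pr(a_k = i | theta0', theta', R) for the kth sequence, i = 0,...,n_k - W."""
        W = theta.shape[0]
        idx = np.array([ALPHABET.index(a) for a in seq])
        like = []
        for i in range(len(seq) - W + 1):
            # likelihood with the motif occupying positions i,...,i+W-1 and background elsewhere;
            # the flat prior 1/(n_k - W + 1) cancels in the normalization
            motif_part = np.prod(theta[np.arange(W), idx[i:i + W]])
            backgr_part = np.prod(theta0[np.delete(idx, np.arange(i, i + W))])
            like.append(motif_part * backgr_part)
        like = np.array(like)
        return like / like.sum()

    post = start_posterior("ACGGTA", np.full(4, 0.25),
                           np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.6, 0.2]]))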
19. The M-step in this case
- Once we have calculated the expected value of the complete data log likelihood given the observed data and initial parameters θ0' and θ', we then maximize it in the free variables θ0 and θ, leading to new parameters θ0'' and θ''.
- How is this done? Without giving the gory details, just notice that (*) is a weighted combination of multinomial log likelihoods, just like the one we met in our previous example, the mixture model. There the weights were the Z'ij, and here they are the pr(A | R, θ0', θ'). It follows (Exercise: Fill in the details) that the maximizing values of θ0 and θ, which we denote by θ0'' and θ'', are ratios of expected counts similar to the c0k and cjk in the mixture discussion. As there, we will want to deal with small or zero counts by invoking Dirichlet priors.
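- A sketch of the corresponding M-step, accumulating expected letter counts weighted by the start-position posteriors from the previous sketch and then normalizing, with a small pseudocount again standing in for a Dirichlet prior; iterating start_posterior and m_step over all sequences gives the EM loop.

    import numpy as np

    ALPHABET = "ACGT"

    def m_step(seqs, posteriors, W, beta=0.01):
        """posteriors[k][i] = pr(a_k = i | theta0', theta', R); returns (theta0'', theta'')."""
        L_ = len(ALPHABET)
        c0 = np.zeros(L_)            # expected background letter counts
        c = np.zeros((W, L_))        # expected motif letter counts, by motif position
        for seq, post in zip(seqs, posteriors):
            idx = np.array([ALPHABET.index(a) for a in seq])
            for i, p in enumerate(post):
                # the candidate start i contributes with weight p = pr(a_k = i | ...)
                for j in range(W):
                    c[j, idx[i + j]] += p
                mask = np.ones(len(idx), dtype=bool)
                mask[i:i + W] = False
                np.add.at(c0, idx[mask], p)
        theta0 = (c0 + beta) / (c0 + beta).sum()
        theta = (c + beta) / (c + beta).sum(axis=1, keepdims=True)
        return theta0, theta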