1
Three examples of the EM algorithm
  • Week 12, Lecture 1
  • Statistics 246, Spring 2002

2
The estimation of linkage from the offspring of selfed heterozygotes
R A Fisher and Bhai Balmukand, Journal of Genetics 20: 79-92 (1928).
See also Collected Papers of R A Fisher, Volume II, Paper 71, pp 285-298.
3
The problem
  • In modern terminology, we have two linked bi-allelic loci, A and B say, with alleles A and a, and B and b, respectively, where A is dominant over a and B is dominant over b.
  • A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB and ab.
  • Since the loci are linked, the types AB and ab will appear with a frequency different from that of Ab and aB, say 1-r and r respectively in males, and 1-r′ and r′ respectively in females.
  • Here we suppose that the parental origin of these heterozygotes is from the mating AABB × aabb, so that r and r′ are the male and female recombination rates between the two loci.
  • The problem is to estimate r and r′, if possible, from the offspring of selfed double heterozygotes.

4
Offspring genotypic and phenotypic frequencies
  • Since gametes AB, Ab, aB and ab are produced in proportions (1-r)/2, r/2, r/2 and (1-r)/2 respectively by the male parent, and (1-r′)/2, r′/2, r′/2 and (1-r′)/2 respectively by the female parent, zygotes with genotypes AABB, AaBB, etc., are produced with frequencies (1-r)(1-r′)/4, (1-r)r′/4, etc.
  • Exercise: Complete the Punnett square of offspring genotypes and their associated frequencies. (A small code sketch of this square follows below.)
  • The problem here is this: although there are 16 distinct offspring genotypes, taking parental origin into account, the dominance relations imply that we only observe 4 distinct phenotypes, which we denote by AB, Ab, aB and ab. Here A (resp. B) denotes the dominant, while a (resp. b) denotes the recessive phenotype determined by the alleles at A (resp. B).
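
The following is a minimal Python sketch of the Punnett square computation asked for in the exercise above; the function and argument names (punnett_square, r_male, r_female) and the recombination rates used are my own illustrative choices, not from the lecture.

    from itertools import product

    def punnett_square(r_male, r_female):
        # Gamete types and proportions from each parent of the selfed AaBb heterozygote:
        # AB and ab each with proportion (1-r)/2, Ab and aB each with r/2.
        male   = {"AB": (1 - r_male) / 2, "Ab": r_male / 2,
                  "aB": r_male / 2,       "ab": (1 - r_male) / 2}
        female = {"AB": (1 - r_female) / 2, "Ab": r_female / 2,
                  "aB": r_female / 2,       "ab": (1 - r_female) / 2}
        # The 16 ordered gametic combinations (keeping parental origin) and their frequencies.
        return {(mg, fg): mp * fp
                for (mg, mp), (fg, fp) in product(male.items(), female.items())}

    square = punnett_square(0.1, 0.2)             # illustrative r and r'
    assert abs(sum(square.values()) - 1.0) < 1e-12
    print(square[("AB", "AB")])                   # AABB: (1-r)(1-r')/4 = 0.18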

5
Offspring genotypic and phenotypic probabilities,
cont.
  • Thus individuals with genotypes AABB, AaBB, AABb or AaBb, which account for 9/16 of the gametic combinations (check!), all exhibit the phenotype AB, i.e. the dominant alternative in both characters, while those with genotypes AAbb or Aabb (3/16) exhibit the phenotype Ab, those with genotypes aaBB and aaBb (3/16) exhibit the phenotype aB, and finally the double recessives aabb (1/16) exhibit the phenotype ab.
  • It is a slightly surprising fact that the probabilities of the four phenotypic classes are definable in terms of the single parameter ψ = (1-r)(1-r′), as follows: ab has probability ψ/4 (easy to see), aB and Ab both have probability (1-ψ)/4, while AB has probability 1 minus the sum of the preceding, which is (2+ψ)/4.
  • Exercise: Calculate these phenotypic probabilities. (A numerical check follows below.)
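
As a numerical check on these formulas, here is a small self-contained Python sketch that collapses the 16 gametic combinations into the 4 phenotypes and compares the result with (2+ψ)/4, (1-ψ)/4, (1-ψ)/4 and ψ/4; all names and the recombination rates are again my own.

    from itertools import product

    def phenotype_probs(r, r_prime):
        gametes = lambda rec: {"AB": (1 - rec) / 2, "Ab": rec / 2,
                               "aB": rec / 2, "ab": (1 - rec) / 2}
        probs = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
        for (mg, mp), (fg, fp) in product(gametes(r).items(), gametes(r_prime).items()):
            # Dominant phenotype at a locus unless both gametes carry the recessive allele.
            pheno = ("A" if "A" in mg[0] + fg[0] else "a") + \
                    ("B" if "B" in mg[1] + fg[1] else "b")
            probs[pheno] += mp * fp
        return probs

    r, rp = 0.1, 0.2                              # illustrative recombination rates
    psi = (1 - r) * (1 - rp)
    expected = {"AB": (2 + psi) / 4, "Ab": (1 - psi) / 4,
                "aB": (1 - psi) / 4, "ab": psi / 4}
    obs = phenotype_probs(r, rp)
    assert all(abs(obs[k] - expected[k]) < 1e-12 for k in expected)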

6
Estimation of ψ
  • Now suppose we have a random sample of n offspring from the selfing of our double heterozygote. Thus the 4 phenotypic classes will be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial:
  • Mult(n; (2+ψ)/4, (1-ψ)/4, (1-ψ)/4, ψ/4).
  • Note that here neither r nor r′ will be separately estimable from these data, but only the product (1-r)(1-r′). Note also that since we know that r ≤ 1/2 and r′ ≤ 1/2, it follows that ψ ≥ 1/4.
  • How do we estimate ψ? Fisher and Balmukand discuss a variety of methods that were in the literature at the time they wrote, and compare them with maximum likelihood, which is the method of choice in problems like this. We describe a variant on their approach to illustrate the EM algorithm. (A small simulation of the multinomial model follows below.)
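
A short simulation of such a multinomial sample, assuming numpy is available; the values of n and ψ below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, psi = 1000, 0.72                                    # illustrative values
    p = [(2 + psi) / 4, (1 - psi) / 4, (1 - psi) / 4, psi / 4]
    y1, y2, y3, y4 = rng.multinomial(n, p)                 # counts of AB, Ab, aB, ab
    print(y1, y2, y3, y4)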

7
The incomplete data formulation
  • Let us denote (cf. p. 26 of Week 11b) the counts of the 4 phenotypic classes by y1, y2, y3 and y4, these having probabilities (2+ψ)/4, (1-ψ)/4, (1-ψ)/4 and ψ/4, respectively. Now the probability of observing the genotype AABB is ψ/4, just as it is with aabb, and although this genotype is phenotypically indistinguishable from the 8 others with phenotype AB, it is convenient to imagine that we can distinguish them. So let x2 denote the count of AABB offspring, and let x1 denote the count of the remainder of that class, so that x1 + x2 = y1. Note that x1 has marginal probability 1/2 and x2 has marginal probability ψ/4. In the jargon of the EM algorithm, x1 and x2 are missing data, as we only observe their sum y1. Next, as on p. 26 of Week 11b, we let y2 = x3, y3 = x4 and y4 = x5. We now illustrate the approach of the EM algorithm, referring to material in Week 9b and Week 11b for generalities.

8
The EM algorithm for this problem
  • The complete-data log likelihood at ψ is (Ex: check)
  • (x2 + x5) log ψ + (x3 + x4) log(1 - ψ).
  • The expected value of the complete-data log likelihood given the observed data, taken at ψ′ (E-step: think of ψ′ as ψ-initial), is
  • (Eψ′(x2 | y1) + y4) log ψ + (y2 + y3) log(1 - ψ).
  • Now Eψ′(x2 | y1) is just ky1, where k = ψ′/(2 + ψ′). (Ex: check.)
  • The maximum over ψ of the expected value of the complete-data log likelihood given the observed data taken at ψ′ (M-step)
  • occurs at ψ″ = (ky1 + y4)/(ky1 + y2 + y3 + y4). (Ex: check.)
  • Here we think of ψ″ as ψ-next.
  • It should now be clear how the E-step (calculation of k) and the M-step (calculation of ψ″) can be iterated. (A small code sketch follows below.)
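
A minimal Python sketch of this iteration, using the notation of the last two slides; the function name, the fixed number of iterations and the starting value are my own choices. The counts in the final line are those used in the genetic-linkage illustration of Dempster, Laird and Rubin (1977).

    def em_linkage(y, psi=0.5, n_iter=20):
        # y = (y1, y2, y3, y4): observed counts of the phenotypes AB, Ab, aB, ab.
        y1, y2, y3, y4 = y
        for _ in range(n_iter):
            # E-step: k = psi'/(2 + psi'), so E(x2 | y1) = k * y1.
            k = psi / (2 + psi)
            x2 = k * y1
            # M-step: maximize (x2 + y4) log psi + (y2 + y3) log(1 - psi).
            psi = (x2 + y4) / (x2 + y2 + y3 + y4)
        return psi

    # Counts from the genetic-linkage illustration in Dempster, Laird and Rubin (1977).
    print(em_linkage((125, 18, 20, 34)))       # converges to the MLE of psi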

9
Comments on this example
  • This completes our discussion of this example. It appeared in the famous EM paper (Dempster, Laird and Rubin, JRSS B, 1977) without any explanation of its genetic origins. Of course it is an illustration of the EM only, for the actual likelihood equation generated by the observed data alone is a quadratic, and so easy to solve (see Fisher and Balmukand). Thus it is not necessary to use the EM in this case (some would say in any case, but that is for another time). We have omitted all of the fascinating detail provided in Fisher and Balmukand, and similarly in Dempster et al. Read these papers: both are classics, with much of interest to you. Rather than talk about details concerning the EM (most importantly, starting and stopping it, the issue of global maxima, and SEs for parameter estimates), I turn to another important EM example: mixtures.

10
Fitting a mixture model by EM to discover motifs in biopolymers
T L Bailey and C Elkan, UCSD Technical Report CS94-351; ISMB94.
  • Here we outline some more EM theory, this being relevant to motif discovery. We follow the above report, as we will be discussing the program MEME written by these authors in a later lecture. This part is called MM.
  • A finite mixture model supposes that data X = (X1,…,Xn) arise from two or more groups with distributions of known form but different, unknown parameters θ = (θ1,…,θg), where g is the number of groups, and mixing parameters λ = (λ1,…,λg), where the λj are non-negative and sum to 1.
  • It is convenient to introduce indicator vectors Z = (Z1,…,Zn), where Zi = (Zi1,…,Zig), and Zij = 1 if Xi is from group j, and 0 otherwise. Thus Zi gives the group membership of the ith sample. It follows that pr(Zij = 1 | θ, λ) = λj. For any given i, all the Zij are 0 apart from one.

11
Complete data log likelihood
  • Under the assumption that the pairs (Zi, Xi) are mutually independent, their joint density may be written (Exercise: carry out this calculation in detail)
  • pr(Z, X | θ, λ) = ∏_i ∏_j [λj pr(Xi | θj)]^Zij.
  • The complete-data log likelihood is thus
  • log L(θ, λ | Z, X) = ∑_i ∑_j Zij log[λj pr(Xi | θj)].
  • The EM algorithm iteratively computes the expectation of this quantity given the observed data X and initial estimates θ′ and λ′ of θ and λ (the E-step), and then maximizes the result in the free variables θ and λ, leading to new estimates θ″ and λ″ (the M-step). Our interest here is in the particular calculations necessary to carry out these two steps. (A small illustration in code follows below.)
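
Here is a small Python sketch of this complete-data log likelihood. The slide leaves pr(Xi | θj) generic; purely for illustration, the sketch takes the components to be univariate normal densities (scipy.stats.norm), and the data, indicators and parameter values are invented.

    import numpy as np
    from scipy.stats import norm

    def complete_data_loglik(X, Z, lam, theta):
        # theta is a list of (mu_j, sigma_j) pairs standing in for the generic theta_j.
        log_p = np.column_stack([norm.logpdf(X, mu, sigma) for mu, sigma in theta])
        # sum_i sum_j Z_ij * log( lambda_j * pr(X_i | theta_j) )
        return np.sum(Z * (np.log(lam) + log_p))

    X = np.array([-1.2, 0.1, 3.0, 2.8])                     # invented observations
    Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])          # invented group indicators
    print(complete_data_loglik(X, Z, lam=np.array([0.5, 0.5]),
                               theta=[(0.0, 1.0), (3.0, 1.0)]))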

12
Mixture models: the E-step
  • Since the log likelihood is the sum over i and j of terms multiplying Zij, and these are independent across i, we need only consider the expectation of one such term, given Xi. Using initial parameter values θ′ and λ′, and the fact that the Zij are binary, we get
  • E(Zij | X, θ′, λ′) = λ′j pr(Xi | θ′j) / ∑_k λ′k pr(Xi | θ′k) = Z̃ij, say.
  • Exercise: Obtain this result. (A code sketch follows below.)
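
Continuing the Gaussian stand-in from the previous sketch, the E-step can be written as follows (the log-domain normalization is only for numerical stability):

    import numpy as np
    from scipy.stats import norm

    def e_step(X, lam, theta):
        # Z~_ij = lam_j pr(X_i | theta_j) / sum_k lam_k pr(X_i | theta_k)
        log_w = np.log(lam) + np.column_stack([norm.logpdf(X, mu, s) for mu, s in theta])
        w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        return w / w.sum(axis=1, keepdims=True)

    Z_tilde = e_step(np.array([-1.2, 0.1, 3.0, 2.8]),
                     np.array([0.5, 0.5]), [(0.0, 1.0), (3.0, 1.0)])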

13
Mixture models: the M-step
  • Here our task is to maximize the result of an E-step,
  • ∑_i ∑_j Z̃ij log λj + ∑_i ∑_j Z̃ij log pr(Xi | θj).
  • The maximization over λ is clearly independent of the rest and is readily seen (Ex: check this) to be achieved with
  • λ″j = ∑_i Z̃ij / n.
  • Maximizing over θ requires that we specify the model in more detail. The case of interest to us is where g = 2, and the distributions for class 1 (the motif) and class 2 (the background) are given by position-specific multinomial distributions and a single general multinomial distribution, respectively. (A code sketch of the M-step follows below.)
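
Continuing the same sketch, the M-step below updates λ exactly as above; the θ update shown is the one appropriate to the Gaussian stand-in (Z̃-weighted means and standard deviations), not to the motif model described on the next slides.

    import numpy as np

    def m_step(X, Z_tilde):
        n = len(X)
        lam = Z_tilde.sum(axis=0) / n                      # lambda''_j = sum_i Z~_ij / n
        theta = []
        for j in range(Z_tilde.shape[1]):
            w = Z_tilde[:, j]
            mu = np.sum(w * X) / w.sum()                   # weighted mean
            sigma = np.sqrt(np.sum(w * (X - mu) ** 2) / w.sum())
            theta.append((mu, sigma))
        return lam, theta

    # Z_tilde as computed by e_step above.
    lam_new, theta_new = m_step(np.array([-1.2, 0.1, 3.0, 2.8]), Z_tilde)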

14
Mixture models: the M-step, cont.
  • Our initial observations are supposed to consist of N sequences over an L-letter alphabet. Unlike what we did in the last lecture, these sequences are now broken up into all n overlapping subsequences of length w, and these are our Xi. We need w+1 sets of probabilities, namely fjk and f0k, where j = 1,…,w (the length of the motif) and k runs over the symbol alphabet.
  • With these parameters, we can write
  • pr(Xi | θ1) = ∏_j ∏_k fjk^I(k, Xij), and pr(Xi | θ2) = ∏_j ∏_k f0k^I(k, Xij),
  • where Xij is the letter in the jth position of sample i, and I(k, a) = 1 if a = ak, and 0 otherwise. With this notation, for k = 1,…,L, write
  • c0k = ∑_i ∑_j Z̃i2 I(k, Xij), and cjk = ∑_i Z̃i1 I(k, Xij).
  • Here c0k is the expected number of times letter ak appears in the background, and cjk the expected number of times ak appears at position j in occurrences of the motif in the data. (These counts are sketched in code below.)
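
A Python sketch of these expected counts. Here X is a list of length-w subsequences (strings), the two columns of Z_tilde hold the weights Z̃i1 (motif) and Z̃i2 (background), and all names and the toy data are my own.

    import numpy as np

    def expected_letter_counts(X, Z_tilde, alphabet):
        w, L = len(X[0]), len(alphabet)
        index = {a: k for k, a in enumerate(alphabet)}
        c0 = np.zeros(L)            # c_0k: expected background counts of letter a_k
        c = np.zeros((w, L))        # c_jk: expected motif counts of a_k at position j
        for i, x in enumerate(X):
            for j, letter in enumerate(x):
                k = index[letter]
                c0[k] += Z_tilde[i, 1]       # weight of sample i for the background class
                c[j, k] += Z_tilde[i, 0]     # weight of sample i for the motif class
        return c0, c

    X = ["ACGT", "ACGA", "TTTT"]                                  # toy subsequences, w = 4
    Z_tilde = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])      # invented weights
    c0, c = expected_letter_counts(X, Z_tilde, "ACGT")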

15
Mixture models: the M-step completed
  • With these preliminaries, it is straightforward to maximize the expected complete-data log likelihood, given the observations X, evaluated at initial parameters θ′ and λ′. Exercise: Fill in the details missing below. We obtain
  • f″jk = cjk / ∑_{k′=1}^{L} cjk′ ,   j = 0, 1,…,w; k = 1,…,L.
  • In practice, care must be taken to avoid zero frequencies, so either one uses explicit Dirichlet prior distributions, or one adds small pseudocounts βk, with β = ∑_k βk, giving (as sketched in code below)
  • f″jk = (cjk + βk) / (∑_{k′=1}^{L} cjk′ + β) ,   j = 0, 1,…,w; k = 1,…,L.
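
A sketch of this update, reusing the counts from the previous sketch; the equal pseudocount used as a default here is arbitrary.

    import numpy as np

    def update_frequencies(c0, c, beta_k=0.01):
        # Stack background (row 0) and motif positions (rows 1..w); add pseudocount beta_k
        # to each cell and beta = L * beta_k to each row total.
        counts = np.vstack([c0, c])
        L = counts.shape[1]
        return (counts + beta_k) / (counts.sum(axis=1, keepdims=True) + L * beta_k)

    f = update_frequencies(c0, c)        # f[j, k] = f''_jk, with row 0 giving f''_0k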

16
Comment on the EM algorithm
  • A common misconception concerning the EM algorithm is that we are estimating or predicting the missing data, plugging that estimate or prediction into the complete-data log likelihood (completing the data, you might say), and then maximizing this in the free parameters, as though we had complete data.
  • This is emphatically NOT what is going on.
  • A more accurate description might be this: we are using the inferred missing data to weight different parts of the complete-data log likelihood differently, in such a way that the pieces combine into the maximum likelihood estimates.

17
An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences
C E Lawrence and A A Reilly, PROTEINS: Structure, Function and Genetics 7: 41-51 (1990)
  • This paper was the precursor of the Gibbs-Metropolis algorithm we described in Week 11b. We now briefly describe the algorithm along the lines of the discussion just given, but in the notation used in Week 11b. On p. .. of that lecture, the full data log likelihood was given, without the term for the marginal distribution of A being visible. We suppose that the ak are independent, and uniformly distributed over 1,…,nk - W + 1. Because this term does not depend on either θ0 or Θ, it disappears into the likelihood proportionality constant. Thus the complete-data log likelihood is
  • log L(θ0, Θ | R, A) = h(R_{A^c})·log θ0 + ∑_j h(R_{A+j-1})·log θj ,
  • where for a count vector ν we use the notation ν·log θ0 = ∑_i νi log θ0,i; cf. p. 12, Week 11b. (A code sketch follows below.)
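
A Python sketch of this complete-data log likelihood, computed letter by letter. The 0-based start positions, the argument names and the DNA alphabet default are my own conventions; theta0 is a length-L vector of background frequencies and Theta a W x L matrix of motif position frequencies.

    import numpy as np

    def complete_loglik(R, A, W, theta0, Theta, alphabet="ACGT"):
        # h(R_{A^c}) . log(theta0) + sum_j h(R_{A+j-1}) . log(theta_j)
        index = {a: k for k, a in enumerate(alphabet)}
        ll = 0.0
        for seq, a in zip(R, A):
            for pos, letter in enumerate(seq):
                k = index[letter]
                if a <= pos < a + W:
                    ll += np.log(Theta[pos - a, k])   # residue inside the motif occurrence
                else:
                    ll += np.log(theta0[k])           # background residue
        return ll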

18
The E-step in this case.
  • The expected value of the complete-data log likelihood given the observed data and initial parameters θ0′ and Θ′ is just
  • ∑_A pr(A | R, θ0′, Θ′) log pr(R, A | θ0, Θ)      (*)
  • where the sum is over all A = (a1,…,aK), and so our task is to calculate the first term. Now we are treating all the rows (sequences) as mutually independent, so pr(A | R, θ0′, Θ′) factorizes in k, and we need only deal with a typical row, the kth, say. Letting ak denote the random variable corresponding to the start of the motif in the kth row, then
  • pr(ak = i) = 1/(nk - W + 1),   i = 1,…,nk - W + 1.
  • We can readily calculate pr(ak = i | θ0′, Θ′, R) by Bayes' theorem, and these get multiplied together and inserted in (*) above. Exercise: Carry out this calculation. (A code sketch follows below.)
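
For a single sequence, the Bayes calculation might look as follows in Python (same conventions as the previous sketch; the prior over start positions is uniform, as on the slide):

    import numpy as np

    def start_posterior(seq, W, theta0, Theta, alphabet="ACGT"):
        # pr(a_k = i | R_k, theta0', Theta') is proportional to pr(R_k | a_k = i, theta0', Theta'),
        # since the prior over i is uniform on 0, ..., n_k - W.
        index = {a: k for k, a in enumerate(alphabet)}
        log_w = []
        for i in range(len(seq) - W + 1):
            ll = 0.0
            for pos, letter in enumerate(seq):
                k = index[letter]
                ll += np.log(Theta[pos - i, k]) if i <= pos < i + W else np.log(theta0[k])
            log_w.append(ll)
        log_w = np.array(log_w)
        w = np.exp(log_w - log_w.max())
        return w / w.sum()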

19
The M-step in this case.
  • Once we have calculated the expected value of the complete-data log likelihood given the observed data and initial parameters θ0′ and Θ′, we then maximize it in the free variables θ0 and Θ, leading to new parameters θ0″ and Θ″.
  • How is this done? Without giving the gory details, just notice that (*) is a weighted combination of multinomial log likelihoods, just like the one we met in our previous example, the mixture model. There the weights were the Z̃ij's, and here they are the pr(A | R, θ0′, Θ′)'s. It follows (Exercise: fill in the details) that the maximizing values of θ0 and Θ, which we denote by θ0″ and Θ″, are ratios of expected counts similar to the c0k and cjk in the mixture discussion. As there, we will want to deal with small or zero counts by invoking Dirichlet priors.