Title: Mixture Language Models and EM Algorithm
1. Mixture Language Models and EM Algorithm
- (Lecture for CS397-CXZ Intro Text Info Systems)
- Sept. 17, 2003
- ChengXiang Zhai
- Department of Computer Science
- University of Illinois, Urbana-Champaign
2. Rest of this Lecture
- Unigram mixture models
- Slightly more sophisticated unigram LMs
- Related to smoothing
- EM algorithm
- VERY useful for estimating parameters of a mixture
  model or when latent/hidden variables are involved
- Will occur again and again in the course
3. Modeling a Multi-topic Document
A document with 2 types of vocabulary:
  [text mining passage] [food nutrition passage] [text mining passage]
  [text mining passage] [food nutrition passage]
How do we model such a document? How do we generate such a document?
How do we estimate our model?
Solution: a mixture model + EM
4. Simple Unigram Mixture Model
Model/topic 1: p(w|θ1), weight λ = 0.7
  text 0.2, mining 0.1, association 0.01, clustering 0.02, food 0.00001
Model/topic 2: p(w|θ2), weight 1 - λ = 0.3
  food 0.25, nutrition 0.1, healthy 0.05, diet 0.02
p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2)
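To make the generative story concrete, here is a minimal Python sketch of how a word's probability and a document's log-likelihood would be computed under this two-topic mixture. The probability tables mirror slide 4; the example document and the fallback probability for unlisted words are illustrative assumptions, not values from the lecture.

```python
import math

# Word distributions for the two topics (values from slide 4); words a topic
# does not list get a tiny assumed fallback probability so the sketch runs.
THETA1 = {"text": 0.2, "mining": 0.1, "association": 0.01,
          "clustering": 0.02, "food": 0.00001}
THETA2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
LAM = 0.7    # mixing weight lambda for topic 1
EPS = 1e-6   # assumed fallback probability for unlisted words

def p_mix(word):
    """p(w | theta1, theta2) = lambda * p(w|theta1) + (1 - lambda) * p(w|theta2)."""
    return LAM * THETA1.get(word, EPS) + (1 - LAM) * THETA2.get(word, EPS)

def log_likelihood(doc):
    """Log-likelihood of a document under the mixture (unigram independence)."""
    return sum(math.log(p_mix(w)) for w in doc)

# Hypothetical document mixing the two vocabularies.
doc = ["text", "mining", "clustering", "food", "nutrition"]
print(p_mix("text"), p_mix("food"))
print(log_likelihood(doc))
```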
5. Parameter Estimation
Likelihood: p(d|λ, θ1, θ2) = Π_{w∈d} [ λ p(w|θ1) + (1 - λ) p(w|θ2) ]
- Estimation scenarios
- p(w|θ1), p(w|θ2) are known; estimate λ
- p(w|θ1), λ are known; estimate p(w|θ2)
- p(w|θ1) is known; estimate λ, p(w|θ2)
- λ is known; estimate p(w|θ1), p(w|θ2)
- Estimate λ, p(w|θ1), p(w|θ2) (this is clustering)
6. Parameter Estimation Example: Given p(w|θ1) and p(w|θ2), Estimate λ
Maximum Likelihood: find the λ that maximizes the likelihood of the
observed document.
The Expectation-Maximization (EM) algorithm is a commonly used method.
Basic idea: start from some random guess of the parameter value, then
iteratively improve the estimate (hill climbing):
- E-step: compute the lower bound
- M-step: find a new λ that maximizes the lower bound
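The slide's update formulas do not survive in the extracted text, so the following is the standard form of the E-step and M-step for this two-component case, with c(w, d) denoting the count of word w in document d:

```latex
% E-step: posterior probability that word w was generated from topic 1,
% given the current estimate lambda^(n)
p(z_w = 1 \mid w) \;=\;
  \frac{\lambda^{(n)}\, p(w \mid \theta_1)}
       {\lambda^{(n)}\, p(w \mid \theta_1) + \bigl(1 - \lambda^{(n)}\bigr)\, p(w \mid \theta_2)}

% M-step: re-estimate lambda as the expected fraction of word tokens from topic 1
\lambda^{(n+1)} \;=\;
  \frac{\sum_{w} c(w, d)\, p(z_w = 1 \mid w)}{\sum_{w} c(w, d)}
```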
7. EM Algorithm Intuition
Observed Doc d
Model/topic 1: p(w|θ1), weight λ = ?
  text 0.2, mining 0.1, association 0.01, clustering 0.02, food 0.00001
Model/topic 2: p(w|θ2), weight 1 - λ = ?
  food 0.25, nutrition 0.1, healthy 0.05, diet 0.02
Suppose we know the identity of each word.
p(w|θ1, θ2) = λ p(w|θ1) + (1 - λ) p(w|θ2)
8. Can We Guess the Identity?
Identity (hidden) variable: z_w = 1 (w is from θ1), z_w = 0 (w is from θ2)
  the paper presents a text mining algorithm the paper ...
  z_w:  1   1     1      1  0    0      0        1   0   ...
What's a reasonable guess?
- depends on λ (why?)
- depends on p(w|θ1) and p(w|θ2) (how?)
Initially, set λ to some random value, then iterate.
9. An Example of EM Computation
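The table of numbers on this slide is not reproduced in the extracted text, so the following is a stand-in: a minimal, runnable Python sketch of one such EM computation for the scenario of slide 6 (p(w|θ1) and p(w|θ2) fixed, only λ estimated). The topic-1 probabilities follow slide 4, while the topic-2 probabilities for non-food words, the document counts, and the starting λ = 0.5 are illustrative assumptions rather than values from the lecture.

```python
# EM for estimating the mixing weight lambda in a two-topic unigram mixture.
P1 = {"text": 0.2, "mining": 0.1, "clustering": 0.02, "food": 0.00001}
P2 = {"text": 0.00001, "mining": 0.00001, "clustering": 0.00001, "food": 0.25}

doc_counts = {"text": 2, "mining": 2, "clustering": 1, "food": 1}  # c(w, d), hypothetical

lam = 0.5  # initial guess for lambda
for it in range(5):
    # E-step: posterior probability that each word token came from topic 1
    z1 = {w: lam * P1[w] / (lam * P1[w] + (1 - lam) * P2[w]) for w in doc_counts}
    # M-step: new lambda = expected fraction of word tokens generated by topic 1
    total = sum(doc_counts.values())
    lam = sum(doc_counts[w] * z1[w] for w in doc_counts) / total
    print(f"iteration {it + 1}: lambda = {lam:.4f}")
```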
10. Any Theoretical Guarantee?
- EM is guaranteed to reach a LOCAL maximum
- When the local maximum is also the global maximum, EM can
  find the global maximum
- But when there are multiple local maxima, special techniques
  are needed (e.g., try different initial values)
- In our case, there is one unique local maximum (why?)
11. A General Introduction to EM
Data: X (observed), H (hidden); parameter: θ
Incomplete likelihood: L(θ) = log p(X|θ)
Complete likelihood: L_c(θ) = log p(X, H|θ)
EM tries to iteratively maximize the complete likelihood.
Starting with an initial guess θ^(0):
1. E-step: compute the expectation of the complete likelihood (the Q-function)
2. M-step: compute θ^(n) by maximizing the Q-function
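For reference, the Q-function used in the M-step is the expected complete log-likelihood under the posterior over the hidden variables given the previous parameter estimate; the formula itself does not survive in the extracted text, so this is the standard definition:

```latex
Q(\theta;\, \theta^{(n-1)})
  \;=\; E_{\,p(H \mid X,\, \theta^{(n-1)})}\!\left[ L_c(\theta) \right]
  \;=\; \sum_{H} p(H \mid X,\, \theta^{(n-1)}) \, \log p(X, H \mid \theta)
```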
12. Convergence Guarantee
Goal: maximize the incomplete likelihood L(θ) = log p(X|θ),
i.e., choose θ^(n) so that L(θ^(n)) - L(θ^(n-1)) ≥ 0.
Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ),
  L(θ) = L_c(θ) - log p(H|X, θ)
  L(θ^(n)) - L(θ^(n-1)) = L_c(θ^(n)) - L_c(θ^(n-1))
                          + log [ p(H|X, θ^(n-1)) / p(H|X, θ^(n)) ]
Taking the expectation w.r.t. p(H|X, θ^(n-1)):
  L(θ^(n)) - L(θ^(n-1)) = Q(θ^(n); θ^(n-1)) - Q(θ^(n-1); θ^(n-1))
                          + D( p(H|X, θ^(n-1)) || p(H|X, θ^(n)) )
EM chooses θ^(n) to maximize Q, and the KL-divergence term is
always non-negative.
Therefore, L(θ^(n)) ≥ L(θ^(n-1))!
13. Another Way of Looking at EM
[Figure: the likelihood p(X|θ) plotted against θ, showing the current guess
θ^(n-1), the next guess θ^(n), and the lower bound (Q function) touching the
likelihood curve at the current guess.]
L(θ) = L(θ^(n-1)) + Q(θ; θ^(n-1)) - Q(θ^(n-1); θ^(n-1))
       + D( p(H|X, θ^(n-1)) || p(H|X, θ) )
E-step: computing the lower bound
M-step: maximizing the lower bound
14. What You Should Know
- Why is the unigram language model so important?
- What is a unigram mixture language model?
- How to estimate the parameters of a simple unigram
  mixture model using EM
- Know the general idea of EM (EM will be covered
  again later in the course)