Title: Hidden Markov Models
1. Hidden Markov Models
Lecture 6, Thursday April 17, 2003
2. Review of Last Lecture
3. Decoding
- GIVEN x = x1 x2 … xN
- We want to find π = π1, …, πN
- such that P(x, π) is maximized
- π* = argmax_π P(x, π)
- We can use dynamic programming!
- Let Vk(i) = max_{π1, …, πi-1} P(x1…xi-1, π1, …, πi-1, xi, πi = k)
- Probability of the most likely sequence of states ending at state πi = k
4. The Viterbi Algorithm
[Figure: the dynamic-programming matrix Vj(i), with states 1, 2, …, K as rows and positions x1 x2 x3 … xN as columns.]
- Similar to aligning a set of states to a sequence (a minimal code sketch follows this slide)
- Time: O(K²N)
- Space: O(KN)
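A minimal sketch of this recursion in Python/NumPy (not from the lecture; the names viterbi, init, trans, emit are mine). It works directly in probability space, which mirrors Vk(i) above but underflows on long sequences; slide 10 discusses the log-space fix.

```python
import numpy as np

def viterbi(x, init, trans, emit):
    """Most likely state path for observations x (probability space).

    init[k]    : P(pi_1 = k)                shape (K,)
    trans[k,l] : a_kl = P(next = l | k)     shape (K, K)
    emit[k,b]  : e_k(b)                     shape (K, B)
    x          : list of symbol indices 0..B-1
    Returns (best state path, P(x, pi*)).
    """
    N, K = len(x), len(init)
    V = np.zeros((K, N))                # V[k, i] = prob. of best path ending in state k at position i
    ptr = np.zeros((K, N), dtype=int)   # back-pointers for the traceback

    V[:, 0] = init * emit[:, x[0]]
    for i in range(1, N):
        for l in range(K):
            scores = V[:, i - 1] * trans[:, l]      # V_k(i-1) * a_kl for all k
            ptr[l, i] = int(np.argmax(scores))
            V[l, i] = emit[l, x[i]] * scores[ptr[l, i]]

    path = [int(np.argmax(V[:, -1]))]   # best final state
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    path.reverse()
    return path, float(V[path[-1], -1])
```

For the dishonest casino, x would be the observed die faces (as indices) and init/trans/emit the Fair/Loaded parameters.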
5. Evaluation
- Compute:
- P(x): probability of x given the model
- P(xi…xj): probability of a substring of x given the model
- P(πi = k | x): probability that the i-th state is k, given x
6. The Forward Algorithm
- We can compute fk(i) for all k, i, using dynamic programming! (a minimal code sketch follows this slide)
- Initialization:
- f0(0) = 1
- fk(0) = 0, for all k > 0
- Iteration:
- fl(i) = el(xi) Σk fk(i-1) akl
- Termination:
- P(x) = Σk fk(N) ak0
- where ak0 is the probability that the terminating state is k (usually taken to be a0k)
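A sketch of the same recursion, assuming a start state 0 with transitions a0 into the K real states and end transitions a_end (my own argument names); the slide's initialization fk(0) is folded into the first column.

```python
import numpy as np

def forward(x, a0, trans, a_end, emit):
    """Forward algorithm, following the slide's recursion (probability space).

    a0[k]    : a_0k, transition from the start state 0 into state k
    trans    : a_kl among the K real states, shape (K, K)
    a_end[k] : a_k0, transition from state k into the end state
    emit     : e_k(b), shape (K, B); x is a list of symbol indices
    Returns (f, P(x)) where f[k, i] = f_k(i).
    """
    N, K = len(x), len(a0)
    f = np.zeros((K, N))
    f[:, 0] = a0 * emit[:, x[0]]                        # f_k(1) = a_0k e_k(x1)
    for i in range(1, N):
        f[:, i] = emit[:, x[i]] * (f[:, i - 1] @ trans) # f_l(i) = e_l(xi) sum_k f_k(i-1) a_kl
    px = float(np.sum(f[:, -1] * a_end))                # P(x) = sum_k f_k(N) a_k0
    return f, px
```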
7. The Backward Algorithm
- We can compute bk(i) for all k, i, using dynamic programming (a minimal code sketch follows this slide)
- Initialization:
- bk(N) = ak0, for all k
- Iteration:
- bk(i) = Σl el(xi+1) akl bl(i+1)
- Termination:
- P(x) = Σl a0l el(x1) bl(1)
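A matching backward sketch under the same assumptions (start distribution a0, end transitions a_end, my own names); the termination line reproduces P(x) = Σl a0l el(x1) bl(1), which is a handy consistency check against the forward value.

```python
import numpy as np

def backward(x, a0, trans, a_end, emit):
    """Backward algorithm matching the slide's recursion.

    Returns (b, P(x)) where b[k, i] holds b_k(i+1) in 0-based indexing.
    """
    N, K = len(x), len(a0)
    b = np.zeros((K, N))
    b[:, N - 1] = a_end                                # b_k(N) = a_k0
    for i in range(N - 2, -1, -1):
        # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
        b[:, i] = trans @ (emit[:, x[i + 1]] * b[:, i + 1])
    px = float(np.sum(a0 * emit[:, x[0]] * b[:, 0]))   # P(x) = sum_l a_0l e_l(x1) b_l(1)
    return b, px
```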
8. Posterior Decoding
- We can now calculate
- P(πi = k | x) = fk(i) bk(i) / P(x)
- Then, we can ask:
- What is the most likely state at position i of sequence x?
- Define π̂ by Posterior Decoding:
- π̂i = argmaxk P(πi = k | x)
(a minimal code sketch follows this slide)
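A posterior-decoding sketch under the same assumptions as the forward/backward sketches above (my own names). It recomputes f and b inline so it runs on its own, and returns both the argmax states and the full posterior matrix.

```python
import numpy as np

def posterior_decode(x, a0, trans, a_end, emit):
    """Posterior decoding: pi_hat_i = argmax_k P(pi_i = k | x)."""
    N, K = len(x), len(a0)
    f = np.zeros((K, N)); b = np.zeros((K, N))
    f[:, 0] = a0 * emit[:, x[0]]
    for i in range(1, N):
        f[:, i] = emit[:, x[i]] * (f[:, i - 1] @ trans)
    b[:, N - 1] = a_end
    for i in range(N - 2, -1, -1):
        b[:, i] = trans @ (emit[:, x[i + 1]] * b[:, i + 1])
    px = np.sum(f[:, -1] * a_end)
    posterior = f * b / px            # posterior[k, i] = P(pi_i = k | x)
    return posterior.argmax(axis=0), posterior
```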
9. Today
- Example: CpG Islands
- Learning
10. Implementation Techniques
- Viterbi: sum of logs
- Vl(i) = log el(xi) + maxk [Vk(i-1) + log akl]
- Forward/Backward: scaling by c(i)
- One way to perform scaling (sketched in code below):
- fl(i) = c(i) × [el(xi) Σk fk(i-1) akl]
- where c(i) = 1/(Σk fk(i))
- bl(i): use the same factors c(i)
- Details in Rabiner's Tutorial on HMMs, 1989
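A sketch of the scaling idea for the forward values (my own naming; end-state transitions omitted for brevity): each column is rescaled to sum to 1, and log P(x) is accumulated from the scaling factors, so nothing underflows.

```python
import numpy as np

def forward_scaled(x, a0, trans, emit):
    """Forward algorithm with per-position scaling, as on the slide.

    Each column is rescaled by c(i) = 1 / sum_k f_k(i); log P(x) is then
    recovered as the sum of the log column totals (= -sum_i log c(i)).
    """
    N, K = len(x), len(a0)
    f_hat = np.zeros((K, N))            # scaled forward values
    log_px = 0.0
    col = a0 * emit[:, x[0]]
    for i in range(N):
        if i > 0:
            col = emit[:, x[i]] * (f_hat[:, i - 1] @ trans)
        total = col.sum()               # this is 1 / c(i)
        f_hat[:, i] = col / total
        log_px += np.log(total)
    return f_hat, log_px
```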
11. A Modeling Example
- CpG islands in DNA sequences
12. Example: CpG Islands
CpG dinucleotides in the genome are frequently methylated. (We write CpG to avoid confusion with a C·G base pair.) C → methyl-C → T. Methylation is often suppressed around genes and promoters → CpG islands.
13. Example: CpG Islands
- In CpG islands,
- CG is more frequent
- Other pairs (AA, AG, AT) have different frequencies
- Question: detect CpG islands computationally
14. A Model of CpG Islands (1): Architecture
[Figure: eight-state HMM. States A+, C+, G+, T+ model "CpG island"; states A-, C-, G-, T- model "not CpG island"; transitions are allowed within each group and between the two groups.]
15. A Model of CpG Islands (2): Transitions
- How do we estimate the parameters of the model?
- Emission probabilities: 1/0
- Transition probabilities within CpG islands:
- established from many known (experimentally verified) CpG islands (training set)
- Transition probabilities within other regions:
- established from many known non-CpG islands

Within CpG islands (+):
      A      C      G      T
A   .180   .274   .426   .120
C   .171   .368   .274   .188
G   .161   .339   .375   .125
T   .079   .355   .384   .182

Outside CpG islands (-):
      A      C      G      T
A   .300   .205   .285   .210
C   .233   .298   .078   .302
G   .248   .246   .298   .208
T   .177   .239   .292   .292
16. Parenthesis: log likelihoods
A better way to see the effect of transitions: log likelihoods
L(u, v) = log [ P(uv | +) / P(uv | -) ]
Given a region x = x1…xN, a quick-and-dirty way to decide whether the entire x is a CpG island:
P(x is CpG) > P(x is not CpG) ⇔ Σi L(xi, xi+1) > 0
(a small code sketch follows the table)

      A         C        G        T
A   -0.740    0.419    0.580   -0.803
C   -0.913    0.302    1.812   -0.685
G   -0.624    0.461    0.331   -0.730
T   -1.169    0.573    0.393   -0.679
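A tiny sketch of this quick-and-dirty test, with the table above copied into a dictionary (the names LLR and cpg_score are mine, not from the lecture):

```python
# Log-likelihood ratios L(u, v) from the table above.
LLR = {
    'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
}

def cpg_score(x):
    """Sum_i L(x_i, x_{i+1}); > 0 suggests the whole region looks like a CpG island."""
    return sum(LLR[u][v] for u, v in zip(x, x[1:]))

print(cpg_score("CGCGCGCG"))   # strongly positive: CpG-like
print(cpg_score("ATATATAT"))   # strongly negative: not CpG-like
```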
17. A Model of CpG Islands (2): Transitions
- What about transitions between (+) and (-) states?
- They affect:
- avg. length of a CpG island
- avg. separation between two CpG islands

Length distribution of region X:
P[lX = 1] = 1-p
P[lX = 2] = p(1-p)
…
P[lX = k] = p^(k-1) (1-p)
E[lX] = 1/(1-p)
Exponential (geometric) length distribution, with mean 1/(1-p)
[Figure: two-state length model. State X has self-transition probability p and moves to Y with probability 1-p; state Y has self-transition probability q and moves back to X with probability 1-q.]
18. A Model of CpG Islands (2): Transitions
- No reason to favor exiting/entering the (+) and (-) regions at a particular nucleotide
- To determine the transition probabilities between (+) and (-) states:
- Estimate the average length of a CpG island: lCpG = 1/(1-p) ⇒ p = 1 - 1/lCpG
- For each pair of (+) states k, l, rescale: akl ← p · akl
- For each (+) state k and (-) state l, let akl = (1-p)/4
- (better: take the frequency of l in the (-) regions into account)
- Do the same for the (-) states
- A problem with this model: CpG islands don't have an exponential length distribution
- This is a defect of HMMs, compensated by ease of analysis & computation
(a tiny numeric sketch of this recipe follows)
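A tiny numeric sketch of the recipe above (the average island length of 1000 is purely illustrative, not a value from the lecture; variable names are mine):

```python
mean_island_len = 1000          # assumed average CpG-island length (illustrative)
p = 1 - 1 / mean_island_len     # probability of staying among the (+) states
a_plus_to_minus = (1 - p) / 4   # transition from each (+) state to each (-) state

print(p, a_plus_to_minus)       # 0.999, 0.00025
# Within the (+) sub-model, each existing a_kl is then rescaled: a_kl <- p * a_kl
```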
19. Applications of the Model
- Given a DNA region x,
- the Viterbi algorithm predicts locations of CpG islands
- Given a nucleotide xi (say xi = A),
- the Viterbi parse tells whether xi is in a CpG island in the most likely general scenario
- The Forward/Backward algorithms can calculate
- P(xi is in a CpG island) = P(πi = A+ | x)
- Posterior decoding can assign locally optimal predictions of CpG islands
- π̂i = argmaxk P(πi = k | x)
20. What if a new genome comes?
- We just sequenced the porcupine genome
- We know CpG islands play the same role in this genome
- However, we have no known CpG islands for porcupines
- We suspect the frequency and characteristics of CpG islands are quite different in porcupines
- How do we adjust the parameters in our model?
- LEARNING
21. Problem 3: Learning
- Re-estimate the parameters of the model based on training data
22. Two learning scenarios
- Estimation when the "right answer" is known
- Examples:
- GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
- GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
- Estimation when the "right answer" is unknown
- Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
- QUESTION: update the parameters θ of the model to maximize P(x | θ)
23. 1. When the right answer is known
- Given x = x1…xN
- for which the true π = π1…πN is known,
- Define:
- Akl = # of times the k→l transition occurs in π
- Ek(b) = # of times state k in π emits b in x
- We can show that the maximum likelihood parameters θ are:
- akl = Akl / Σi Aki
- ek(b) = Ek(b) / Σc Ek(c)
(a minimal counting sketch follows this slide)
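A minimal counting sketch of these formulas (my own function and argument names). The optional r_trans/r_emit arguments implement the pseudocounts of slide 25, and the example at the bottom reproduces the overfitting of the 10-roll example on the next slide.

```python
from collections import Counter

def normalize(counts, keys, pseudo=0.0):
    """Turn raw counts into probabilities over `keys`, with optional pseudocounts."""
    totals = {k: counts.get(k, 0) + pseudo for k in keys}
    z = sum(totals.values())
    return {k: (totals[k] / z if z > 0 else 0.0) for k in keys}

def estimate_from_labeled(x, pi, states, alphabet, r_trans=0.0, r_emit=0.0):
    """Maximum-likelihood estimation when the state path pi is known (slide 23)."""
    A = Counter(zip(pi, pi[1:]))   # A_kl = # of k->l transitions in pi
    E = Counter(zip(pi, x))        # E_k(b) = # of times state k emits b
    a = {k: normalize({l: A[(k, l)] for l in states}, states, r_trans) for k in states}
    e = {k: normalize({b: E[(k, b)] for b in alphabet}, alphabet, r_emit) for k in states}
    return a, e

# The 10-roll, all-Fair example from the next slide:
x, pi = list('2156123623'), ['F'] * 10
a, e = estimate_from_labeled(x, pi, states='FL', alphabet='123456')
print(a['F'])   # {'F': 1.0, 'L': 0.0}   <- overfitting: a zero probability
print(e['F'])   # e_F(2) = 0.3, e_F(4) = 0.0, e_F(6) = 0.2, ...
```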
24. 1. When the right answer is known
- Intuition: when we know the underlying states,
- the best estimate is the average frequency of transitions & emissions that occur in the training data
- Drawback:
- Given little data, there may be overfitting:
- P(x | θ) is maximized, but θ is unreasonable
- 0 probabilities: VERY BAD
- Example:
- Given 10 casino rolls, we observe
- x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
- π = F, F, F, F, F, F, F, F, F, F
- Then:
- aFF = 1; aFL = 0
- eF(1) = eF(3) = eF(6) = .2
- eF(2) = .3; eF(4) = 0; eF(5) = .1
25. Pseudocounts
- Solution for small training sets:
- Add pseudocounts
- Akl = # of times the k→l transition occurs in π, + rkl
- Ek(b) = # of times state k in π emits b in x, + rk(b)
- rkl, rk(b) are pseudocounts representing our prior belief
- Larger pseudocounts ⇒ strong prior belief
- Small pseudocounts (ε < 1): just to avoid 0 probabilities
26. Pseudocounts
- Example: dishonest casino
- We will observe the player for one day, 500 rolls
- Reasonable pseudocounts:
- r0F = r0L = rF0 = rL0 = 1
- rFL = rLF = rFF = rLL = 1
- rF(1) = rF(2) = … = rF(6) = 20 (strong belief that fair is fair)
- rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)
- The above are pretty arbitrary; assigning priors is an art
27. 2. When the right answer is unknown
- We don't know the true Akl, Ek(b)
- Idea:
- We estimate our "best guess" of what Akl, Ek(b) are
- We update the parameters of the model, based on our guess
- We repeat
28. 2. When the right answer is unknown
- Starting with our best guess of a model M, parameters θ:
- Given x = x1…xN
- for which the true π = π1…πN is unknown,
- We can get to a provably more likely parameter set θ
- Principle: EXPECTATION MAXIMIZATION
- 1. Estimate Akl, Ek(b) in the training data
- 2. Update θ according to Akl, Ek(b)
- 3. Repeat 1 & 2, until convergence
29. Estimating new parameters
- To estimate Akl:
- At each position i of sequence x,
- find the probability that transition k→l is used:
P(πi = k, πi+1 = l | x) = (1/P(x)) · P(πi = k, πi+1 = l, x1…xN) = Q/P(x)
where Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
= P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
= P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
= P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
= bl(i+1) el(xi+1) akl fk(i)

So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
30. Estimating new parameters
- So,
Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
- Similarly,
Ek(b) = (1/P(x)) Σ_{i : xi = b} fk(i) bk(i)
31. Estimating new parameters
- If we have several training sequences x1, …, xM, each of length N, sum over the sequences as well (with fk, bl computed on each xj):
Akl = Σj Σi P(πi = k, πi+1 = l | xj, θ) = Σj Σi fk(i) akl el(xi+1) bl(i+1) / P(xj | θ)
- Similarly,
Ek(b) = Σj (1/P(xj)) Σ_{i : xi = b} fk(i) bk(i)
32. The Baum-Welch Algorithm
- Initialization:
- Pick the best-guess model parameters
- (or arbitrary)
- Iteration:
- Forward
- Backward
- Calculate Akl, Ek(b)
- Calculate new model parameters akl, ek(b)
- Calculate new log-likelihood P(x | θ)
- GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION
- Until P(x | θ) does not change much
(an end-to-end code sketch follows this slide)
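A minimal end-to-end Baum-Welch sketch for a single sequence, using the expected-count formulas of slides 29-30. The simplifying assumptions are mine: no explicit end state (so P(x) = Σk fk(N)), the initial distribution a0 is not re-estimated, plain probability space (fine for short sequences; use scaling or logs otherwise), and small pseudocounts keep parameters away from 0. x is a list of symbol indices; names are my own.

```python
import numpy as np

def baum_welch(x, a0, trans, emit, n_iter=100, tol=1e-6, pseudo=1e-3):
    """Baum-Welch (EM) re-estimation of trans and emit from one sequence x."""
    N = len(x)
    K, B = emit.shape
    prev_ll = -np.inf

    for _ in range(n_iter):
        # Forward / backward (probability space)
        f = np.zeros((K, N)); b = np.zeros((K, N))
        f[:, 0] = a0 * emit[:, x[0]]
        for i in range(1, N):
            f[:, i] = emit[:, x[i]] * (f[:, i - 1] @ trans)
        b[:, N - 1] = 1.0
        for i in range(N - 2, -1, -1):
            b[:, i] = trans @ (emit[:, x[i + 1]] * b[:, i + 1])
        px = f[:, -1].sum()

        # E-step: expected counts A_kl and E_k(b), plus pseudocounts
        A = np.full((K, K), pseudo)
        E = np.full((K, B), pseudo)
        for i in range(N - 1):
            # A_kl += f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
            A += np.outer(f[:, i], emit[:, x[i + 1]] * b[:, i + 1]) * trans / px
        for i in range(N):
            E[:, x[i]] += f[:, i] * b[:, i] / px       # E_k(b) += f_k(i) b_k(i) / P(x)

        # M-step: normalize the expected counts
        trans = A / A.sum(axis=1, keepdims=True)
        emit = E / E.sum(axis=1, keepdims=True)

        ll = np.log(px)
        if ll - prev_ll < tol:      # log-likelihood does not decrease across iterations
            break
        prev_ll = ll
    return trans, emit, ll
```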
33. The Baum-Welch Algorithm: comments
- Time complexity:
- # iterations × O(K²N)
- Guaranteed to increase the log likelihood of the model
- P(θ | x) = P(x, θ) / P(x) = P(x | θ) P(θ) / P(x)
- Not guaranteed to find globally best parameters
- Converges to a local optimum, depending on initial conditions
- Too many parameters / too large a model: overtraining
34. Alternative: Viterbi Training
- Initialization: same
- Iteration:
- Perform Viterbi, to find π*
- Calculate Akl, Ek(b) according to π* + pseudocounts
- Calculate the new parameters akl, ek(b)
- Until convergence
- Notes:
- Convergence is guaranteed. Why?
- Does not maximize P(x | θ)
- In general, worse performance than Baum-Welch
- Convenient when we are only interested in the Viterbi parsing; no need to implement additional procedures (Forward, Backward)!
(a minimal code sketch follows this slide)
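A minimal Viterbi-training sketch under the same assumptions as the Baum-Welch sketch above (single sequence of symbol indices, no end state, my own names). Note the pseudocounts in the counting step, and that convergence is detected when the Viterbi path stops changing.

```python
import numpy as np

def viterbi_training(x, a0, trans, emit, n_iter=20, pseudo=1.0):
    """Viterbi training: re-estimate from counts along the current Viterbi path."""
    N = len(x)
    K, B = emit.shape
    prev_path = None

    for _ in range(n_iter):
        # (1) Viterbi in log space (avoid zeros in the initial parameters, as slide 35 advises)
        V = np.zeros((K, N))
        ptr = np.zeros((K, N), dtype=int)
        V[:, 0] = np.log(a0) + np.log(emit[:, x[0]])
        for i in range(1, N):
            scores = V[:, i - 1][:, None] + np.log(trans)   # scores[k, l] = V_k(i-1) + log a_kl
            ptr[:, i] = scores.argmax(axis=0)
            V[:, i] = np.log(emit[:, x[i]]) + scores.max(axis=0)
        path = [int(V[:, -1].argmax())]
        for i in range(N - 1, 0, -1):
            path.append(int(ptr[path[-1], i]))
        path.reverse()

        # (2) Count transitions/emissions along the Viterbi path, with pseudocounts
        A = np.full((K, K), pseudo)
        E = np.full((K, B), pseudo)
        for i in range(N - 1):
            A[path[i], path[i + 1]] += 1
        for i in range(N):
            E[path[i], x[i]] += 1

        # (3) Re-normalize to get the new parameters
        trans = A / A.sum(axis=1, keepdims=True)
        emit = E / E.sum(axis=1, keepdims=True)

        if path == prev_path:      # path unchanged => parameters unchanged => converged
            break
        prev_path = path
    return trans, emit, path
```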
35. Exercise (submit any time; groups of up to 3)
- Implement an HMM for the dishonest casino (or any other simple process you feel like)
- Generate training sequences with the model
- Implement Baum-Welch and Viterbi training
- Show a few sets of initial parameters such that
- Baum-Welch and Viterbi training differ significantly, and/or
- Baum-Welch converges to parameters close to the model, or to unreasonable parameters, depending on the initial parameters
- Do not use 0-probability transitions
- Do not use 0s in the initial parameters
- Do use pseudocounts in Viterbi training
- This exercise will replace the 1-3 lowest-scoring problems, depending on thoroughness