Title: Hidden Markov Models
1. Hidden Markov Models
Lecture 6, Thursday April 17, 2003
2. Review of Last Lecture
3. Decoding
- GIVEN x = x1 x2 … xN
- We want to find π = π1, …, πN
- such that P(x, π) is maximized
- π* = argmax_π P(x, π)
- We can use dynamic programming!
- Let Vk(i) = max_{π1, …, πi-1} P(x1…xi-1, π1, …, πi-1, xi, πi = k)
- Probability of the most likely sequence of states ending at state πi = k
4. The Viterbi Algorithm
[Figure: the dynamic-programming matrix Vj(i), with states 1, 2, …, K as rows and positions x1 x2 x3 … xN as columns.]
- Similar to aligning a set of states to a sequence (a minimal code sketch follows this slide)
- Time: O(K²N)
- Space: O(KN)
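A minimal sketch of this recursion in Python/NumPy (not from the lecture; the names viterbi, init, trans, emit are mine). It works directly in probability space, which mirrors Vk(i) above but underflows on long sequences; slide 10 discusses the log-space fix.

```python
import numpy as np

def viterbi(x, init, trans, emit):
    """Most likely state path for observations x (probability space).

    init[k]    : P(pi_1 = k)                shape (K,)
    trans[k,l] : a_kl = P(next = l | k)     shape (K, K)
    emit[k,b]  : e_k(b)                     shape (K, B)
    x          : list of symbol indices 0..B-1
    Returns (best state path, P(x, pi*)).
    """
    N, K = len(x), len(init)
    V = np.zeros((K, N))                # V[k, i] = prob. of best path ending in state k at position i
    ptr = np.zeros((K, N), dtype=int)   # back-pointers for the traceback

    V[:, 0] = init * emit[:, x[0]]
    for i in range(1, N):
        for l in range(K):
            scores = V[:, i - 1] * trans[:, l]      # V_k(i-1) * a_kl for all k
            ptr[l, i] = int(np.argmax(scores))
            V[l, i] = emit[l, x[i]] * scores[ptr[l, i]]

    path = [int(np.argmax(V[:, -1]))]   # best final state
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[path[-1], i]))
    path.reverse()
    return path, float(V[path[-1], -1])
```

For the dishonest casino, x would be the observed die faces (as indices) and init/trans/emit the Fair/Loaded parameters.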
5. Evaluation
- Compute:
- P(x): probability of x given the model
- P(xi…xj): probability of a substring of x given the model
- P(πi = k | x): probability that the i-th state is k, given x
6. The Forward Algorithm
- We can compute fk(i) for all k, i, using dynamic programming! (a minimal code sketch follows this slide)
- Initialization:
- f0(0) = 1
- fk(0) = 0, for all k > 0
- Iteration:
- fl(i) = el(xi) Σk fk(i-1) akl
- Termination:
- P(x) = Σk fk(N) ak0
- where ak0 is the probability that the terminating state is k (usually taken to be a0k)
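A sketch of the same recursion, assuming a start state 0 with transitions a0 into the K real states and end transitions a_end (my own argument names); the slide's initialization fk(0) is folded into the first column.

```python
import numpy as np

def forward(x, a0, trans, a_end, emit):
    """Forward algorithm, following the slide's recursion (probability space).

    a0[k]    : a_0k, transition from the start state 0 into state k
    trans    : a_kl among the K real states, shape (K, K)
    a_end[k] : a_k0, transition from state k into the end state
    emit     : e_k(b), shape (K, B); x is a list of symbol indices
    Returns (f, P(x)) where f[k, i] = f_k(i).
    """
    N, K = len(x), len(a0)
    f = np.zeros((K, N))
    f[:, 0] = a0 * emit[:, x[0]]                        # f_k(1) = a_0k e_k(x1)
    for i in range(1, N):
        f[:, i] = emit[:, x[i]] * (f[:, i - 1] @ trans) # f_l(i) = e_l(xi) sum_k f_k(i-1) a_kl
    px = float(np.sum(f[:, -1] * a_end))                # P(x) = sum_k f_k(N) a_k0
    return f, px
```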
7. The Backward Algorithm
- We can compute bk(i) for all k, i, using dynamic programming (a minimal code sketch follows this slide)
- Initialization:
- bk(N) = ak0, for all k
- Iteration:
- bk(i) = Σl el(xi+1) akl bl(i+1)
- Termination:
- P(x) = Σl a0l el(x1) bl(1)
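A matching backward sketch under the same assumptions (start distribution a0, end transitions a_end, my own names); the termination line reproduces P(x) = Σl a0l el(x1) bl(1), which is a handy consistency check against the forward value.

```python
import numpy as np

def backward(x, a0, trans, a_end, emit):
    """Backward algorithm matching the slide's recursion.

    Returns (b, P(x)) where b[k, i] holds b_k(i+1) in 0-based indexing.
    """
    N, K = len(x), len(a0)
    b = np.zeros((K, N))
    b[:, N - 1] = a_end                                # b_k(N) = a_k0
    for i in range(N - 2, -1, -1):
        # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
        b[:, i] = trans @ (emit[:, x[i + 1]] * b[:, i + 1])
    px = float(np.sum(a0 * emit[:, x[0]] * b[:, 0]))   # P(x) = sum_l a_0l e_l(x1) b_l(1)
    return b, px
```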
8. Posterior Decoding
- We can now calculate
- P(πi = k | x) = fk(i) bk(i) / P(x)
- Then, we can ask:
- What is the most likely state at position i of sequence x?
- Define π̂ by Posterior Decoding:
- π̂i = argmaxk P(πi = k | x)
(a minimal code sketch follows this slide)
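A posterior-decoding sketch under the same assumptions as the forward/backward sketches above (my own names). It recomputes f and b inline so it runs on its own, and returns both the argmax states and the full posterior matrix.

```python
import numpy as np

def posterior_decode(x, a0, trans, a_end, emit):
    """Posterior decoding: pi_hat_i = argmax_k P(pi_i = k | x)."""
    N, K = len(x), len(a0)
    f = np.zeros((K, N)); b = np.zeros((K, N))
    f[:, 0] = a0 * emit[:, x[0]]
    for i in range(1, N):
        f[:, i] = emit[:, x[i]] * (f[:, i - 1] @ trans)
    b[:, N - 1] = a_end
    for i in range(N - 2, -1, -1):
        b[:, i] = trans @ (emit[:, x[i + 1]] * b[:, i + 1])
    px = np.sum(f[:, -1] * a_end)
    posterior = f * b / px            # posterior[k, i] = P(pi_i = k | x)
    return posterior.argmax(axis=0), posterior
```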
9. Today
- Example: CpG Islands
- Learning
10. Implementation Techniques
- Viterbi: sum of logs
- Vl(i) = log el(xi) + maxk [Vk(i-1) + log akl]
- Forward/Backward: scaling by c(i)
- One way to perform scaling (sketched in code below):
- fl(i) = c(i) × [el(xi) Σk fk(i-1) akl]
- where c(i) = 1/(Σk fk(i))
- bl(i): use the same factors c(i)
- Details in Rabiner's Tutorial on HMMs, 1989
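A sketch of the scaling idea for the forward values (my own naming; end-state transitions omitted for brevity): each column is rescaled to sum to 1, and log P(x) is accumulated from the scaling factors, so nothing underflows.

```python
import numpy as np

def forward_scaled(x, a0, trans, emit):
    """Forward algorithm with per-position scaling, as on the slide.

    Each column is rescaled by c(i) = 1 / sum_k f_k(i); log P(x) is then
    recovered as the sum of the log column totals (= -sum_i log c(i)).
    """
    N, K = len(x), len(a0)
    f_hat = np.zeros((K, N))            # scaled forward values
    log_px = 0.0
    col = a0 * emit[:, x[0]]
    for i in range(N):
        if i > 0:
            col = emit[:, x[i]] * (f_hat[:, i - 1] @ trans)
        total = col.sum()               # this is 1 / c(i)
        f_hat[:, i] = col / total
        log_px += np.log(total)
    return f_hat, log_px
```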
11. A Modeling Example
- CpG islands in DNA sequences
12. Example: CpG Islands
CpG dinucleotides in the genome are frequently methylated. (We write CpG to avoid confusion with a C·G base pair.) C → methyl-C → T. Methylation is often suppressed around genes and promoters → CpG islands.
13. Example: CpG Islands
- In CpG islands,
- CG is more frequent
- Other pairs (AA, AG, AT) have different frequencies
- Question: detect CpG islands computationally
14. A Model of CpG Islands (1): Architecture
[Figure: eight-state HMM. States A+, C+, G+, T+ model "CpG island"; states A-, C-, G-, T- model "not CpG island"; transitions are allowed within each group and between the two groups.]
15. A Model of CpG Islands (2): Transitions
- How do we estimate the parameters of the model?
- Emission probabilities: 1/0
- Transition probabilities within CpG islands:
- established from many known (experimentally verified) CpG islands (training set)
- Transition probabilities within other regions:
- established from many known non-CpG islands

Within CpG islands (+):
      A      C      G      T
A   .180   .274   .426   .120
C   .171   .368   .274   .188
G   .161   .339   .375   .125
T   .079   .355   .384   .182

Outside CpG islands (-):
      A      C      G      T
A   .300   .205   .285   .210
C   .233   .298   .078   .302
G   .248   .246   .298   .208
T   .177   .239   .292   .292
16. Parenthesis: log likelihoods
A better way to see the effect of transitions: log likelihoods
L(u, v) = log [ P(uv | +) / P(uv | -) ]
Given a region x = x1…xN, a quick-and-dirty way to decide whether the entire x is a CpG island:
P(x is CpG) > P(x is not CpG) ⇔ Σi L(xi, xi+1) > 0
(a small code sketch follows the table)

      A         C        G        T
A   -0.740    0.419    0.580   -0.803
C   -0.913    0.302    1.812   -0.685
G   -0.624    0.461    0.331   -0.730
T   -1.169    0.573    0.393   -0.679
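A tiny sketch of this quick-and-dirty test, with the table above copied into a dictionary (the names LLR and cpg_score are mine, not from the lecture):

```python
# Log-likelihood ratios L(u, v) from the table above.
LLR = {
    'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
}

def cpg_score(x):
    """Sum_i L(x_i, x_{i+1}); > 0 suggests the whole region looks like a CpG island."""
    return sum(LLR[u][v] for u, v in zip(x, x[1:]))

print(cpg_score("CGCGCGCG"))   # strongly positive: CpG-like
print(cpg_score("ATATATAT"))   # strongly negative: not CpG-like
```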
17. A Model of CpG Islands (2): Transitions
- What about transitions between (+) and (-) states?
- They affect:
- avg. length of a CpG island
- avg. separation between two CpG islands

Length distribution of region X:
P[lX = 1] = 1-p
P[lX = 2] = p(1-p)
…
P[lX = k] = p^(k-1) (1-p)
E[lX] = 1/(1-p)
Exponential (geometric) length distribution, with mean 1/(1-p)
[Figure: two-state length model. State X has self-transition probability p and moves to Y with probability 1-p; state Y has self-transition probability q and moves back to X with probability 1-q.]
18. A Model of CpG Islands (2): Transitions
- No reason to favor exiting/entering the (+) and (-) regions at a particular nucleotide
- To determine the transition probabilities between (+) and (-) states:
- Estimate the average length of a CpG island: lCpG = 1/(1-p) ⇒ p = 1 - 1/lCpG
- For each pair of (+) states k, l, rescale: akl ← p · akl
- For each (+) state k and (-) state l, let akl = (1-p)/4
- (better: take the frequency of l in the (-) regions into account)
- Do the same for the (-) states
- A problem with this model: CpG islands don't have an exponential length distribution
- This is a defect of HMMs, compensated by ease of analysis & computation
(a tiny numeric sketch of this recipe follows)
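A tiny numeric sketch of the recipe above (the average island length of 1000 is purely illustrative, not a value from the lecture; variable names are mine):

```python
mean_island_len = 1000          # assumed average CpG-island length (illustrative)
p = 1 - 1 / mean_island_len     # probability of staying among the (+) states
a_plus_to_minus = (1 - p) / 4   # transition from each (+) state to each (-) state

print(p, a_plus_to_minus)       # 0.999, 0.00025
# Within the (+) sub-model, each existing a_kl is then rescaled: a_kl <- p * a_kl
```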
19. Applications of the Model
- Given a DNA region x,
- the Viterbi algorithm predicts locations of CpG islands
- Given a nucleotide xi (say xi = A),
- the Viterbi parse tells whether xi is in a CpG island in the most likely general scenario
- The Forward/Backward algorithms can calculate
- P(xi is in a CpG island) = P(πi = A+ | x)
- Posterior decoding can assign locally optimal predictions of CpG islands
- π̂i = argmaxk P(πi = k | x)
20. What if a new genome comes?
- We just sequenced the porcupine genome
- We know CpG islands play the same role in this genome
- However, we have no known CpG islands for porcupines
- We suspect the frequency and characteristics of CpG islands are quite different in porcupines
- How do we adjust the parameters in our model?
- LEARNING
21. Problem 3: Learning
- Re-estimate the parameters of the model based on training data
22. Two learning scenarios
- Estimation when the "right answer" is known
- Examples:
- GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
- GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
- Estimation when the "right answer" is unknown
- Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
- QUESTION: update the parameters θ of the model to maximize P(x | θ)
23. 1. When the right answer is known
- Given x = x1…xN
- for which the true π = π1…πN is known,
- Define:
- Akl = # of times the k→l transition occurs in π
- Ek(b) = # of times state k in π emits b in x
- We can show that the maximum likelihood parameters θ are:
- akl = Akl / Σi Aki
- ek(b) = Ek(b) / Σc Ek(c)
(a minimal counting sketch follows this slide)
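A minimal counting sketch of these formulas (my own function and argument names). The optional r_trans/r_emit arguments implement the pseudocounts of slide 25, and the example at the bottom reproduces the overfitting of the 10-roll example on the next slide.

```python
from collections import Counter

def normalize(counts, keys, pseudo=0.0):
    """Turn raw counts into probabilities over `keys`, with optional pseudocounts."""
    totals = {k: counts.get(k, 0) + pseudo for k in keys}
    z = sum(totals.values())
    return {k: (totals[k] / z if z > 0 else 0.0) for k in keys}

def estimate_from_labeled(x, pi, states, alphabet, r_trans=0.0, r_emit=0.0):
    """Maximum-likelihood estimation when the state path pi is known (slide 23)."""
    A = Counter(zip(pi, pi[1:]))   # A_kl = # of k->l transitions in pi
    E = Counter(zip(pi, x))        # E_k(b) = # of times state k emits b
    a = {k: normalize({l: A[(k, l)] for l in states}, states, r_trans) for k in states}
    e = {k: normalize({b: E[(k, b)] for b in alphabet}, alphabet, r_emit) for k in states}
    return a, e

# The 10-roll, all-Fair example from the next slide:
x, pi = list('2156123623'), ['F'] * 10
a, e = estimate_from_labeled(x, pi, states='FL', alphabet='123456')
print(a['F'])   # {'F': 1.0, 'L': 0.0}   <- overfitting: a zero probability
print(e['F'])   # e_F(2) = 0.3, e_F(4) = 0.0, e_F(6) = 0.2, ...
```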
24. 1. When the right answer is known
- Intuition: when we know the underlying states,
- the best estimate is the average frequency of transitions & emissions that occur in the training data
- Drawback:
- Given little data, there may be overfitting:
- P(x | θ) is maximized, but θ is unreasonable
- 0 probabilities: VERY BAD
- Example:
- Given 10 casino rolls, we observe
- x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
- π = F, F, F, F, F, F, F, F, F, F
- Then:
- aFF = 1; aFL = 0
- eF(1) = eF(3) = eF(6) = .2
- eF(2) = .3; eF(4) = 0; eF(5) = .1
25. Pseudocounts
- Solution for small training sets:
- Add pseudocounts
- Akl = # of times the k→l transition occurs in π, + rkl
- Ek(b) = # of times state k in π emits b in x, + rk(b)
- rkl, rk(b) are pseudocounts representing our prior belief
- Larger pseudocounts ⇒ strong prior belief
- Small pseudocounts (ε < 1): just to avoid 0 probabilities
26. Pseudocounts
- Example: dishonest casino
- We will observe the player for one day, 500 rolls
- Reasonable pseudocounts:
- r0F = r0L = rF0 = rL0 = 1
- rFL = rLF = rFF = rLL = 1
- rF(1) = rF(2) = … = rF(6) = 20 (strong belief that fair is fair)
- rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)
- The above are pretty arbitrary; assigning priors is an art
27. 2. When the right answer is unknown
- We don't know the true Akl, Ek(b)
- Idea:
- We estimate our "best guess" of what Akl, Ek(b) are
- We update the parameters of the model, based on our guess
- We repeat
28. 2. When the right answer is unknown
- Starting with our best guess of a model M, parameters θ:
- Given x = x1…xN
- for which the true π = π1…πN is unknown,
- We can get to a provably more likely parameter set θ
- Principle: EXPECTATION MAXIMIZATION
- 1. Estimate Akl, Ek(b) in the training data
- 2. Update θ according to Akl, Ek(b)
- 3. Repeat 1 & 2, until convergence
29. Estimating new parameters
- To estimate Akl:
- At each position i of sequence x,
- find the probability that transition k→l is used:
P(πi = k, πi+1 = l | x) = (1/P(x)) · P(πi = k, πi+1 = l, x1…xN) = Q/P(x)
where Q = P(x1…xi, πi = k, πi+1 = l, xi+1…xN)
= P(πi+1 = l, xi+1…xN | πi = k) P(x1…xi, πi = k)
= P(πi+1 = l, xi+1 xi+2…xN | πi = k) fk(i)
= P(xi+2…xN | πi+1 = l) P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) fk(i)
= bl(i+1) el(xi+1) akl fk(i)

So: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
30. Estimating new parameters
- So,
Akl = Σi P(πi = k, πi+1 = l | x, θ) = Σi fk(i) akl el(xi+1) bl(i+1) / P(x | θ)
- Similarly,
Ek(b) = (1/P(x)) Σ_{i : xi = b} fk(i) bk(i)
31. Estimating new parameters
- If we have several training sequences x1, …, xM, each of length N, sum over the sequences as well (with fk, bl computed on each xj):
Akl = Σj Σi P(πi = k, πi+1 = l | xj, θ) = Σj Σi fk(i) akl el(xi+1) bl(i+1) / P(xj | θ)
- Similarly,
Ek(b) = Σj (1/P(xj)) Σ_{i : xi = b} fk(i) bk(i)
32. The Baum-Welch Algorithm
- Initialization:
- Pick the best-guess model parameters
- (or arbitrary)
- Iteration:
- Forward
- Backward
- Calculate Akl, Ek(b)
- Calculate new model parameters akl, ek(b)
- Calculate new log-likelihood P(x | θ)
- GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION
- Until P(x | θ) does not change much
(an end-to-end code sketch follows this slide)
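A minimal end-to-end Baum-Welch sketch for a single sequence, using the expected-count formulas of slides 29-30. The simplifying assumptions are mine: no explicit end state (so P(x) = Σk fk(N)), the initial distribution a0 is not re-estimated, plain probability space (fine for short sequences; use scaling or logs otherwise), and small pseudocounts keep parameters away from 0. x is a list of symbol indices; names are my own.

```python
import numpy as np

def baum_welch(x, a0, trans, emit, n_iter=100, tol=1e-6, pseudo=1e-3):
    """Baum-Welch (EM) re-estimation of trans and emit from one sequence x."""
    N = len(x)
    K, B = emit.shape
    prev_ll = -np.inf

    for _ in range(n_iter):
        # Forward / backward (probability space)
        f = np.zeros((K, N)); b = np.zeros((K, N))
        f[:, 0] = a0 * emit[:, x[0]]
        for i in range(1, N):
            f[:, i] = emit[:, x[i]] * (f[:, i - 1] @ trans)
        b[:, N - 1] = 1.0
        for i in range(N - 2, -1, -1):
            b[:, i] = trans @ (emit[:, x[i + 1]] * b[:, i + 1])
        px = f[:, -1].sum()

        # E-step: expected counts A_kl and E_k(b), plus pseudocounts
        A = np.full((K, K), pseudo)
        E = np.full((K, B), pseudo)
        for i in range(N - 1):
            # A_kl += f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
            A += np.outer(f[:, i], emit[:, x[i + 1]] * b[:, i + 1]) * trans / px
        for i in range(N):
            E[:, x[i]] += f[:, i] * b[:, i] / px       # E_k(b) += f_k(i) b_k(i) / P(x)

        # M-step: normalize the expected counts
        trans = A / A.sum(axis=1, keepdims=True)
        emit = E / E.sum(axis=1, keepdims=True)

        ll = np.log(px)
        if ll - prev_ll < tol:      # log-likelihood does not decrease across iterations
            break
        prev_ll = ll
    return trans, emit, ll
```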
33. The Baum-Welch Algorithm: comments
- Time complexity:
- # iterations × O(K²N)
- Guaranteed to increase the log likelihood of the model
- P(θ | x) = P(x, θ) / P(x) = P(x | θ) P(θ) / P(x)
- Not guaranteed to find globally best parameters
- Converges to a local optimum, depending on initial conditions
- Too many parameters / too large a model: overtraining
34. Alternative: Viterbi Training
- Initialization: same
- Iteration:
- Perform Viterbi, to find π*
- Calculate Akl, Ek(b) according to π* + pseudocounts
- Calculate the new parameters akl, ek(b)
- Until convergence
- Notes:
- Convergence is guaranteed. Why?
- Does not maximize P(x | θ)
- In general, worse performance than Baum-Welch
- Convenient when we are only interested in the Viterbi parsing; no need to implement additional procedures (Forward, Backward)!
(a minimal code sketch follows this slide)
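A minimal Viterbi-training sketch under the same assumptions as the Baum-Welch sketch above (single sequence of symbol indices, no end state, my own names). Note the pseudocounts in the counting step, and that convergence is detected when the Viterbi path stops changing.

```python
import numpy as np

def viterbi_training(x, a0, trans, emit, n_iter=20, pseudo=1.0):
    """Viterbi training: re-estimate from counts along the current Viterbi path."""
    N = len(x)
    K, B = emit.shape
    prev_path = None

    for _ in range(n_iter):
        # (1) Viterbi in log space (avoid zeros in the initial parameters, as slide 35 advises)
        V = np.zeros((K, N))
        ptr = np.zeros((K, N), dtype=int)
        V[:, 0] = np.log(a0) + np.log(emit[:, x[0]])
        for i in range(1, N):
            scores = V[:, i - 1][:, None] + np.log(trans)   # scores[k, l] = V_k(i-1) + log a_kl
            ptr[:, i] = scores.argmax(axis=0)
            V[:, i] = np.log(emit[:, x[i]]) + scores.max(axis=0)
        path = [int(V[:, -1].argmax())]
        for i in range(N - 1, 0, -1):
            path.append(int(ptr[path[-1], i]))
        path.reverse()

        # (2) Count transitions/emissions along the Viterbi path, with pseudocounts
        A = np.full((K, K), pseudo)
        E = np.full((K, B), pseudo)
        for i in range(N - 1):
            A[path[i], path[i + 1]] += 1
        for i in range(N):
            E[path[i], x[i]] += 1

        # (3) Re-normalize to get the new parameters
        trans = A / A.sum(axis=1, keepdims=True)
        emit = E / E.sum(axis=1, keepdims=True)

        if path == prev_path:      # path unchanged => parameters unchanged => converged
            break
        prev_path = path
    return trans, emit, path
```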
35. Exercise (submit any time; groups of up to 3)
- Implement an HMM for the dishonest casino (or any other simple process you feel like)
- Generate training sequences with the model
- Implement Baum-Welch and Viterbi training
- Show a few sets of initial parameters such that
- Baum-Welch and Viterbi training differ significantly, and/or
- Baum-Welch converges to parameters close to the model, or to unreasonable parameters, depending on the initial parameters
- Do not use 0-probability transitions
- Do not use 0s in the initial parameters
- Do use pseudocounts in Viterbi training
- This exercise will replace the 1-3 lowest-scoring problems, depending on thoroughness