Title: Hidden Markov Models
1. Hidden Markov Models
Slides by Bill Majoros, based on Methods for Computational Gene Prediction
2. What is an HMM?

An HMM is a stochastic machine M = (Q, Σ, Pt, Pe) consisting of the following:
- a finite set of states, Q = {q0, q1, ..., qm}
- a finite alphabet Σ = {s0, s1, ..., sn}
- a transition distribution Pt : Q × Q → ℝ, i.e., Pt(qj | qi)
- an emission distribution Pe : Q × Σ → ℝ, i.e., Pe(sj | qi)
An Example

M1 = ({q0, q1, q2}, {Y, R}, Pt, Pe)
Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}
Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}

[State diagram: q0 → q1 (100%); q1 → q1 (80%), q1 → q2 (15%), q1 → q0 (5%); q2 → q2 (70%), q2 → q1 (30%). State q1 emits Y 100% of the time; state q2 emits R 100% of the time.]
3. Probability of a Sequence

P(YRYRY | M1) = a0→1 · b1,Y · a1→2 · b2,R · a2→1 · b1,Y · a1→2 · b2,R · a2→1 · b1,Y · a1→0
             = 1 × 1 × 0.15 × 1 × 0.3 × 1 × 0.15 × 1 × 0.3 × 1 × 0.05
             = 0.00010125

(where ai→j = Pt(qj | qi) and bi,s = Pe(s | qi))
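To make this concrete, here is a minimal Python sketch (mine, not from the slides) that stores M1's distributions as dictionaries and multiplies out the transitions and emissions along the path:

```python
# Model M1 as nested dicts: Pt[i][j] = P(next state j | state i),
# Pe[i][s] = P(emit symbol s | state i). State 0 is the silent q0.
Pt = {0: {1: 1.0},
      1: {1: 0.80, 2: 0.15, 0: 0.05},
      2: {2: 0.70, 1: 0.30}}
Pe = {1: {'Y': 1.0, 'R': 0.0},
      2: {'Y': 0.0, 'R': 1.0}}

def path_probability(seq, path):
    """P(seq, path | M1) for one state path through the emitting states."""
    p = Pt[0][path[0]]                  # leave q0 into the first state
    for k, (s, q) in enumerate(zip(seq, path)):
        p *= Pe[q][s]                   # emit symbol s in state q
        if k + 1 < len(path):
            p *= Pt[q][path[k + 1]]     # transition to the next state
    return p * Pt[path[-1]][0]          # return to q0 at the end

# The only path that can emit YRYRY is q1 q2 q1 q2 q1:
print(path_probability("YRYRY", [1, 2, 1, 2, 1]))   # 0.00010125
```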
4. Another Example

M2 = (Q, Σ, Pt, Pe), where Q = {q0, q1, q2, q3, q4} and Σ = {A, C, G, T}

[State diagram: five states connected by transitions with percentages 100, 80, 65, 50, 50, 35, and 20, and a per-state emission table over {A, C, G, T} for each emitting state: A=35% T=25% C=15% G=25%; A=10% T=30% C=40% G=20%; A=27% T=14% C=22% G=37%; A=11% T=17% C=43% G=29%.]
5. Finding the Most Probable Path

Example: C A T T A A T A G

[Same M2 state diagram as above; two candidate paths through the model are compared: top path 7.0×10⁻⁷, bottom path 2.8×10⁻⁹.]

The most probable path is:

States:   1 2 2 2 2 2 2 2 4
Sequence: C A T T A A T A G

resulting in this parse:

feature 1: C
feature 2: ATTAATA
feature 3: G
6. The Viterbi Algorithm

[Trellis diagram: states (rows) versus sequence positions ..., k−2, k−1, k, k+1, ... (columns); cell (i, k) holds the score of the best path ending in state i at position k.]
7. Viterbi Traceback

The optimal path is recovered by repeatedly applying the traceback pointer T, starting from the best final state i at the last position of the sequence:

T(T(T(...T(T(i, L−1), L−2)..., 2), 1), 0)
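The following Python sketch (my own illustration, not the textbook's code) implements both the fill step and the traceback on model M1 from the earlier slides:

```python
import math

# State 0 (q0) is the silent start/stop state; 1 and 2 are emitting states.
Pt = {0: {1: 1.0},
      1: {1: 0.80, 2: 0.15, 0: 0.05},
      2: {2: 0.70, 1: 0.30}}
Pe = {1: {'Y': 1.0, 'R': 0.0},
      2: {'Y': 0.0, 'R': 1.0}}
STATES = [1, 2]

def log(p):                      # log(0) -> -inf instead of a domain error
    return math.log(p) if p > 0 else float('-inf')

def viterbi(seq):
    L = len(seq)
    V = [{} for _ in range(L)]   # V[k][i]: best log-prob ending in state i at k
    T = [{} for _ in range(L)]   # T[k][i]: predecessor of state i at position k
    for i in STATES:             # initialization: enter from q0, emit seq[0]
        V[0][i] = log(Pt[0].get(i, 0)) + log(Pe[i][seq[0]])
    for k in range(1, L):        # recurrence: maximize over predecessors j
        for i in STATES:
            j = max(STATES, key=lambda j: V[k-1][j] + log(Pt[j].get(i, 0)))
            T[k][i] = j
            V[k][i] = V[k-1][j] + log(Pt[j].get(i, 0)) + log(Pe[i][seq[k]])
    # termination: the best final state must be able to return to q0 ...
    i = max(STATES, key=lambda i: V[L-1][i] + log(Pt[i].get(0, 0)))
    # ... then the nested traceback T(T(...T(i, L-1)..., 2), 1) is unrolled:
    path = [i]
    for k in range(L - 1, 0, -1):
        path.append(T[k][path[-1]])
    return path[::-1]

print(viterbi("YRYRY"))          # [1, 2, 1, 2, 1]
```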
8. The Forward Algorithm: Probability of a Sequence

Viterbi finds the single most probable path; the Forward algorithm instead sums over all paths, i.e., P(S | M) = Σφ P(S, φ | M), where φ ranges over all state paths that can emit S.

[Trellis diagram as before: cell (i, k) now accumulates the total probability, summed over all paths, of emitting the first k+1 symbols and ending in state i.]
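A minimal Forward sketch on M1 (my own illustration): the trellis is identical to Viterbi's, but each cell sums over predecessor states instead of maximizing.

```python
# (Real implementations rescale or work in log space to avoid underflow
# on long sequences; plain products suffice for this short example.)
Pt = {0: {1: 1.0},
      1: {1: 0.80, 2: 0.15, 0: 0.05},
      2: {2: 0.70, 1: 0.30}}
Pe = {1: {'Y': 1.0, 'R': 0.0},
      2: {'Y': 0.0, 'R': 1.0}}
STATES = [1, 2]

def forward(seq):
    """P(seq | M1), summed over all state paths."""
    F = {i: Pt[0].get(i, 0) * Pe[i][seq[0]] for i in STATES}
    for s in seq[1:]:
        F = {i: Pe[i][s] * sum(F[j] * Pt[j].get(i, 0) for j in STATES)
             for i in STATES}
    return sum(F[i] * Pt[i].get(0, 0) for i in STATES)   # exit to q0

print(forward("YRYRY"))   # 0.00010125 -- M1 is unambiguous, so this equals
                          # the probability of the single Viterbi path
```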
9. Training an HMM from Labeled Sequences

With labeled training data, the transition and emission distributions are estimated by maximum likelihood, i.e., as relative frequencies of the observed counts:

transitions: ai,j = count(qi → qj) / Σk count(qi → qk)
emissions:   bi,s = count(qi emits s) / Σs′ count(qi emits s′)

From the textbook, Ch. 6.3.
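A sketch of this counting approach in Python (function and variable names are mine, not the textbook's):

```python
from collections import Counter

def train(labeled_data):
    """labeled_data: iterable of (sequence, state_path) pairs; q0 is state 0."""
    t_counts, e_counts = Counter(), Counter()
    t_totals, e_totals = Counter(), Counter()
    for seq, path in labeled_data:
        full = [0] + list(path) + [0]        # every path starts and ends in q0
        for a, b in zip(full, full[1:]):     # count observed transitions
            t_counts[(a, b)] += 1
            t_totals[a] += 1
        for q, s in zip(path, seq):          # count observed emissions
            e_counts[(q, s)] += 1
            e_totals[q] += 1
    Pt = {k: n / t_totals[k[0]] for k, n in t_counts.items()}
    Pe = {k: n / e_totals[k[0]] for k, n in e_counts.items()}
    return Pt, Pe

Pt, Pe = train([("YRYRY", [1, 2, 1, 2, 1]), ("YYR", [1, 1, 2])])
print(Pt[(1, 2)], Pe[(1, 'Y')])              # 0.6 1.0
```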
10. Recall Eukaryotic Gene Structure

[Diagram: the complete mRNA and its coding segment, which runs from the start codon (ATG) to the stop codon (TGA). The gene consists of exons separated by introns; each intron begins with a donor site (GT) and ends with an acceptor site (AG).]
11. Using an HMM for Gene Prediction

[Diagram: the Markov model, with states for Intergenic, Start codon, Exon, Donor, Intron, Acceptor, and Stop codon, plus the silent state q0. The model is run on the input sequence:]

AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA

[The most probable path through the model yields the gene prediction: exon 1, exon 2, exon 3.]
12. Higher Order Markovian Eukaryotic Recognizer (HOMER)

[Overview of the HOMER model versions developed below: H3, H5, H17, H27, H77, H95.]
13. HOMER, version H3

[State diagram: Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon, and q0. I = intron state, E = exon state, N = intergenic state.]

Tested on 500 Arabidopsis genes.
14. Recall Sensitivity and Precision

NOTE: "specificity" as defined here and throughout these slides (and the text) is really precision.
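For reference, the standard definitions (not spelled out on the slide): sensitivity Sn = TP / (TP + FN), the fraction of true features that are predicted; precision (here labeled "specificity") = TP / (TP + FP), the fraction of predictions that are correct.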
15. HOMER, version H5

Three exon states, one for each of the three codon positions.
16. HOMER, version H17

Signal submodels: donor site, acceptor site, start codon, stop codon.
17. Maintaining Phase Across an Intron

phase:       01201201 (intron: no phase) 2012012012
sequence:    GTATGCGATAGTCAAGAGTGATCGCTAGACC
coordinates: 0    5    10   15   20   25   30

The codon-phase counter is suspended within the intron and resumes, in the next exon, where it left off.
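Since the intron contributes no codon positions, the bookkeeping reduces to one modular-arithmetic rule; a hypothetical sketch:

```python
# Hypothetical sketch of phase tracking across an intron: the intron adds
# no coding bases, so the counter resumes exactly where it left off.
def exit_phase(entry_phase, coding_length):
    """Codon phase after emitting coding_length exon bases."""
    return (entry_phase + coding_length) % 3

# First exon segment: 8 bases starting in phase 0 (phases 01201201),
# so the next exon must resume in phase 2, matching the figure above.
print(exit_phase(0, 8))   # 2
```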
18. HOMER, version H27

Three separate intron models, one for each of the three phases.
19. Recall Weight Matrices

- start codons: ATG
- stop codons: TGA, TAA, TAG
- donor splice sites: GT
- acceptor splice sites: AG
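As a sketch of how such a matrix is applied, the snippet below scores a window against a donor-site weight matrix; the probabilities are hypothetical except for the invariant GT core, and real matrices would be estimated from aligned training sites.

```python
import math

donor_pwm = [                     # one probability row per position
    {'A': 0.30, 'C': 0.30, 'G': 0.25, 'T': 0.15},
    {'A': 0.00, 'C': 0.00, 'G': 1.00, 'T': 0.00},  # the invariant G
    {'A': 0.00, 'C': 0.00, 'G': 0.00, 'T': 1.00},  # the invariant T
    {'A': 0.55, 'C': 0.05, 'G': 0.30, 'T': 0.10},
]

def pwm_log_score(window):
    """Sum of per-position log-probabilities; -inf if any base is disallowed."""
    total = 0.0
    for pos, base in zip(donor_pwm, window):
        p = pos[base]
        if p == 0.0:
            return float('-inf')
        total += math.log(p)
    return total

print(pwm_log_score("AGTA"))   # a plausible donor site
print(pwm_log_score("ACTA"))   # -inf: no G at the invariant position
```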
20. HOMER, version H77

Positional biases near splice sites.
21. HOMER, version H95

22. Summary of HOMER Results
23. Higher-order Markov Models

For the example context A C G C T A, the probability of emitting the G depends on progressively more of the preceding sequence:

- 0th order: P(G)
- 1st order: P(G | C)
- 2nd order: P(G | AC)
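A sketch of how such conditional probabilities could be estimated from training sequence (function and variable names are mine):

```python
from collections import Counter

# An order-n model conditions each base on the n preceding bases; the
# estimates are relative frequencies of (context, base) pairs.
def train_markov(seqs, order):
    ctx_counts, full_counts = Counter(), Counter()
    for seq in seqs:
        for k in range(order, len(seq)):
            ctx = seq[k - order:k]
            ctx_counts[ctx] += 1
            full_counts[(ctx, seq[k])] += 1
    # (unseen contexts would need smoothing in a real implementation)
    return lambda base, ctx: full_counts[(ctx, base)] / ctx_counts[ctx]

P2 = train_markov(["ACGCTAACGT"], order=2)
print(P2('G', 'AC'))   # P(G | AC) estimated from the training sequence
```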
24. Higher-order Markov Models

[Figure: results for Markov-model orders 0 through 5.]
25. Summary

- An HMM is a stochastic generative model which emits sequences.
- Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence.
- Training of unambiguous HMMs can be accomplished using labeled sequence training.
- Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm.
- Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...).