Title: Hidden Markov Models
1. Hidden Markov Models
Slides by Bill Majoros, based on Methods for Computational Gene Prediction
2. What is an HMM?

An HMM is a stochastic machine M = (Q, Σ, Pt, Pe) consisting of the following:
- a finite set of states, Q = {q0, q1, ..., qm}
- a finite alphabet Σ = {s0, s1, ..., sn}
- a transition distribution Pt : Q × Q → ℝ, i.e., Pt(qj | qi)
- an emission distribution Pe : Q × Σ → ℝ, i.e., Pe(sj | qi)
An Example

M1 = ({q0, q1, q2}, {Y, R}, Pt, Pe)
Pt = {(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}
Pe = {(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}

[State diagram: q0 → q1 (100%); q1 → q1 (80%), q1 → q2 (15%), q1 → q0 (5%); q2 → q2 (70%), q2 → q1 (30%). State q1 emits Y 100% of the time; state q2 emits R 100% of the time.]
3. Probability of a Sequence

P(YRYRY | M1) = a0→1 · b1,Y · a1→2 · b2,R · a2→1 · b1,Y · a1→2 · b2,R · a2→1 · b1,Y · a1→0
             = 1 × 1 × 0.15 × 1 × 0.3 × 1 × 0.15 × 1 × 0.3 × 1 × 0.05
             = 0.00010125

(where ai→j = Pt(qj | qi) and bi,s = Pe(s | qi))
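To make this concrete, here is a minimal Python sketch (mine, not from the slides) that stores M1's distributions as dictionaries and multiplies out the transitions and emissions along the path:

```python
# Model M1 as nested dicts: Pt[i][j] = P(next state j | state i),
# Pe[i][s] = P(emit symbol s | state i). State 0 is the silent q0.
Pt = {0: {1: 1.0},
      1: {1: 0.80, 2: 0.15, 0: 0.05},
      2: {2: 0.70, 1: 0.30}}
Pe = {1: {'Y': 1.0, 'R': 0.0},
      2: {'Y': 0.0, 'R': 1.0}}

def path_probability(seq, path):
    """P(seq, path | M1) for one state path through the emitting states."""
    p = Pt[0][path[0]]                  # leave q0 into the first state
    for k, (s, q) in enumerate(zip(seq, path)):
        p *= Pe[q][s]                   # emit symbol s in state q
        if k + 1 < len(path):
            p *= Pt[q][path[k + 1]]     # transition to the next state
    return p * Pt[path[-1]][0]          # return to q0 at the end

# The only path that can emit YRYRY is q1 q2 q1 q2 q1:
print(path_probability("YRYRY", [1, 2, 1, 2, 1]))   # 0.00010125
```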
4. Another Example

M2 = (Q, Σ, Pt, Pe), where Q = {q0, q1, q2, q3, q4} and Σ = {A, C, G, T}

[State diagram: five states connected by transitions with percentages 100, 80, 65, 50, 50, 35, and 20, and a per-state emission table over {A, C, G, T} for each emitting state: A=35% T=25% C=15% G=25%; A=10% T=30% C=40% G=20%; A=27% T=14% C=22% G=37%; A=11% T=17% C=43% G=29%.]
5. Finding the Most Probable Path

Example: C A T T A A T A G

[Same M2 state diagram as above; two candidate paths through the model are compared: top path 7.0×10⁻⁷, bottom path 2.8×10⁻⁹.]

The most probable path is:

States:   1 2 2 2 2 2 2 2 4
Sequence: C A T T A A T A G

resulting in this parse:

feature 1: C
feature 2: ATTAATA
feature 3: G
6. The Viterbi Algorithm

[Trellis diagram: states (rows) versus sequence positions ..., k−2, k−1, k, k+1, ... (columns); cell (i, k) holds the score of the best path ending in state i at position k.]
7. Viterbi Traceback

The optimal path is recovered by repeatedly applying the traceback pointer T, starting from the best final state i at the last position of the sequence:

T(T(T(...T(T(i, L−1), L−2)..., 2), 1), 0)
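The following Python sketch (my own illustration, not the textbook's code) implements both the fill step and the traceback on model M1 from the earlier slides:

```python
import math

# State 0 (q0) is the silent start/stop state; 1 and 2 are emitting states.
Pt = {0: {1: 1.0},
      1: {1: 0.80, 2: 0.15, 0: 0.05},
      2: {2: 0.70, 1: 0.30}}
Pe = {1: {'Y': 1.0, 'R': 0.0},
      2: {'Y': 0.0, 'R': 1.0}}
STATES = [1, 2]

def log(p):                      # log(0) -> -inf instead of a domain error
    return math.log(p) if p > 0 else float('-inf')

def viterbi(seq):
    L = len(seq)
    V = [{} for _ in range(L)]   # V[k][i]: best log-prob ending in state i at k
    T = [{} for _ in range(L)]   # T[k][i]: predecessor of state i at position k
    for i in STATES:             # initialization: enter from q0, emit seq[0]
        V[0][i] = log(Pt[0].get(i, 0)) + log(Pe[i][seq[0]])
    for k in range(1, L):        # recurrence: maximize over predecessors j
        for i in STATES:
            j = max(STATES, key=lambda j: V[k-1][j] + log(Pt[j].get(i, 0)))
            T[k][i] = j
            V[k][i] = V[k-1][j] + log(Pt[j].get(i, 0)) + log(Pe[i][seq[k]])
    # termination: the best final state must be able to return to q0 ...
    i = max(STATES, key=lambda i: V[L-1][i] + log(Pt[i].get(0, 0)))
    # ... then the nested traceback T(T(...T(i, L-1)..., 2), 1) is unrolled:
    path = [i]
    for k in range(L - 1, 0, -1):
        path.append(T[k][path[-1]])
    return path[::-1]

print(viterbi("YRYRY"))          # [1, 2, 1, 2, 1]
```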
8. The Forward Algorithm: Probability of a Sequence

Viterbi finds the single most probable path; the Forward algorithm instead sums over all paths, i.e., P(S | M) = Σφ P(S, φ | M), where φ ranges over all state paths that can emit S.

[Trellis diagram as before: cell (i, k) now accumulates the total probability, summed over all paths, of emitting the first k+1 symbols and ending in state i.]
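A minimal Forward sketch on M1 (my own illustration): the trellis is identical to Viterbi's, but each cell sums over predecessor states instead of maximizing.

```python
# (Real implementations rescale or work in log space to avoid underflow
# on long sequences; plain products suffice for this short example.)
Pt = {0: {1: 1.0},
      1: {1: 0.80, 2: 0.15, 0: 0.05},
      2: {2: 0.70, 1: 0.30}}
Pe = {1: {'Y': 1.0, 'R': 0.0},
      2: {'Y': 0.0, 'R': 1.0}}
STATES = [1, 2]

def forward(seq):
    """P(seq | M1), summed over all state paths."""
    F = {i: Pt[0].get(i, 0) * Pe[i][seq[0]] for i in STATES}
    for s in seq[1:]:
        F = {i: Pe[i][s] * sum(F[j] * Pt[j].get(i, 0) for j in STATES)
             for i in STATES}
    return sum(F[i] * Pt[i].get(0, 0) for i in STATES)   # exit to q0

print(forward("YRYRY"))   # 0.00010125 -- M1 is unambiguous, so this equals
                          # the probability of the single Viterbi path
```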
9. Training an HMM from Labeled Sequences

With labeled training data, the transition and emission distributions are estimated by maximum likelihood, i.e., as relative frequencies of the observed counts:

transitions: ai,j = count(qi → qj) / Σk count(qi → qk)
emissions:   bi,s = count(qi emits s) / Σs′ count(qi emits s′)

From the textbook, Ch. 6.3.
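A sketch of this counting approach in Python (function and variable names are mine, not the textbook's):

```python
from collections import Counter

def train(labeled_data):
    """labeled_data: iterable of (sequence, state_path) pairs; q0 is state 0."""
    t_counts, e_counts = Counter(), Counter()
    t_totals, e_totals = Counter(), Counter()
    for seq, path in labeled_data:
        full = [0] + list(path) + [0]        # every path starts and ends in q0
        for a, b in zip(full, full[1:]):     # count observed transitions
            t_counts[(a, b)] += 1
            t_totals[a] += 1
        for q, s in zip(path, seq):          # count observed emissions
            e_counts[(q, s)] += 1
            e_totals[q] += 1
    Pt = {k: n / t_totals[k[0]] for k, n in t_counts.items()}
    Pe = {k: n / e_totals[k[0]] for k, n in e_counts.items()}
    return Pt, Pe

Pt, Pe = train([("YRYRY", [1, 2, 1, 2, 1]), ("YYR", [1, 1, 2])])
print(Pt[(1, 2)], Pe[(1, 'Y')])              # 0.6 1.0
```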
10. Recall Eukaryotic Gene Structure

[Diagram: the complete mRNA and its coding segment, which runs from the start codon (ATG) to the stop codon (TGA). The gene consists of exons separated by introns; each intron begins with a donor site (GT) and ends with an acceptor site (AG).]
11. Using an HMM for Gene Prediction

[Diagram: the Markov model, with states for Intergenic, Start codon, Exon, Donor, Intron, Acceptor, and Stop codon, plus the silent state q0. The model is run on the input sequence:]

AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA

[The most probable path through the model yields the gene prediction: exon 1, exon 2, exon 3.]
12. Higher Order Markovian Eukaryotic Recognizer (HOMER)

[Overview of the HOMER model versions developed below: H3, H5, H17, H27, H77, H95.]
13. HOMER, version H3

[State diagram: Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon, and q0. I = intron state, E = exon state, N = intergenic state.]

Tested on 500 Arabidopsis genes.
14. Recall Sensitivity and Precision

NOTE: "specificity" as defined here and throughout these slides (and the text) is really precision.
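For reference, the standard definitions (not spelled out on the slide): sensitivity Sn = TP / (TP + FN), the fraction of true features that are predicted; precision (here labeled "specificity") = TP / (TP + FP), the fraction of predictions that are correct.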
15. HOMER, version H5

Three exon states, one for each of the three codon positions.
16. HOMER, version H17

Signal submodels: donor site, acceptor site, start codon, stop codon.
17. Maintaining Phase Across an Intron

phase:       01201201 (intron: no phase) 2012012012
sequence:    GTATGCGATAGTCAAGAGTGATCGCTAGACC
coordinates: 0    5    10   15   20   25   30

The codon-phase counter is suspended within the intron and resumes, in the next exon, where it left off.
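Since the intron contributes no codon positions, the bookkeeping reduces to one modular-arithmetic rule; a hypothetical sketch:

```python
# Hypothetical sketch of phase tracking across an intron: the intron adds
# no coding bases, so the counter resumes exactly where it left off.
def exit_phase(entry_phase, coding_length):
    """Codon phase after emitting coding_length exon bases."""
    return (entry_phase + coding_length) % 3

# First exon segment: 8 bases starting in phase 0 (phases 01201201),
# so the next exon must resume in phase 2, matching the figure above.
print(exit_phase(0, 8))   # 2
```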
18. HOMER, version H27

Three separate intron models, one for each of the three phases.
19. Recall Weight Matrices

- start codons: ATG
- stop codons: TGA, TAA, TAG
- donor splice sites: GT
- acceptor splice sites: AG
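As a sketch of how such a matrix is applied, the snippet below scores a window against a donor-site weight matrix; the probabilities are hypothetical except for the invariant GT core, and real matrices would be estimated from aligned training sites.

```python
import math

donor_pwm = [                     # one probability row per position
    {'A': 0.30, 'C': 0.30, 'G': 0.25, 'T': 0.15},
    {'A': 0.00, 'C': 0.00, 'G': 1.00, 'T': 0.00},  # the invariant G
    {'A': 0.00, 'C': 0.00, 'G': 0.00, 'T': 1.00},  # the invariant T
    {'A': 0.55, 'C': 0.05, 'G': 0.30, 'T': 0.10},
]

def pwm_log_score(window):
    """Sum of per-position log-probabilities; -inf if any base is disallowed."""
    total = 0.0
    for pos, base in zip(donor_pwm, window):
        p = pos[base]
        if p == 0.0:
            return float('-inf')
        total += math.log(p)
    return total

print(pwm_log_score("AGTA"))   # a plausible donor site
print(pwm_log_score("ACTA"))   # -inf: no G at the invariant position
```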
20. HOMER, version H77

Positional biases near splice sites.
21. HOMER, version H95

22. Summary of HOMER Results
23. Higher-order Markov Models

For the example context A C G C T A, the probability of emitting the G depends on progressively more of the preceding sequence:

- 0th order: P(G)
- 1st order: P(G | C)
- 2nd order: P(G | AC)
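A sketch of how such conditional probabilities could be estimated from training sequence (function and variable names are mine):

```python
from collections import Counter

# An order-n model conditions each base on the n preceding bases; the
# estimates are relative frequencies of (context, base) pairs.
def train_markov(seqs, order):
    ctx_counts, full_counts = Counter(), Counter()
    for seq in seqs:
        for k in range(order, len(seq)):
            ctx = seq[k - order:k]
            ctx_counts[ctx] += 1
            full_counts[(ctx, seq[k])] += 1
    # (unseen contexts would need smoothing in a real implementation)
    return lambda base, ctx: full_counts[(ctx, base)] / ctx_counts[ctx]

P2 = train_markov(["ACGCTAACGT"], order=2)
print(P2('G', 'AC'))   # P(G | AC) estimated from the training sequence
```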
24. Higher-order Markov Models

[Figure: results for Markov-model orders 0 through 5.]
25. Summary

- An HMM is a stochastic generative model which emits sequences.
- Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence.
- Training of unambiguous HMMs can be accomplished using labeled sequence training.
- Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm.
- Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...).