Title: MNW2 course Introduction to Bioinformatics
1MNW2 courseIntroduction to Bioinformatics
- Lecture 22 Markov models
- Centre for Integrative Bioinformatics
- FEW/FALW
- heringa_at_cs.vu.nl
2Problem in biology
- Data and patterns are often not clear cut
- When we want to make a method to recognise a
pattern (e.g. a sequence motif), we have to learn
from the data (e.g. maybe there are other
differences between sequences that have the
pattern and those that do not) - This leads to Data mining and Machine learning
3A widely used machine learning approach Markov
models
- Contents
- Markov chain models (1st order, higher order and
- inhomogeneous models parameter estimation
classification) - Interpolated Markov models (and back-off
models) - Hidden Markov models (forward, backward and
Baum- - Welch algorithms model topologies applications
to gene - finding and protein family modeling
4(No Transcript)
5Markov Chain Models
- a Markov chain model is defined by
- a set of states
- some states emit symbols
- other states (e.g. the begin state) are silent
- a set of transitions with associated
probabilities - the transitions emanating from a given state
define a distribution over the possible next
states
6Markov Chain Models
- given some sequence x of length L, we can ask how
probable the sequence is given our model - for any probabilistic model of sequences, we can
write this probability as - key property of a (1st order) Markov chain the
probability of each Xi depends only on Xi-1
7Markov Chain Models
Pr(cggt) Pr(c)Pr(gc)Pr(gg)Pr(tg)
8Markov Chain Models
- Can also have an end state, allowing the model to
represent - Sequences of different lengths
- Preferences for sequences ending with particular
symbols
9Markov Chain Models
The transition parameters can be denoted by
where Similarly we can denote the probability of
a sequence x as Where aBxi represents the
transition from the begin state
10Example Application
- CpG islands
- CGdinucleotides are rarer in eukaryotic genomes
than expected given the independent probabilities
of C, G - but the regions upstream of genes are richer in
CG dinucleotides than elsewhere CpG islands - useful evidence for finding genes
- Could predict CpG islands with Markov chains
- one to represent CpG islands
- one to represent the rest of the genome
- Example includes using Maximum likelihood and
Bayes statistical data and feeding it to a HM
model
11Estimating the Model Parameters
- Given some data (e.g. a set of sequences from CpG
islands), how can we determine the probability
parameters of our model? - One approach maximum likelihood estimation
- given a set of data D
- set the parameters ? to maximize
- Pr(D ?)
- i.e. make the data D look likely under the model
12Maximum Likelihood Estimation
- Suppose we want to estimate the parameters Pr(a),
Pr(c), Pr(g), Pr(t) - And were given the sequences
- accgcgctta
- gcttagtgac
- tagccgttac
- Then the maximum likelihood estimates are
-
- Pr(a) 6/30 0.2 Pr(g) 7/30 0.233
- Pr(c) 9/30 0.3 Pr(t) 8/30 0.267
13(No Transcript)
14(No Transcript)
15(No Transcript)
16(No Transcript)
17These data are derived from genome sequences
18(No Transcript)
19(No Transcript)
20(No Transcript)
21Higher Order Markov Chains
- An nth order Markov chain over some alphabet is
equivalent to a first order Markov chain over the
alphabet of n-tuples - Example a 2nd order Markov model for DNA can be
treated as a 1st order Markov model over
alphabet - AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT,
TA, TC, TG, and TT (i.e. all possible dipeptides)
22A Fifth Order Markov Chain
23Inhomogenous Markov Chains
- In the Markov chain models we have considered so
far, the probabilities do not depend on where we
are in a given sequence - In an inhomogeneous Markov model, we can have
different distributions at different positions in
the sequence - Consider modeling codons in protein coding
regions
24Inhomogenous Markov Chains
25A Fifth Order InhomogenousMarkov Chain
26Selecting the Order of aMarkov Chain Model
- Higher order models remember more history
- Additional history can have predictive value
- Example
- predict the next word in this sentence
fragment finish __ (up, it, first, last, ?) - now predict it given more history
- Fast guys finish __
27Selecting the Order of aMarkov Chain Model
- However, the number of parameters we need to
estimate grows exponentially with the order - for modeling DNA we need parameters for an nth
order model, with n ? 5 normally - The higher the order, the less reliable we can
expect our parameter estimates to be - estimating the parameters of a 2nd order
homogenous Markov chain from the complete genome
of E. Coli, we would see each word gt 72,000 times
on average - estimating the parameters of an 8th order
chain, we would see each word 5 times on
average
28Interpolated Markov Models
- The IMM idea manage this trade-off by
interpolating among models of various orders - Simple linear interpolation
29Interpolated Markov Models
- We can make the weights depend on the history
- for a given order, we may have significantly
more data to estimate some words than others - General linear interpolation
30Gene Finding Search by Content
- Encoding a protein affects the statistical
properties of a DNA sequence - some amino acids are used more frequently than
others (Leu more popular than Trp) - different numbers of codons for different
amino acids (Leu has 6, Trp has 1) - for a given amino acid, usually one codon is
used more frequently than others - This is termed codon preference
- Codon preferences vary by species
31Codon Preference in E. Coli
- AA codon /1000
- ----------------------
- Gly GGG 1.89
- Gly GGA 0.44
- Gly GGU 52.99
- Gly GGC 34.55
- Glu GAG 15.68
- Glu GAA 57.20
- Asp GAU 21.63
- Asp GAC 43.26
32Search by Content
Common way to search by content build
Markov models of coding noncoding regions
apply models to ORFs (Open Reading Frames) or
fixed- sized windows of sequence GeneMark
Borodovsky et al. popular system for
identifying genes in bacterial genomes uses
5th order inhomogenous Markov chain models
33The GLIMMER System
- Salzberg et al., 1998
- System for identifying genes in bacterial genomes
- Uses 8th order, inhomogeneous, interpolated
Markov chain models
34IMMs in GLIMMER
- How does GLIMMER determine the values?
- First, let us express the IMM probability
calculation recursively
35IMMs in GLIMMER
- If we havent seen xi-1 xi-n more than 400
times, then compare the counts for the following
Use a statistical test ( ?2) to get a value d
indicating our confidence that the distributions
represented by the two sets of counts are
different
36IMMs in GLIMMER
?2 score when comparing nth-order with
n-1th-order Markov model (preceding slide)
37The GLIMMER method
- 8th order IMM vs. 5th order Markov model
- Trained on 1168 genes (ORFs really)
- Tested on 1717 annotated (more or less known)
genes
38(No Transcript)
39(No Transcript)
40Plot sensitivity over 1-specificity
41(No Transcript)
42(No Transcript)
43Hidden Markov models (HMMs)
Given say a T in our input sequence, which state
emitted it?
44Hidden Markov models (HMMs)
- Hidden State
- We will distinguish between the observed parts
of a problem and the hidden parts - In the Markov models we have considered
previously, it is clear which state accounts for
each part of the observed sequence - In the model above (preceding slide), there are
multiple states that could account for each part
of the observed sequence - this is the hidden part of the problem
- states are decoupled from sequence symbols
45HMM-based homology searching
HMM for ungapped alignment Transition
probabilities and Emission probabilities Gapped
HMMs also have insertion and deletion states
(next slide)
46Profile HMM mmatch state, I-insert state,
ddelete state go from left to right. I and m
states output amino acids d states are silent.
Model for alignment with insertions and deletions
47HMM-based homology searching
- Most widely used HMM-based profile searching
tools currently are SAM-T99 (Karplus et al.,
1998) and HMMER2 (Eddy, 1998) - formal probabilistic basis and consistent theory
behind gap and insertion scores - HMMs good for profile searches, bad for alignment
(due to parametrisation of the models) - HMMs are slow
48Homology-derived Secondary Structure of Proteins
(HSSP) Sander Schneider, 1991
Its all about trying to push dont know region
down
49The Parameters of an HMM
50HMM for Eukaryotic Gene Finding
Figure from A. Krogh, An Introduction to Hidden
Markov Models for Biological Sequences
51A Simple HMM
52Three Important Questions
- How likely is a given sequence?
- the Forward algorithm
- What is the most probable path for generating a
given sequence? - the Viterbi algorithm
- How can we learn the HMM parameters given a set
of sequences? - the Forward-Backward (Baum-Welch) algorithm
53Three Important Questions
- How likely is a given sequence?
- Forward algorithm
- What is the most probable path for generating a
given sequence? - How can we learn the HMM parameters given a set
of sequences?
54How Likely is a Given Sequence?
- The probability that the path is taken and the
sequence is generated - (assuming begin/end are the only silent states on
path)
55How Likely is a Given Sequence?
56How Likely is a Given Sequence?
The probability over all paths is but the
number of paths can be exponential in the length
of the sequence... the Forward algorithm
enables us to compute this efficiently
57How Likely is a Given SequenceThe Forward
Algorithm
- Define fk(i) to be the probability of being in
state k - Having observed the first i characters of x we
want to compute fN(L), the probability of being
in the end state having observed all of x - We can define this recursively
58How Likely is a Given Sequence
59The forward algorithm
- Initialisation
- f0(0) 1 (start),
- fk(0) 0 (other silent states k)
- Recursion fl(i) el(i)?k fk(i-1)akl
(emitting states), - fl(i) ?k fk(i)akl (silent states)
- Termination
- Pr(x) Pr(x1xL) f N(L) ?k fk(L)akN
probability that were in start state and have
observed 0 characters from the sequence
probability that we are in the end state and have
observed the entire sequence
60Forward algorithm example
61Three Important Questions
- How likely is a given sequence?
- What is the most probable path for generating a
given sequence? - Viterbi algorithm
- How can we learn the HMM parameters given a set
of sequences?
62Finding the Most Probable PathThe Viterbi
Algorithm
- Define vk(i) to be the probability of the most
probable path accounting for the first i
characters of x and ending in state k - We want to compute vN(L), the probability of the
most probable path accounting for all of the
sequence and ending in the end state - Can be defined recursively
- Can use DP to find vN(L) efficiently
63Finding the Most Probable PathThe Viterbi
Algorithm
- Initialisation
- v0(0) 1 (start), vk(0) 0 (non-silent states)
- Recursion for emitting states (i 1L)
-
- Recursion for silent states
64Finding the Most Probable PathThe Viterbi
Algorithm
65Three Important Questions
- How likely is a given sequence? (clustering)
- What is the most probable path for generating a
given sequence? (alignment) - How can we learn the HMM parameters given a set
of sequences? - The Baum-Welch Algorithm
66The Learning Task
- Given
- a model
- a set of sequences (the training set)
- Do
- find the most likely parameters to explain the
training sequences - The goal is find a model that generalizes well to
sequences we havent seen before
67Learning Parameters
- If we know the state path for each training
sequence, learning the model parameters is simple - no hidden state during training
- count how often each parameter is used
- normalize/smooth to get probabilities
- process just like it was for Markov chain
models - If we dont know the path for each training
sequence, how can we determine the counts? - key insight estimate the counts by
considering every path weighted by its
probability
68Learning ParametersThe Baum-Welch Algorithm
- An EM (expectation maximization) approach, a
forward-backward algorithm - Algorithm sketch
- initialize parameters of model
- iterate until convergence
- Calculate the expected number of times each
transition or emission is used - Adjust the parameters to maximize the likelihood
of these expected values
69The Expectation step
70The Expectation step
71The Expectation step
72The Expectation step
73The Expectation step
- First, we need to know the probability of the i
th symbol being produced by state q, given
sequence x - Pr( ?i k x)
- Given this we can compute our expected counts for
state transitions, character emissions
74The Expectation step
75The Backward Algorithm
76The Expectation step
77The Expectation step
78The Expectation step
79The Maximization step
80The Maximization step
81The Baum-Welch Algorithm
- Initialize parameters of model
- Iterate until convergence
- calculate the expected number of times each
transition or emission is used - adjust the parameters to maximize the
likelihood of these expected values - This algorithm will converge to a local maximum
(in the likelihood of the data given the model) - Usually in a fairly small number of iterations