Title: MARKOV MODELS
1. Section 8
- MARKOV MODELS
- Prepared and presented by Saman Halgamuge
2. Markov Chains: An Introduction
- E.g. a non-returning random walk: the walker never returns to the location visited immediately before.
- A Markov chain is a stochastic process with the memoryless (Markov) property, meaning that the present state fully captures all the information about the future evolution of the process.
- A Markov chain is a triplet (i.e. characterized by 3 parameters):
- Q is a set of states; each state emits a symbol in the alphabet Σ.
- p is the probability of the initial state being s, for each s ∈ Q.
- A is the set of state transition probabilities, a_st for each s, t ∈ Q.
- For each s, t ∈ Q the transition probability is a_st = P(x_i = t | x_{i-1} = s).
- For a random process X = (x_1, x_2, …, x_L), a Markov chain has the memoryless property: the variable x_i depends only on the previous value (x_{i-1}) and not on the history of the process.
3. Markov Chains
- For the sequence X = (x_1, x_2, …, x_L), the probability of the sequence is P(X) = P(x_L | x_{L-1}, …, x_1) P(x_{L-1} | x_{L-2}, …, x_1) … P(x_1).
- Using the memoryless property of Markov chains, we get P(X) = P(x_1) ∏_{i=2}^{L} a_{x_{i-1} x_i}, where P(x_1) is the probability of starting in a particular state.
- Add begin and end states with the corresponding symbols x_0 and x_{L+1}. Define p(s) = a_{0s} as the initial probability of symbol s, so that P(X) = ∏_{i=1}^{L+1} a_{x_{i-1} x_i}.
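This calculation translates directly into code. Below is a minimal Python sketch (not from the slides; the names `init_prob` and `trans_prob` are illustrative) that evaluates P(X) as the initial probability times the product of the transition probabilities.

```python
# Minimal sketch: probability of a sequence under a first-order Markov chain.
# init_prob[s]     = p(s), the probability of starting in state s
# trans_prob[s][t] = a_st, the probability of moving from state s to state t

def sequence_probability(seq, init_prob, trans_prob):
    """Return P(X) = p(x_1) * prod_{i=2..L} a_{x_{i-1} x_i}."""
    if not seq:
        return 1.0
    prob = init_prob[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        prob *= trans_prob[prev][curr]
    return prob

# Toy two-state chain (the numbers are made up for illustration).
init_prob = {"A": 0.5, "B": 0.5}
trans_prob = {"A": {"A": 0.9, "B": 0.1},
              "B": {"A": 0.2, "B": 0.8}}
print(sequence_probability("AABB", init_prob, trans_prob))  # 0.5 * 0.9 * 0.1 * 0.8
```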
4. Markov Chain to Represent a DNA Sequence
- The probability of the sequence becomes P(X) = P(x_1) ∏_{i=2}^{L} a_{x_{i-1} x_i}.
- Arrows represent transition probabilities.
- Each state emits the corresponding symbol, i.e. there is a one-to-one correspondence between symbols and states.
A Markov chain for modeling a DNA sequence
Example: AAACCCCTTTTGGG. Construct the Markov chain to represent the above sequence (a counting sketch follows below).
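One way to construct the chain for the example sequence is to count the observed transitions and normalise each row; the sketch below (the function name `estimate_transitions` is my own) does exactly that.

```python
def estimate_transitions(seq, alphabet="ACGT"):
    """Estimate a_st as count(s followed by t) / count(s followed by anything)."""
    counts = {s: {t: 0 for t in alphabet} for s in alphabet}
    for prev, curr in zip(seq, seq[1:]):
        counts[prev][curr] += 1
    trans = {}
    for s in alphabet:
        total = sum(counts[s].values())
        # If a symbol never appears as a predecessor, leave its row at zero.
        trans[s] = {t: (counts[s][t] / total if total else 0.0) for t in alphabet}
    return trans

trans = estimate_transitions("AAACCCCTTTTGGG")
print(trans["A"])  # {'A': 0.666..., 'C': 0.333..., 'G': 0.0, 'T': 0.0}
```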
5. Using Markov Chains: CpG Islands
- In DNA, the dinucleotide CG (abbreviated CpG) is typically modified by a process called methylation, which tends to mutate the C into a T.
- Consequently, CpG dinucleotides are relatively rare in the human genome and most other genomes.
- For biologically important reasons, this mutation is suppressed in short stretches of DNA (a few hundred nucleotides long) around the promoter (start) site of genes. In these regions CpG is more frequent.
- These regions are called CpG islands. The "p" in the CpG notation refers to the phosphodiester bond between the cytidine and the guanosine.
6. Using Markov Chains: CpG Islands
- Questions:
- Given a short sequence of DNA, how do we decide whether it comes from a CpG island or not?
- Given a long genome, how do we locate the CpG islands in it?
- Two Markov chain models can be used to solve the problem:
- The '+' model represents sequences in which CpG is frequent (CpG islands).
- The '-' model represents sequences in which CpG is rare (non-island regions).
7. Identifying CpG Islands
- Let a^+_st be the transition probabilities in the '+' model and a^-_st those in the '-' model.
- These probabilities have been calculated from some known CpG islands and non-CpG regions.
8. Identifying CpG Islands
- For a given sequence X of length L, we can now calculate the probability of the sequence under each model using the equation P(X | model) = ∏_{i=1}^{L} a_{x_{i-1} x_i}.
- However, for computational accuracy, we calculate the log-odds ratio S(X) = log2 [ P(X | '+') / P(X | '-') ] = ∑_{i=1}^{L} log2 ( a^+_{x_{i-1} x_i} / a^-_{x_{i-1} x_i} ).
- It is customary to use logarithmic base 2 when calculating log-odds ratios (the answer is then in bits).
The histogram of scores for given sequences. The
CpG islands (black) clearly stand out from
non-CpG islands (gray).
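A minimal sketch of the log-odds computation, assuming the two transition tables are stored as nested dictionaries; the numbers in `plus` and `minus` below are placeholders, not the published CpG-island values.

```python
import math

def log_odds_score(seq, plus, minus):
    """S(X) = sum_i log2( a^+_{x_{i-1} x_i} / a^-_{x_{i-1} x_i} ), in bits."""
    score = 0.0
    for prev, curr in zip(seq, seq[1:]):
        score += math.log2(plus[prev][curr] / minus[prev][curr])
    return score

# Placeholder tables (NOT the published values): the '+' model favours C -> G.
plus = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
plus["C"] = {"A": 0.15, "C": 0.30, "G": 0.30, "T": 0.25}
minus = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
minus["C"] = {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30}

print(log_odds_score("ACGCGCGT", plus, minus))  # a positive score suggests a CpG island
```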
9. Locating CpG Islands in a Genome
- To solve this problem, we need to combine the two
Markov chains considered earlier into one unified
model.
- Add a small probability of switching from one
chain to the other at each state transition event
(shown by arrows). - There are 2 states corresponding to each
nucleotide symbol, so a symbol emitted does not
reveal the internal state.
- We have 8 states emitting only 4 symbols → we need to introduce emission probabilities in addition to transition probabilities.
- This is a hidden Markov model (HMM).
10. Hidden Markov Models
- A hidden Markov model (HMM) is a stochastic process with an underlying stochastic state transition process that is not observable (hidden). The underlying process can only be inferred through the set of symbols emitted sequentially by the process.
- Example: the dishonest casino dealer. Hidden states: F (fair) or L (loaded). Set of emitted symbols: {1, …, 6}.
11. Hidden Markov Model
- An HMM is a triplet M = (Σ, Q, Θ) where:
- Σ is an alphabet of symbols.
- Q is a set of states capable of emitting symbols from the alphabet Σ.
- Θ is a set of probabilities comprising:
- the state transition probabilities, a_kl for each k, l ∈ Q;
- the emission probabilities, e_k(b) for each k ∈ Q and b ∈ Σ.
- A path π = (π_1, …, π_L) is a sequence of states with the corresponding symbol sequence X = (x_1, …, x_L).
- The path itself follows a Markov chain (i.e. it is memoryless).
- There is no one-to-one correspondence between the states and the symbols.
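As a concrete illustration of this triplet, a minimal Python container might look as follows (the class and field names are my own, not from the slides); it stores Σ, Q, a_kl and e_k(b), and checks that each row is a valid probability distribution.

```python
class HMM:
    """Minimal container for the triplet M = (alphabet, states, probabilities)."""

    def __init__(self, alphabet, states, trans, emit):
        self.alphabet = list(alphabet)   # Sigma
        self.states = list(states)       # Q
        self.trans = trans               # trans[k][l] = a_kl
        self.emit = emit                 # emit[k][b]  = e_k(b)
        for k in self.states:
            assert abs(sum(trans[k].values()) - 1.0) < 1e-9, f"a_{k}* must sum to 1"
            assert abs(sum(emit[k].values()) - 1.0) < 1e-9, f"e_{k} must sum to 1"
```

The dishonest-casino parameters introduced on the later slides fit directly into this layout.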
12. Hidden Markov Model
- State transition probabilities: a_kl = P(π_i = l | π_{i-1} = k).
- Emission probabilities: e_k(b) = P(x_i = b | π_i = k).
- The probability that the sequence X was generated by the model M along the path π is P(X, π) = a_{π_0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}, where π_0 is the begin state and π_{L+1} is the end state.
13. HMM for Detecting CpG Islands in a Genome
- The HMM consists of 8 states and 4 symbols.
- States: A+, C+, G+, T+, A-, C-, G-, T-.
- Emitted symbols: A, C, G, T.
- Probability of staying in a CpG island: p.
- Probability of staying outside a CpG island: q.
- Emission probability of symbol A while in state A+ or A-: 1.0; emission probability of symbol C while in state C+ or C-: 1.0, etc.
- All other emission probabilities are zero (e.g. e_{A+}(C) = 0.0).
- Transition probabilities can be derived from the two tables considered earlier.
14. HMM for Detecting CpG Islands in a Genome
- Transition probabilities of the HMM.
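The slides do not spell out the exact combination rule, but one simple construction consistent with the description above is: from a '+' state, stay in the island sub-model with probability p and follow a^+, or switch to the '-' sub-model with probability 1 - p and follow a^-; symmetrically for '-' states with q. The sketch below assumes this rule, and the 4x4 tables used are placeholders.

```python
def build_cpg_transitions(plus, minus, p, q, alphabet="ACGT"):
    """Combine the '+' and '-' 4x4 tables into one 8-state transition table.

    Assumed rule: from s+, stay inside the island sub-model with probability p
    (following a^+_{st}), or switch to the '-' sub-model with probability 1 - p
    (following a^-_{st}); symmetrically for s- with the staying probability q.
    """
    trans = {}
    for s in alphabet:
        trans[s + "+"], trans[s + "-"] = {}, {}
        for t in alphabet:
            trans[s + "+"][t + "+"] = p * plus[s][t]
            trans[s + "+"][t + "-"] = (1 - p) * minus[s][t]
            trans[s + "-"][t + "-"] = q * minus[s][t]
            trans[s + "-"][t + "+"] = (1 - q) * plus[s][t]
    return trans

# Deterministic emissions: state b+ or b- always emits the symbol b.
emissions = {s + sign: {t: (1.0 if t == s else 0.0) for t in "ACGT"}
             for s in "ACGT" for sign in "+-"}

# Placeholder 4x4 tables and switching parameters (not the published values).
uniform = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
transitions = build_cpg_transitions(plus=uniform, minus=uniform, p=0.99, q=0.999)
```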
15. Example: HMM for Modeling a Dishonest Casino
- A casino dealer uses a fair die most of the time, but occasionally switches to a loaded die. Assume:
- With the loaded die, the probability of a six is 0.5 and each other number has probability 0.1.
- The probability of switching from the fair to the loaded die is 0.05 at each roll.
- The probability of switching from the loaded to the fair die is 0.1 at each roll.
- Switching between dice is a Markov process.
- In each state of the Markov process, the outcomes have different probabilities.
- The whole process is an HMM.
16. Example: Dishonest Casino
- There are two possible states, Fair and Loaded: Q = {F, L}.
- There are six possible outcomes: Σ = {1, 2, 3, 4, 5, 6}.
- The transition probabilities are shown by arrows.
- The emission probabilities are shown inside each state box.
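To make the example concrete, here is a minimal sketch that encodes these parameters and simulates a run of the process, emitting rolls together with the hidden states (starting in the fair state is an assumption):

```python
import random

TRANS = {"F": {"F": 0.95, "L": 0.05},   # fair -> loaded with probability 0.05
         "L": {"F": 0.10, "L": 0.90}}   # loaded -> fair with probability 0.1
EMIT = {"F": {str(k): 1 / 6 for k in range(1, 7)},               # fair die
        "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5}}   # loaded die

def sample(dist):
    """Draw an outcome from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, prob in dist.items():
        acc += prob
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def roll(n, start="F"):
    """Generate n rolls (symbols) and the hidden state path."""
    state, symbols, path = start, [], []
    for _ in range(n):
        path.append(state)
        symbols.append(sample(EMIT[state]))
        state = sample(TRANS[state])
    return "".join(symbols), "".join(path)

rolls, hidden = roll(30)
print(rolls)   # e.g. '4152663...' (random)
print(hidden)  # e.g. 'FFFFLLL...' (random)
```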
17. Decoding Problem: Most Probable State Path
- Given the HMM M = (Σ, Q, Θ) and a sequence of symbols X over Σ for which the generating path π = (π_1, …, π_L) is unknown:
- In general, there could be many state sequences π that could give rise to the particular sequence of symbols X.
- Find the most probable generating path π* for X, i.e. a path such that P(X, π) is maximized.
18. Most Probable State Path
- The solution π* will reveal the hidden states that generated the sequence X.
- CpG island case:
- All parts of π* that pass through '+' states are CpG islands.
- Dishonest casino case:
- All parts of π* that pass through state L are suspected rolls of the loaded die.
- A solution for the most probable path is given by the Viterbi algorithm.
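The Viterbi algorithm itself is not spelled out on these slides, but a standard dynamic-programming sketch (using log probabilities to avoid underflow, and an explicit initial distribution `init` as my own convention) looks like this:

```python
import math

def viterbi(symbols, states, init, trans, emit):
    """Return the most probable state path for `symbols` (log probabilities used)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # v[k] = log probability of the best path for the first i symbols ending in state k.
    v = {k: log(init[k]) + log(emit[k][symbols[0]]) for k in states}
    backpointers = []
    for b in symbols[1:]:
        ptr, new_v = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: v[k] + log(trans[k][l]))
            ptr[l] = best_k
            new_v[l] = v[best_k] + log(trans[best_k][l]) + log(emit[l][b])
        v, backpointers = new_v, backpointers + [ptr]
    # Trace back from the best final state.
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Dishonest-casino parameters from the earlier slides (the uniform start is an assumption).
states = ["F", "L"]
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {str(k): 1 / 6 for k in range(1, 7)},
        "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5}}
print("".join(viterbi("12566666662134", states, init, trans, emit)))
```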