1 Master's course: Bioinformatics Data Analysis and Tools
- Lecture 5: Markov models
- Centre for Integrative Bioinformatics
2 Problem in biology
- Data and patterns are often not clear cut
- When we want to make a method to recognise a pattern (e.g. a sequence motif), we have to learn from the data (e.g. maybe there are other differences between sequences that have the pattern and those that do not)
- This leads to data mining and machine learning
3 A widely used machine learning approach: Markov models
- Contents
- Markov chain models (1st order, higher order and inhomogeneous models; parameter estimation; classification)
- Interpolated Markov models (and back-off models)
- Hidden Markov models (forward, backward and Baum-Welch algorithms; model topologies; applications to gene finding and protein family modeling)
5 Markov Chain Models
- a Markov chain model is defined by
- a set of states
- some states emit symbols
- other states (e.g. the begin state) are silent
- a set of transitions with associated probabilities
- the transitions emanating from a given state define a distribution over the possible next states
6 Markov Chain Models
- given some sequence x of length L, we can ask how probable the sequence is given our model
- for any probabilistic model of sequences, we can write this probability as
  Pr(x) = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1)
- key property of a (1st order) Markov chain: the probability of each x_i depends only on x_{i-1}, i.e.
  Pr(x_i | x_{i-1}, ..., x_1) = Pr(x_i | x_{i-1})
7 Markov Chain Models
Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g)
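As a small illustration of the factorisation above, here is a minimal Python sketch that scores a sequence under a 1st order Markov chain; the initial and transition probabilities are made-up example values, not parameters from the lecture.

# Minimal sketch: probability of a DNA sequence under a 1st order Markov chain.
# All probability values below are made up, purely for illustration.
initial = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}          # Pr(x_1)
transition = {                                                   # Pr(x_i | x_{i-1})
    "a": {"a": 0.30, "c": 0.20, "g": 0.30, "t": 0.20},
    "c": {"a": 0.25, "c": 0.25, "g": 0.15, "t": 0.35},
    "g": {"a": 0.20, "c": 0.30, "g": 0.30, "t": 0.20},
    "t": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25},
}

def sequence_probability(seq):
    """Pr(x) = Pr(x_1) * product over i > 1 of Pr(x_i | x_{i-1})."""
    p = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= transition[prev][curr]
    return p

print(sequence_probability("cggt"))   # Pr(c) * Pr(g|c) * Pr(g|g) * Pr(t|g)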
8 Markov Chain Models
- Can also have an end state, allowing the model to represent
- Sequences of different lengths
- Preferences for sequences ending with particular symbols
9 Markov Chain Models
The transition parameters can be denoted by a_{x_{i-1} x_i}, where a_{x_{i-1} x_i} = Pr(x_i | x_{i-1}).
Similarly, we can denote the probability of a sequence x as
  Pr(x) = a_{B x_1} ∏_{i=2}^{L} a_{x_{i-1} x_i}
where a_{B x_1} represents the transition from the begin state.
10 Example Application
- CpG islands
- CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C and G
- but the regions upstream of genes are richer in CG dinucleotides than elsewhere: CpG islands
- useful evidence for finding genes
- Could predict CpG islands with Markov chains
- one to represent CpG islands
- one to represent the rest of the genome
- Example: estimate the parameters by maximum likelihood or Bayesian statistics and feed them into a Markov model
11 Estimating the Model Parameters
- Given some data (e.g. a set of sequences from CpG islands), how can we determine the probability parameters of our model?
- One approach: maximum likelihood estimation
- given a set of data D
- set the parameters θ to maximize Pr(D | θ)
- i.e. make the data D look likely under the model
12 Maximum Likelihood Estimation
- Suppose we want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t)
- And we're given the sequences
- accgcgctta
- gcttagtgac
- tagccgttac
- Then the maximum likelihood estimates are
- Pr(a) = 6/30 = 0.2      Pr(g) = 7/30 = 0.233
- Pr(c) = 9/30 = 0.3      Pr(t) = 8/30 = 0.267
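These counts can be checked with a few lines of Python; this is just a sketch reproducing the estimates on the slide (0.233 and 0.267 are rounded values).

from collections import Counter

sequences = ["accgcgctta", "gcttagtgac", "tagccgttac"]

counts = Counter("".join(sequences))          # how often each base occurs
total = sum(counts.values())                  # 30 characters in total

for base in "acgt":
    print(f"Pr({base}) = {counts[base]}/{total} = {counts[base] / total:.3f}")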
17 These data are derived from genome sequences
21 Higher Order Markov Chains
- An nth order Markov chain over some alphabet is equivalent to a first order Markov chain over the alphabet of n-tuples
- Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet
- AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, and TT (i.e. all possible dinucleotides)
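To make the equivalence concrete, the sketch below (illustrative only) lists the 16-letter dinucleotide alphabet and rewrites a DNA sequence as overlapping 2-tuples, so that a 2nd order chain over {A, C, G, T} becomes a 1st order chain over pairs.

from itertools import product

# The 16-letter dinucleotide alphabet for a 2nd order DNA model.
dinucleotides = ["".join(p) for p in product("ACGT", repeat=2)]
print(dinucleotides)   # ['AA', 'AC', ..., 'TT']

def as_tuples(seq, n=2):
    """View a sequence over {A, C, G, T} as overlapping n-tuples."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

print(as_tuples("ACCGCGCTTA"))   # ['AC', 'CC', 'CG', 'GC', 'CG', 'GC', 'CT', 'TT', 'TA']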
22 A Fifth Order Markov Chain
23 Inhomogeneous Markov Chains
- In the Markov chain models we have considered so far, the probabilities do not depend on where we are in a given sequence
- In an inhomogeneous Markov model, we can have different distributions at different positions in the sequence
- Consider modeling codons in protein-coding regions
24 Inhomogeneous Markov Chains
25 A Fifth Order Inhomogeneous Markov Chain
26 Selecting the Order of a Markov Chain Model
- Higher order models remember more history
- Additional history can have predictive value
- Example
- predict the next word in this sentence fragment: "finish __" (up, it, first, last, ...)
- now predict it given more history
- "Fast guys finish __"
27 Selecting the Order of a Markov Chain Model
- However, the number of parameters we need to estimate grows exponentially with the order (see the sketch after this list)
- for modeling DNA, an nth order model needs 4^(n+1) transition parameters, with n ≥ 5 normally
- The higher the order, the less reliable we can expect our parameter estimates to be
- estimating the parameters of a 2nd order homogeneous Markov chain from the complete genome of E. coli, we would see each word > 72,000 times on average
- estimating the parameters of an 8th order chain, we would see each word about 5 times on average
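A quick sketch of how the parameter count explodes with the order: an nth order DNA model has 4^n possible histories, each with a distribution over the 4 next bases, i.e. 4^(n+1) transition parameters.

# Number of transition parameters for an nth order Markov chain over DNA.
for order in range(1, 9):
    histories = 4 ** order          # possible length-n contexts
    params = histories * 4          # one distribution over {A, C, G, T} per context
    print(f"order {order}: {histories:>6} histories, {params:>7} transition parameters")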
28 Interpolated Markov Models
- The IMM idea: manage this trade-off by interpolating among models of various orders
- Simple linear interpolation:
  Pr(x_i | x_{i-n}, ..., x_{i-1}) = λ_0 Pr_ML(x_i) + λ_1 Pr_ML(x_i | x_{i-1}) + ... + λ_n Pr_ML(x_i | x_{i-n}, ..., x_{i-1}), with Σ_j λ_j = 1
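A minimal Python sketch of simple linear interpolation over maximum-likelihood estimates of different orders; the table layout, the weights and the example numbers are all assumptions made for illustration, not the values any particular tool uses.

def interpolated_prob(x_i, history, ml_estimates, lambdas):
    """
    Simple linear interpolation:
        Pr(x_i | history) = sum_j lambda_j * Pr_ML(x_i | last j characters of history)
    ml_estimates[j] maps a length-j context string to a dict of next-base probabilities;
    lambdas[j] are non-negative weights summing to 1.
    """
    p = 0.0
    for j, lam in enumerate(lambdas):
        context = history[len(history) - j:] if j > 0 else ""
        p += lam * ml_estimates[j][context][x_i]
    return p

# Example usage with made-up tables for orders 0 and 1 only:
ml = {
    0: {"": {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}},
    1: {"g": {"a": 0.1, "c": 0.2, "g": 0.4, "t": 0.3}},
}
print(interpolated_prob("t", "g", ml, lambdas=[0.3, 0.7]))   # 0.3 * 0.25 + 0.7 * 0.3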
29 Interpolated Markov Models
- We can make the weights depend on the history
- for a given order, we may have significantly more data to estimate some words than others
- General linear interpolation: the weight given to the highest-order estimate depends on the particular history, and the remainder goes to the lower-order interpolated estimate:
  Pr(x_i | x_{i-n}, ..., x_{i-1}) = λ_n(x_{i-n}, ..., x_{i-1}) Pr_ML(x_i | x_{i-n}, ..., x_{i-1}) + [1 - λ_n(x_{i-n}, ..., x_{i-1})] Pr(x_i | x_{i-n+1}, ..., x_{i-1})
30 Gene Finding: Search by Content
- Encoding a protein affects the statistical properties of a DNA sequence
- some amino acids are used more frequently than others (Leu more popular than Trp)
- different numbers of codons for different amino acids (Leu has 6, Trp has 1)
- for a given amino acid, usually one codon is used more frequently than others
- This is termed codon preference
- Codon preferences vary by species
31 Codon Preference in E. coli
- AA    codon   frequency per 1000 codons
- ----------------------------------------
- Gly   GGG      1.89
- Gly   GGA      0.44
- Gly   GGU     52.99
- Gly   GGC     34.55
- Glu   GAG     15.68
- Glu   GAA     57.20
- Asp   GAU     21.63
- Asp   GAC     43.26
32 Search by Content
- Common way to search by content
- build Markov models of coding and noncoding regions
- apply the models to ORFs (Open Reading Frames) or fixed-sized windows of sequence
- GeneMark (Borodovsky et al.)
- popular system for identifying genes in bacterial genomes
- uses 5th order inhomogeneous Markov chain models
33 The GLIMMER System
- Salzberg et al., 1998
- System for identifying genes in bacterial genomes
- Uses 8th order, inhomogeneous, interpolated Markov chain models
34 IMMs in GLIMMER
- How does GLIMMER determine the λ values?
- First, let us express the IMM probability calculation recursively:
  IMM_n(x_i) = λ_n(x_{i-n}, ..., x_{i-1}) Pr_ML(x_i | x_{i-n}, ..., x_{i-1}) + [1 - λ_n(x_{i-n}, ..., x_{i-1})] IMM_{n-1}(x_i)
35 IMMs in GLIMMER
- If we haven't seen the history x_{i-n} ... x_{i-1} more than 400 times, then compare the next-character counts for the nth-order context with those for the (n-1)th-order context
- Use a statistical test (χ²) to get a value d indicating our confidence that the distributions represented by the two sets of counts are different (a sketch of such a test follows below)
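A sketch of the kind of comparison described above: given the observed next-base counts after the full length-n history and after the shorter history, a χ² test yields a confidence d that the two distributions differ. The counts are made up, and this only illustrates the test itself, not GLIMMER's exact rule for converting d into λ values.

from scipy.stats import chi2_contingency

# Next-base counts (A, C, G, T) after the full nth-order context and after the
# shorter (n-1)th-order context; made-up numbers, purely for illustration.
counts_order_n = [12, 3, 30, 5]
counts_order_n_minus_1 = [400, 310, 890, 250]

chi2, p_value, dof, _ = chi2_contingency([counts_order_n, counts_order_n_minus_1])
d = 1.0 - p_value   # confidence that the two count distributions really differ
print(f"chi2 = {chi2:.2f}, confidence d = {d:.3f}")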
36 IMMs in GLIMMER
χ² score when comparing the nth-order with the (n-1)th-order Markov model (preceding slide)
37 The GLIMMER method
- 8th order IMM vs. 5th order Markov model
- Trained on 1168 genes (ORFs really)
- Tested on 1717 annotated (more or less known) genes
40 Plot sensitivity over (1 - specificity)
43 Hidden Markov models (HMMs)
Given, say, a T in our input sequence, which state emitted it?
44 Hidden Markov models (HMMs)
- Hidden State
- We will distinguish between the observed parts of a problem and the hidden parts
- In the Markov models we have considered previously, it is clear which state accounts for each part of the observed sequence
- In the model above (preceding slide), there are multiple states that could account for each part of the observed sequence
- this is the hidden part of the problem
- states are decoupled from sequence symbols
45 HMM-based homology searching
HMM for ungapped alignment: transition probabilities and emission probabilities. Gapped HMMs also have insertion and deletion states (next slide)
46 Profile HMM: m = match state, i = insert state, d = delete state; go from left to right. The i and m states output amino acids; d states are silent. Model for alignment with insertions and deletions
47 HMM-based homology searching
- Most widely used HMM-based profile searching tools currently are SAM-T99 (Karplus et al., 1998) and HMMER2 (Eddy, 1998)
- formal probabilistic basis and consistent theory behind gap and insertion scores
- HMMs good for profile searches, bad for alignment (due to parametrisation of the models)
- HMMs are slow
48 Homology-derived Secondary Structure of Proteins (HSSP), Sander & Schneider, 1991
It's all about trying to push the "don't know" region down
49 The Parameters of an HMM
- transition probabilities: a_kl = Pr(π_i = l | π_{i-1} = k)
- emission probabilities: e_k(b) = Pr(x_i = b | π_i = k)
50 HMM for Eukaryotic Gene Finding
Figure from A. Krogh, "An Introduction to Hidden Markov Models for Biological Sequences"
51 A Simple HMM
52 Three Important Questions
- How likely is a given sequence?
- the Forward algorithm
- What is the most probable path for generating a given sequence?
- the Viterbi algorithm
- How can we learn the HMM parameters given a set of sequences?
- the Forward-Backward (Baum-Welch) algorithm
53 Three basic problems of HMMs
- Once we have an HMM, there are three problems of interest.
- (1) The Evaluation Problem
- Given an HMM and a sequence of observations, what is the probability that the observations are generated by the model?
- (2) The Decoding Problem
- Given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations?
- (3) The Learning Problem
- Given a model and a sequence of observations, how should we adjust the model parameters in order to maximize the probability of the observations given the model?
- The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.
54 Three Important Questions
- How likely is a given sequence?
- Forward algorithm
- What is the most probable path for generating a given sequence?
- How can we learn the HMM parameters given a set of sequences?
55 How Likely is a Given Sequence?
- The probability that the path π is taken and the sequence x is generated:
  Pr(x_1 ... x_L, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}}, with π_{L+1} the end state
- (assuming begin/end are the only silent states on the path)
56 How Likely is a Given Sequence?
57 How Likely is a Given Sequence?
The probability over all paths is Pr(x) = Σ_π Pr(x, π), but the number of paths can be exponential in the length of the sequence... the Forward algorithm enables us to compute this efficiently
58 How Likely is a Given Sequence: The Forward Algorithm
- Define f_k(i) to be the probability of being in state k having observed the first i characters of x
- We want to compute f_N(L), the probability of being in the end state having observed all of x
- We can define this recursively
59 How Likely is a Given Sequence
60 The forward algorithm
- Initialisation:
- f_0(0) = 1 (start), i.e. the probability that we're in the start state and have observed 0 characters from the sequence
- f_k(0) = 0 (other silent states k)
- Recursion:
- f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl (emitting states)
- f_l(i) = Σ_k f_k(i) a_kl (silent states)
- Termination:
- Pr(x) = Pr(x_1 ... x_L) = f_N(L) = Σ_k f_k(L) a_kN, i.e. the probability that we are in the end state and have observed the entire sequence
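A runnable sketch of the forward algorithm for a small, fully connected HMM with no internal silent states (the begin and end states are handled implicitly); the two-state model and its probabilities are made up for illustration.

import numpy as np

# Made-up two-state HMM; each row of `a` plus the matching entry of `a_end` sums to 1.
a = np.array([[0.80, 0.10],          # a[k, l] = Pr(next state l | current state k)
              [0.15, 0.75]])
a_begin = np.array([0.5, 0.5])       # transitions out of the begin state
a_end = np.array([0.10, 0.10])       # transitions into the end state
e = {"A": np.array([0.4, 0.1]),      # e[b][k] = Pr(emit b | state k)
     "C": np.array([0.1, 0.4]),
     "G": np.array([0.1, 0.4]),
     "T": np.array([0.4, 0.1])}

def forward(x):
    """Return Pr(x) = f_N(L), summing over all state paths."""
    f = a_begin * e[x[0]]                    # f_k(1) = a_{B k} e_k(x_1)
    for symbol in x[1:]:
        f = e[symbol] * (f @ a)              # f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl
    return float(f @ a_end)                  # Pr(x) = sum_k f_k(L) a_kN

print(forward("CGGT"))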
61 Forward algorithm example
62 Three Important Questions
- How likely is a given sequence?
- What is the most probable path for generating a given sequence?
- Viterbi algorithm
- How can we learn the HMM parameters given a set of sequences?
63 Finding the Most Probable Path: The Viterbi Algorithm
- Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
- We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state
- Can be defined recursively
- Can use DP to find v_N(L) efficiently
64 Finding the Most Probable Path: The Viterbi Algorithm
- Initialisation:
- v_0(0) = 1 (start), v_k(0) = 0 (non-silent states)
- Recursion for emitting states (i = 1...L):
- v_l(i) = e_l(x_i) max_k [v_k(i-1) a_kl]
- Recursion for silent states:
- v_l(i) = max_k [v_k(i) a_kl]
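A matching sketch of the Viterbi recursion and traceback, using the same made-up two-state model as in the forward-algorithm example.

import numpy as np

# Same made-up two-state model as in the forward-algorithm sketch.
a = np.array([[0.80, 0.10], [0.15, 0.75]])
a_begin = np.array([0.5, 0.5])
a_end = np.array([0.10, 0.10])
e = {"A": np.array([0.4, 0.1]), "C": np.array([0.1, 0.4]),
     "G": np.array([0.1, 0.4]), "T": np.array([0.4, 0.1])}

def viterbi(x):
    """Return the most probable state path (as state indices) and its probability."""
    v = a_begin * e[x[0]]                              # v_k(1)
    pointers = []
    for symbol in x[1:]:
        scores = v[:, None] * a                        # scores[k, l] = v_k(i-1) * a_kl
        pointers.append(scores.argmax(axis=0))         # best predecessor for each state l
        v = e[symbol] * scores.max(axis=0)             # v_l(i) = e_l(x_i) * max_k [...]
    last = int(np.argmax(v * a_end))
    path = [last]
    for ptr in reversed(pointers):                     # trace back the best path
        path.append(int(ptr[path[-1]]))
    return list(reversed(path)), float((v * a_end)[last])

print(viterbi("CGGT"))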
65 Finding the Most Probable Path: The Viterbi Algorithm
66 Three Important Questions
- How likely is a given sequence? (clustering)
- What is the most probable path for generating a given sequence? (alignment)
- How can we learn the HMM parameters given a set of sequences?
- The Baum-Welch Algorithm
67 The Learning Problem
Generally, the learning problem is how to adjust the HMM parameters so that the given set of observations (called the training set) is represented by the model in the best way for the intended application. Thus the "quantity" we wish to optimize during the learning process can differ from application to application; in other words, there may be several optimization criteria for learning, out of which a suitable one is selected depending on the application. There are two main optimization criteria found in the literature: Maximum Likelihood (ML) and Maximum Mutual Information (MMI).
68 The Learning Task
- Given
- a model
- a set of sequences (the training set)
- Do
- find the most likely parameters to explain the training sequences
- The goal is to find a model that generalizes well to sequences we haven't seen before
69 Learning Parameters
- If we know the state path for each training sequence, learning the model parameters is simple
- no hidden state during training
- count how often each parameter is used
- normalize/smooth to get probabilities
- process just like it was for Markov chain models
- If we don't know the path for each training sequence, how can we determine the counts?
- key insight: estimate the counts by considering every path weighted by its probability
70 Learning Parameters: The Baum-Welch Algorithm
- An EM (expectation maximization) approach, a forward-backward algorithm
- Algorithm sketch
- initialize parameters of model
- iterate until convergence
- calculate the expected number of times each transition or emission is used
- adjust the parameters to maximize the likelihood of these expected values
- Baum-Welch has the important feature that it always converges
71 The Expectation step
72 The Expectation step
73 The Expectation step
74 The Expectation step
75 The Expectation step
- First, we need to know the probability of the i-th symbol being produced by state k, given sequence x:
- Pr(π_i = k | x)
- Given this we can compute our expected counts for state transitions and character emissions
76 The Expectation step
77 The Backward Algorithm
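The posterior Pr(π_i = k | x) used in the expectation step is typically obtained by combining forward and backward values, Pr(π_i = k | x) = f_k(i) b_k(i) / Pr(x). Below is a sketch of the backward recursion and the posterior computation, reusing the same made-up two-state model as in the earlier forward and Viterbi examples.

import numpy as np

# Same made-up two-state model as in the forward and Viterbi sketches.
a = np.array([[0.80, 0.10], [0.15, 0.75]])
a_begin = np.array([0.5, 0.5])
a_end = np.array([0.10, 0.10])
e = {"A": np.array([0.4, 0.1]), "C": np.array([0.1, 0.4]),
     "G": np.array([0.1, 0.4]), "T": np.array([0.4, 0.1])}

def forward_matrix(x):
    F = np.zeros((len(x), 2))
    F[0] = a_begin * e[x[0]]                       # f_k(1)
    for i in range(1, len(x)):
        F[i] = e[x[i]] * (F[i - 1] @ a)            # f_l(i) = e_l(x_i) sum_k f_k(i-1) a_kl
    return F

def backward_matrix(x):
    B = np.zeros((len(x), 2))
    B[-1] = a_end                                  # b_k(L) = a_{k,end}
    for i in range(len(x) - 2, -1, -1):
        B[i] = a @ (e[x[i + 1]] * B[i + 1])        # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
    return B

def posteriors(x):
    F, B = forward_matrix(x), backward_matrix(x)
    px = float(F[-1] @ a_end)                      # Pr(x)
    return F * B / px                              # Pr(pi_i = k | x) for every i and k

print(posteriors("CGGT"))                          # each row sums to 1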
78 The Expectation step
79 The Expectation step
80 The Expectation step
81 The Maximization step
82 The Maximization step
83 The Baum-Welch Algorithm
- Initialize parameters of model
- Iterate until convergence
- calculate the expected number of times each transition or emission is used
- adjust the parameters to maximize the likelihood of these expected values
- This algorithm will converge to a local maximum (in the likelihood of the data given the model)
- Usually in a fairly small number of iterations
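For completeness, a minimal sketch of the maximization step: the expected transition and emission counts gathered during the expectation step (hypothetical numbers below) are simply normalized to give the updated probabilities.

import numpy as np

# Hypothetical expected counts accumulated during the expectation step.
expected_transitions = np.array([[40.2, 4.8],               # A[k, l]: expected uses of transition k -> l
                                 [6.1, 30.9]])
expected_emissions = np.array([[20.5, 5.0, 5.5, 19.0],      # E[k, b]: expected emissions of base b in state k
                               [4.0, 16.5, 17.0, 4.5]])

# Maximization step: a_kl = A_kl / sum_l' A_kl' and e_k(b) = E_k(b) / sum_b' E_k(b')
a_new = expected_transitions / expected_transitions.sum(axis=1, keepdims=True)
e_new = expected_emissions / expected_emissions.sum(axis=1, keepdims=True)

print(a_new)   # each row is a probability distribution over next states
print(e_new)   # each row is a probability distribution over emitted symbols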