Title: PROLOGUE: Pitfalls of standard alignments

1. PROLOGUE: Pitfalls of standard alignments
2. Scoring a pairwise alignment
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP
(scored with the BLOSUM62 substitution matrix)
3. Alignment of a family (globins)
Different positions are not equivalent.
4. Sequence logos
http://weblogo.berkeley.edu/cache/file5h2DWc.png
The substitution score IN A FAMILY should depend on the position (and the same holds for gaps). For modelling families we need more flexible tools.
5. Probabilistic Models for Biological Sequences
6. Probabilistic models for sequences
- Generative definition: objects producing different outcomes (sequences) with different probabilities.
- The probability distribution over the sequence space determines the model specificity.
[Figure: probability distribution of model M over the sequence space]
M generates si with probability P(si | M); e.g. M is the representation of the family of globins.
7. Probabilistic models for sequences
- We don't need a generator of new biological sequences: the generative definition is useful as an operative definition.
- Associative definition: objects that, given an outcome (sequence), compute a probability value.
[Figure: probability distribution of model M over the sequence space]
M associates the probability P(si | M) to si; e.g. M is the representation of the family of globins.
8. Probabilistic models for sequences
The most useful probabilistic models are trainable systems: the probability density function over the sequence space is estimated from known examples by means of a learning algorithm.
[Figure: known examples → pdf estimate (generalisation), both drawn as distributions over the sequence space]
E.g.: writing a generic representation of the sequences of globins starting from a set of known globins.
9. Probabilistic Models for Biological Sequences
- What are they?
- Why use them?
10. Modelling a protein family
Given a protein class (e.g. globins), a probabilistic model trained on this family can compute a probability value for each new sequence. This value measures the similarity between the new sequence and the family described by the model.
11. Probabilistic Models for Biological Sequences
- What are they?
- Why use them?
- Which probabilities do they compute?
12. P(s | M) or P(M | s)?
A model M associates to a sequence s the probability P(s | M). This probability answers the question: "What is the probability for a model describing the globins to generate the sequence s?" The question we want to answer, however, is: "Given a sequence s, is it a globin?" We need to compute P(M | s)!
13. Bayes' Theorem
Joint probability: P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X)
So:
P(Y | X) = P(X | Y) P(Y) / P(X)
and, for a model M and a sequence s:
P(M | s) = P(s | M) P(M) / P(s)
P(s | M): evidence M, conclusion s. P(M | s): evidence s, conclusion M. P(M) and P(s) are the a priori probabilities.
14. The a priori probabilities
P(M | s) = P(s | M) P(M) / P(s)
P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence; it can be estimated as the abundance of the class.
P(s) is the probability of the sequence in the sequence space. It cannot be reliably estimated!
15. Comparison between models
We can overcome the problem by comparing the probabilities of generating s from different models:
P(M1 | s) / P(M2 | s) = [P(s | M1) P(M1)] / [P(s | M2) P(M2)]
where P(M1)/P(M2) is the ratio between the abundances of the two classes.
16. Null model
Otherwise we can score a sequence for a model M by comparing it to a Null Model: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance:
S(M, s) = log [ P(s | M) / P(s | N) ]
[Figure: distributions of S(M, s) for sequences belonging and NOT belonging to model M]
In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value).
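A minimal sketch of this log-odds score (not from the slides; the P(s|M) value and the background composition table are hypothetical placeholders):

import math

def null_model_prob(seq, background):
    # P(s | N): product of the background abundances of the residues
    p = 1.0
    for c in seq:
        p *= background[c]
    return p

def log_odds_score(p_s_given_m, seq, background):
    # S(M, s) = log [ P(s|M) / P(s|N) ]
    return math.log(p_s_given_m / null_model_prob(seq, background))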
17. The simplest probabilistic models: Markov Models
18. Markov Models. Example: Weather
Register the weather conditions day by day; as a first hypothesis, the weather condition on a day depends ONLY on the weather conditions of the day before. Define the conditional probabilities P(C|C), P(C|R), ..., P(R|C), ...
States: C = Clouds, R = Rain, F = Fog, S = Sun.
The probability for the 5-day registration CRRCS:
P(CRRCS) = P(C) P(R|C) P(R|R) P(C|R) P(S|C)
19. Markov Model
A stochastic generator of sequences in which the probability of the state at position i depends ONLY on the state at position i-1. Given an alphabet C = {c1, c2, c3, ..., cN}, a Markov model is described by N(N+2) parameters a_rt, a_BEGIN,t, a_r,END, with r, t ∈ C:
a_rq = P(s_i = q | s_i-1 = r)
a_BEGIN,q = P(s_1 = q)
a_r,END = P(s_T = END | s_T-1 = r)
[Diagram: fully connected states c1 ... cN plus BEGIN and END]
Normalisation: Σ_t a_rt + a_r,END = 1 ∀ r;  Σ_t a_BEGIN,t = 1
20. Markov Models
Given the sequence s = s1 s2 s3 s4 ... sT, with s_i ∈ C = {c1, c2, c3, ..., cN}:
P(s | M) = P(s1) · Π_i=2..T P(s_i | s_i-1)
P(ALKALI) = a_BEGIN,A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I,END
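A minimal sketch of this computation (not from the slides; the parameter dictionaries are hypothetical placeholders):

def markov_prob(seq, a_begin, a, a_end):
    # P(s|M) = a_BEGIN,s1 * prod_{i=2..T} a_{s_{i-1} s_i} * a_{s_T,END}
    p = a_begin[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= a[prev][curr]
    return p * a_end[seq[-1]]

# e.g. markov_prob("ALKALI", a_begin, a, a_end) reproduces the product above.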
21. Markov Models: Exercise
1) Fill in the undefined values of the transition probabilities.
22. Markov Models: Exercise
2) Which model better describes the weather in summer? Which one better describes the weather in winter?
23. Markov Models: Exercise
3) Given the sequence CSSSCFS, which model (Winter or Summer) gives the higher probability? Consider the starting probabilities P(X | BEGIN) = 0.25.
24. Markov Models: Exercise
P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5
P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4
4) Can we conclude that the observed sequence refers to a summer week?
25. Markov Models: Exercise
P(Seq | Winter) = 1.2 × 10^-5;  P(Seq | Summer) = 6.4 × 10^-4
To decide, compare the posteriors:
P(Summer | Seq) / P(Winter | Seq) = [P(Seq | Summer) P(Summer)] / [P(Seq | Winter) P(Winter)]
26. Simple Markov Model for DNA sequences
DNA: C = {Adenine, Cytosine, Guanine, Thymine}
16 transition probabilities (12 of which are independent), 4 BEGIN probabilities, 4 END probabilities.
The parameters of the model are different in different zones of DNA. They describe the overall composition and the dinucleotide recurrences.
27. Example of Markov Models: CpG islands
GATGCGTCGC
CTACGCAGCG
In the Markov Model of CpG islands, a_CG is higher than in the Markov Model of non-CpG islands.
Given a sequence s, we can evaluate S = log [ P(s | CpG) / P(s | non-CpG) ].
28. The simplest probabilistic models: Markov Models
29. Probabilistic training of a parametric method
Generally speaking, a parametric model M (with parameters θ) aims to reproduce a set of known data.
[Diagram: model M (parameters θ) → modelled data, compared with the real data D]
How to compare them?
30. Training of Markov Models
Let θ_M be the set of parameters of model M. During the training phase, the θ_M parameters are estimated from the set of known data D.
Maximum Likelihood Estimation (ML):
θ_ML = argmax_θ P(D | M, θ)
It can be proved that the ML estimates are the frequencies of occurrence as counted in the data set D.
Maximum A Posteriori Estimation (MAP):
θ_MAP = argmax_θ P(θ | M, D) = argmax_θ P(D | M, θ) · P(θ)
31. Example (coin tossing)
Given N tosses of a coin (our data D), the outcomes are h heads and t tails (N = t + h). ASSUME the model:
P(D | M) = p^h (1 - p)^t
Maximising the likelihood P(D | M), we obtain the estimate
p = h / (h + t) = h / N
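A worked version of the maximisation (standard calculus, filling in the step the slide leaves implicit):

\log P(D \mid M) = h \log p + t \log(1-p)
\frac{d}{dp} \log P(D \mid M) = \frac{h}{p} - \frac{t}{1-p} = 0
\;\Rightarrow\; h(1-p) = t\,p \;\Rightarrow\; p = \frac{h}{h+t} = \frac{h}{N}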
32. Example (error measure)
Suppose your data are affected by a Gaussian error, so that they are distributed according to
F(x_i) = A exp[-(x_i - m)^2 / (2 s^2)],  with A = 1 / √(2π s^2)
If your measures are independent, the data likelihood is
P(Data | model) = Π_i F(x_i)
Find the m and s that maximise P(Data | model).
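Carrying out the maximisation (a standard result, not shown on the slide):

\log P(\mathrm{Data} \mid \mathrm{model}) = -N \log\sqrt{2\pi s^2} \;-\; \sum_i \frac{(x_i - m)^2}{2 s^2}
\frac{\partial}{\partial m} = 0 \;\Rightarrow\; m = \frac{1}{N}\sum_i x_i
\frac{\partial}{\partial s} = 0 \;\Rightarrow\; s^2 = \frac{1}{N}\sum_i (x_i - m)^2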
33. Maximum Likelihood training: proof
Given a sequence s contained in D, s = s1 s2 s3 s4 ... sT, we can count the number of transitions between any two states j and k: n_jk (where states 0 and N+1 are BEGIN and END).
The normalisation constraints are taken into account using the Lagrange multipliers λ_k.
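The derivation, reconstructed under the usual assumptions (the slide's equations were lost in extraction): the log-likelihood is Σ_{j,k} n_jk log a_jk with the constraints Σ_k a_jk = 1, so

\mathcal{L} = \sum_{j,k} n_{jk} \log a_{jk} + \sum_j \lambda_j \Big(1 - \sum_k a_{jk}\Big)
\frac{\partial \mathcal{L}}{\partial a_{jk}} = \frac{n_{jk}}{a_{jk}} - \lambda_j = 0
\;\Rightarrow\; a_{jk} = \frac{n_{jk}}{\sum_{k'} n_{jk'}}

i.e. the ML estimates are the observed transition frequencies.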
34. Hidden Markov Models
35. Loaded dice
We have 99 regular dice (R) and 1 loaded die (L).
     P(1)  P(2)  P(3)  P(4)  P(5)  P(6)
R    1/6   1/6   1/6   1/6   1/6   1/6
L    1/10  1/10  1/10  1/10  1/10  1/2
Given a sequence
4156266656321636543662152611536264162364261664616263
we don't know the sequence of dice that generated it:
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
36. Loaded dice
Hypothesis: we choose a (possibly different) die for each roll. Two stochastic processes give rise to the sequence of observations:
1) choosing the die (R or L);
2) rolling the die.
The sequence of dice is hidden. The first process is assumed to be Markovian (in this case a 0-order MM); the outcome of the second process depends only on the state reached in the first process (that is, the chosen die).
37. Casino
Model: each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}. The emission probabilities depend only on the state. The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π). The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L, with a_RR = a_LL = 0.99 and a_RL = a_LR = 0.01]
38. The observation sequence (s) is generated by two concomitant stochastic processes
s: 4156266656321636543662152611
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRR
Step: choose the state R (probability 0.99, given the previous state R); choose the symbol 1 (probability 1/6, given R).
39. The observation sequence (s) is generated by two concomitant stochastic processes
s: 41562666563216365436621526115
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRRL
Step: choose the state L (probability 0.01, given the previous state R); choose the symbol 5 (probability 1/10, given L).
40. Loaded dice
Model: each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}. The emission probabilities depend only on the state. The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π). The observation sequence (s) is generated by two concomitant stochastic processes.
41. Some not-so-serious examples
1) DEMOGRAPHY. Observable: number of births and deaths in a year in a village. Hidden variable: economic conditions (as a first approximation we can consider success in business as a random variable and, by consequence, wealth as a Markov variable) → can we deduce the economic conditions of a village during a century by means of the register of births and deaths?
2) THE METEOROPATHIC TEACHER. Observable: average of the marks that a meteoropathic teacher gives to their students during a day. Hidden variable: weather conditions → can we deduce the weather conditions during a year by means of the class register?
42. To be more serious
1) SECONDARY STRUCTURE. Observable: protein sequence. Hidden variable: secondary structure → can we deduce (predict) the secondary structure of a protein given its amino acid sequence?
2) ALIGNMENT. Observable: protein sequence. Hidden variable: position of each residue along the alignment of a protein family → can we align a protein to a family, starting from its amino acid sequence?
43. Hidden Markov Models
- Preliminary examples
- Formal definition
44. Formal definition of Hidden Markov Models
A HMM is a stochastic generator of sequences characterised by:
- N states
- A set of transition probabilities between two states: a_kj = P(π(i) = j | π(i-1) = k)
- A set of starting probabilities: a_0k = P(π(1) = k)
- A set of ending probabilities: a_k0 = P(π(i) = END | π(i-1) = k)
- An alphabet C with M characters
- A set of emission probabilities for each state: e_k(c) = P(s_i = c | π(i) = k)
Constraints:
Σ_k a_0k = 1
a_k0 + Σ_j a_kj = 1  ∀ k
Σ_c∈C e_k(c) = 1  ∀ k
(s: sequence; π: path through the states)
45. Generating a sequence with a HMM
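A minimal generative sketch (not from the slides). It samples a fixed-length (sequence, path) pair; the starting distribution a_begin = {R: 0.5, L: 0.5} is an assumption, since the slides do not give it:

import random

def sample_hmm(a_begin, a, e, length, rng=random):
    def draw(dist):
        r, acc = rng.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r < acc:
                return outcome
        return outcome  # guard against floating-point rounding
    state = draw(a_begin)
    seq, path = [], []
    for _ in range(length):
        path.append(state)
        seq.append(draw(e[state]))   # emission depends only on the state
        state = draw(a[state])       # Markovian transition
    return seq, path

# Casino model of slide 37 (a_begin is an assumption):
a_begin = {'R': 0.5, 'L': 0.5}
a = {'R': {'R': 0.99, 'L': 0.01}, 'L': {'L': 0.99, 'R': 0.01}}
e = {'R': {i: 1/6 for i in range(1, 7)},
     'L': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}
seq, path = sample_hmm(a_begin, a, e, 52)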
46. CpG island, simple model
- s: AGCGCGTAATCTG
- π: YYYYYYYNNNNNN
- P(s, π | M) can be easily computed
47. P(s, π | M) can be easily computed
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.25 · 0.25 · 0.25 · 0.25 · 0.25 · 0.25
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.2 · 0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.1
Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
48. Evaluation of the joint probability of the sequence and the path
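A sketch of this computation in code; the Y/N parameters are those that can be read off slides 47, 60 and 62:

a0 = {'Y': 0.2, 'N': 0.8}
a  = {'Y': {'Y': 0.7, 'N': 0.2, 'END': 0.1},
      'N': {'Y': 0.1, 'N': 0.8, 'END': 0.1}}
e  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
      'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}

def joint_prob(s, path):
    # P(s, pi | M): begin transition and first emission, then
    # (transition * emission) at each step, then the END transition
    p = a0[path[0]] * e[path[0]][s[0]]
    for i in range(1, len(s)):
        p *= a[path[i - 1]][path[i]] * e[path[i]][s[i]]
    return p * a[path[-1]]['END']

print(joint_prob("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))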
49. Hidden Markov Models
- Preliminary examples
- Formal definition
- Three questions
50. CpG island, simple model
- s: AGCGCGTAATCTG
- π: ?????????????
- P(s, π | M) can be easily computed
- How to evaluate P(s | M)?
51. How to evaluate P(s | M)?
P(s | M) = Σ_π P(s, π | M)
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
...
There are 2^13 different paths. Summing over all the paths will give the probability of having the sequence.
52. Resumé: CpG island, simple model
- s: AGCGCGTAATCTG
- π: ?????????????
- P(s, π | M) can be easily computed
- How to evaluate P(s | M)?
- Can we show the hidden path?
53. Can we show the hidden path?
π* = argmax_π P(π | s, M) = argmax_π P(π, s | M)
(the same 2^13 paths enumerated in slide 51)
Viterbi path: the path that gives the best joint probability.
54. Can we show the hidden path?
A posteriori decoding: for each position choose the state
π^(i) = argmax_k P(π(i) = k | s, M)
The contribution to this probability derives from all the paths that go through state k at position i. The a posteriori path can be a non-sense path (it may not be a legitimate path, if some transitions are not permitted in the model).
55. CpG island, simple model
- s: AGCGCGTAATCTG
- π: YYYYYYYNNNNNN
- P(s, π | M) can be easily computed
- How to evaluate P(s | M)?
- Can we show the hidden path?
- Can we evaluate the parameters starting from known examples?
56. Can we evaluate the parameters starting from known examples?
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   e_Y(A) · e_Y(G) · e_Y(C) · e_Y(G) · e_Y(C) · e_Y(G) · e_Y(T) · e_N(A) · e_N(A) · e_N(T) · e_N(C) · e_N(T) · e_N(G)
Transition: a_0Y · a_YY · a_YY · a_YY · a_YY · a_YY · a_YY · a_YN · a_NN · a_NN · a_NN · a_NN · a_NN · a_N0
How to find the parameters e and a that maximise this probability? And what if we don't know the path?
57. Hidden Markov Models: Algorithms
- Resumé
- Evaluating P(s | M): Forward Algorithm
58. Computing P(s, π | M) for each path is a redundant operation
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.1 · 0.1 · 0.1 · 0.4 · 0.1 · 0.4
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.1
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y N
Emission:   0.1 · 0.4 · 0.4 · 0.4 · 0.4 · 0.4 · 0.1 · 0.1 · 0.1 · 0.1 · 0.4 · 0.1 · 0.25
Transition: 0.2 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.7 · 0.2 · 0.1

59. Computing P(s, π | M) for each path is a redundant operation
If we compute the common part only once, we gain 2(T-1) operations.
60. Summing over all the possible paths
s = AG, π = YY: emission 0.1 · 0.4,  transition 0.2 · 0.7 → 0.0056
s = AG, π = NY: emission 0.25 · 0.4, transition 0.8 · 0.1 → 0.008
s = AG, π = YN: emission 0.1 · 0.25, transition 0.2 · 0.2 → 0.001
s = AG, π = NN: emission 0.25 · 0.25, transition 0.8 · 0.8 → 0.04
61. Summing over all the possible paths
Group the paths by their last state:
F_Y(2) = 0.0056 + 0.008 = 0.0136;  F_N(2) = 0.001 + 0.04 = 0.041
For the third character (C):
paths ending ...Y Y: F_Y(2) · 0.7 · 0.4;  paths ending ...N Y: F_N(2) · 0.1 · 0.4
paths ending ...Y N: F_Y(2) · 0.2 · 0.25; paths ending ...N N: F_N(2) · 0.8 · 0.25
62. Summing over all the possible paths
Iterating until the last position of the sequence AGCGCGTAATCTG (all paths ending in Y, and all paths ending in N), and closing with the END transitions:
P(s | M) = F_Y(T) · 0.1 (a_Y0) + F_N(T) · 0.1 (a_N0)
63. Summing over all the possible paths
If we know the probabilities of emitting the first two characters of the sequence, ending the path in states R and L respectively:
F_R(2) = P(s1, s2, π(2) = R | M)  and  F_L(2) = P(s1, s2, π(2) = L | M)
then we can compute:
P(s1, s2, s3, π(3) = R | M) = [F_R(2) · a_RR + F_L(2) · a_LR] · e_R(s3)
64. Forward Algorithm
On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems. For each state k and each position i in the sequence, we compute:
F_k(i) = P(s1 s2 s3 ... s_i, π(i) = k | M)
Initialisation:
F_BEGIN(0) = 1;  F_k(0) = 0  ∀ k ≠ BEGIN
Recurrence:
F_l(i+1) = P(s1 s2 ... s_i s_i+1, π(i+1) = l) = Σ_k P(s1 s2 ... s_i, π(i) = k) · a_kl · e_l(s_i+1) = e_l(s_i+1) · Σ_k F_k(i) · a_kl
Termination:
P(s) = P(s1 s2 s3 ... sT, π(T+1) = END) = Σ_k P(s1 s2 ... sT, π(T) = k) · a_k0 = Σ_k F_k(T) · a_k0
(the conditioning on M will be understood)
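A sketch of the Forward recurrence in code (using the a0/a/e dictionaries of the sketch after slide 48):

def forward(s, a0, a, e, states=('Y', 'N')):
    # Initialisation: F_k(1) = a_0k * e_k(s1)
    F = {k: a0[k] * e[k][s[0]] for k in states}
    # Recurrence: F_l(i+1) = e_l(s_{i+1}) * sum_k F_k(i) * a_kl
    for c in s[1:]:
        F = {l: e[l][c] * sum(F[k] * a[k][l] for k in states) for l in states}
    # Termination: P(s | M) = sum_k F_k(T) * a_k0
    return sum(F[k] * a[k]['END'] for k in states)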
65. Forward Algorithm
[Figure: trellis of the Forward iteration; the states are on the vertical axis, the positions 0, 1, ..., T, T+1 on the horizontal axis. F_B(2) = e_B(s2) · Σ_i F_i(1) · a_iB; P(s | M) is collected at the END state.]
66. Forward algorithm: computational complexity
Naïve method:
- P(s | M) = Σ_π P(s, π | M)
- There are N^T possible paths.
- Each path requires about 2T operations.
- The time for the computation is O(T · N^T).
(See the two example paths of slide 58, which share almost all of their factors.)
67. Forward algorithm: computational complexity
Forward algorithm:
- T positions, N values for each position.
- Each element requires about 2N products and 1 sum.
- The time for the computation is O(T · N²).
Example (third position, s3 = C):
F_Y(3) = (0.0136 · 0.7 + 0.041 · 0.1) · 0.4 = 0.005448
F_N(3) = (0.0136 · 0.2 + 0.041 · 0.8) · 0.25 = 0.00888
68. Forward algorithm: computational complexity
Naïve method: O(T · N^T). Forward algorithm: O(T · N²).
69. Hidden Markov Models: Algorithms
- Resumé
- Evaluating P(s | M): Forward Algorithm
- Evaluating P(s | M): Backward Algorithm
70. Backward Algorithm
Similar to the Forward algorithm, it computes P(s | M) reconstructing the sequence from the end. For each state k and each position i in the sequence, we compute:
B_k(i) = P(s_i+1 s_i+2 s_i+3 ... sT | π(i) = k)
Initialisation:
B_k(T) = P(π(T+1) = END | π(T) = k) = a_k0
Recurrence:
B_l(i-1) = P(s_i s_i+1 ... sT | π(i-1) = l) = Σ_k P(s_i+1 s_i+2 ... sT | π(i) = k) · a_lk · e_k(s_i) = Σ_k B_k(i) · e_k(s_i) · a_lk
Termination:
P(s) = P(s1 s2 s3 ... sT | π(0) = BEGIN) = Σ_k P(s2 ... sT | π(1) = k) · a_0k · e_k(s1) = Σ_k B_k(1) · a_0k · e_k(s1)
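A matching sketch of the Backward recurrence (same conventions as the Forward sketch):

def backward(s, a0, a, e, states=('Y', 'N')):
    # Initialisation: B_k(T) = a_k0
    B = {k: a[k]['END'] for k in states}
    # Recurrence: B_l(i-1) = sum_k a_lk * e_k(s_i) * B_k(i), for i = T ... 2
    for c in reversed(s[1:]):
        B = {l: sum(a[l][k] * e[k][c] * B[k] for k in states) for l in states}
    # Termination: P(s | M) = sum_k a_0k * e_k(s1) * B_k(1)
    return sum(a0[k] * e[k][s[0]] * B[k] for k in states)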
71. Backward Algorithm
[Figure: trellis of the Backward iteration; the states are on the vertical axis, the positions 0, 1, 2, ..., T-1, T, T+1 on the horizontal axis. B_k(T) = a_k0; B_B(T-1) = Σ_k a_Bk · e_k(sT) · B_k(T); P(s | M) is collected at BEGIN.]
72. Hidden Markov Models: Algorithms
- Resumé
- Evaluating P(s | M): Forward Algorithm
- Evaluating P(s | M): Backward Algorithm
- Showing the path: Viterbi decoding
73. Finding the best path
s = AG, π = YY: emission 0.1 · 0.4,  transition 0.2 · 0.7 → 0.0056
s = AG, π = NY: emission 0.25 · 0.4, transition 0.8 · 0.1 → 0.008
s = AG, π = YN: emission 0.1 · 0.25, transition 0.2 · 0.2 → 0.001
s = AG, π = NN: emission 0.25 · 0.25, transition 0.8 · 0.8 → 0.04
74. Finding the best path
Keep, for each state, only the best path reaching it:
V_Y(2) = max(0.0056, 0.008) = 0.008 (path NY);  V_N(2) = max(0.001, 0.04) = 0.04 (path NN)
For the third character (C):
V_Y(3) = 0.4 · max(0.008 · 0.7, 0.04 · 0.1) = 0.4 · max(0.0056, 0.004) = 0.00224 (path NYY; the discarded candidate NNY gives 0.0016)
V_N(3) = 0.25 · max(0.008 · 0.2, 0.04 · 0.8) = 0.25 · max(0.0016, 0.032) = 0.008 (path NNN; the discarded candidate NYN gives 0.0004)
75. Finding the best path
Iterating until the last position of the sequence AGCGCGTAATCTG (e.g. the paths N Y Y Y Y Y Y N N N Y Y Y and N N N N N N N N N N N N N), closing each with the END transition (× 0.1, a_Y0 or a_N0), and choosing the maximum.
76. Viterbi Algorithm
π* = argmax_π P(π, s | M)
The computation of P(s, π | M) can be decomposed into simpler problems. Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 ... s_i and ending in state k at iteration i.
Initialisation:
V_BEGIN(0) = 1;  V_k(0) = 0  ∀ k ≠ BEGIN
Recurrence:
V_l(i+1) = e_l(s_i+1) · max_k (V_k(i) · a_kl);  ptr_i+1(l) = argmax_k (V_k(i) · a_kl)
Termination:
P(s, π*) = max_k (V_k(T) · a_k0);  π*(T) = argmax_k (V_k(T) · a_k0)
Traceback: π*(i-1) = ptr_i(π*(i))
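A sketch of the Viterbi recurrence with traceback (same conventions as the Forward sketch):

def viterbi(s, a0, a, e, states=('Y', 'N')):
    V = {k: a0[k] * e[k][s[0]] for k in states}
    pointers = []
    for c in s[1:]:
        # ptr(l): the predecessor maximising V_k(i) * a_kl
        step = {l: max(states, key=lambda k: V[k] * a[k][l]) for l in states}
        V = {l: e[l][c] * V[step[l]] * a[step[l]][l] for l in states}
        pointers.append(step)
    # Termination: best state before END
    last = max(states, key=lambda k: V[k] * a[k]['END'])
    best = V[last] * a[last]['END']
    # Traceback: pi*(i-1) = ptr_i(pi*(i))
    path = [last]
    for step in reversed(pointers):
        path.append(step[path[-1]])
    return ''.join(reversed(path)), best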
77. Viterbi Algorithm
[Figure: trellis of the Viterbi iteration; the states are on the vertical axis, the positions 0, 1, ..., T, T+1 on the horizontal axis. V_B(2) = e_B(s2) · max_i (V_i(1) · a_iB); ptr_2(B) stores the maximising predecessor; P(s, π* | M) is collected at END.]
78. Viterbi Algorithm
Viterbi path. Note: different paths can have the same probability.
79. Hidden Markov Models: Algorithms
- Resumé
- Evaluating P(s | M): Forward Algorithm
- Evaluating P(s | M): Backward Algorithm
- Showing the path: Viterbi decoding
- Showing the path: a posteriori decoding
- Training a model: EM algorithm
80. If we know the path generating the training sequence
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   e_Y(A) · e_Y(G) · e_Y(C) · e_Y(G) · e_Y(C) · e_Y(G) · e_Y(T) · e_N(A) · e_N(A) · e_N(T) · e_N(C) · e_N(T) · e_N(G)
Transition: a_0Y · a_YY · a_YY · a_YY · a_YY · a_YY · a_YY · a_YN · a_NN · a_NN · a_NN · a_NN · a_NN · a_N0
Just count! Examples:
a_YY = n_YY / (n_YY + n_YN) = 6/7
e_Y(A) = n_Y(A) / [n_Y(A) + n_Y(C) + n_Y(G) + n_Y(T)] = 1/7
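The counting step as a sketch in code (BEGIN/END transitions omitted for brevity; it reproduces the 6/7 and 1/7 of the slide):

from collections import Counter, defaultdict

def ml_counts(seq, path):
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for k, l in zip(path, path[1:]):
        trans[k][l] += 1                      # n_kl
    for k, c in zip(path, seq):
        emit[k][c] += 1                       # n_k(c)
    a = {k: {l: n / sum(cnt.values()) for l, n in cnt.items()} for k, cnt in trans.items()}
    e = {k: {c: n / sum(cnt.values()) for c, n in cnt.items()} for k, cnt in emit.items()}
    return a, e

a, e = ml_counts("AGCGCGTAATCTG", "YYYYYYYNNNNNN")
# a['Y']['Y'] == 6/7 and e['Y']['A'] == 1/7, as on the slide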
81. Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:
θ_ML = argmax_θ P(s | θ, M)
Given a model with parameters θ⁰, the EM algorithm finds new parameters θ that increase the likelihood of the model:
P(s | θ) > P(s | θ⁰)
82. Expectation-Maximisation algorithm
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
...
Given a path π we can count the number of transitions between states k and l, A_kl(π), and the number of emissions of character c from state k, E_k(c, π).
We can compute their expected values over all the paths:
A_kl = Σ_π P(π | s, θ⁰) · A_kl(π);  E_k(c) = Σ_π P(π | s, θ⁰) · E_k(c, π)
The updated parameters are:
a_kl = A_kl / Σ_l' A_kl';  e_k(c) = E_k(c) / Σ_c' E_k(c')
83. Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:
θ_ML = argmax_θ P(s | θ, M)
Given a model with parameters θ⁰, the EM algorithm finds new parameters θ that increase the likelihood of the model:
P(s | θ) > P(s | θ⁰), or equivalently log P(s | θ) > log P(s | θ⁰)
84. Expectation-Maximisation algorithm
log P(s | θ) = log P(s, π | θ) - log P(π | s, θ)
Multiplying by P(π | s, θ⁰) and summing over all the possible paths:
log P(s | θ) = Σ_π P(π | s, θ⁰) log P(s, π | θ) - Σ_π P(π | s, θ⁰) log P(π | s, θ)
The first term is Q(θ | θ⁰), the expectation value of log P(s, π | θ) over the current paths. It follows that
log P(s | θ) - log P(s | θ⁰) ≥ Q(θ | θ⁰) - Q(θ⁰ | θ⁰) ≥ 0
(the second inequality holds when θ is chosen to maximise Q).
85. Expectation-Maximisation algorithm
The EM algorithm is an iterative process; each iteration performs two steps.
E-step: evaluation of Q(θ | θ⁰) = Σ_π P(π | s, θ⁰) log P(s, π | θ)
M-step: maximisation of Q(θ | θ⁰) over all θ
The algorithm is NOT guaranteed to converge to the GLOBAL maximum likelihood.
86. Baum-Welch implementation of the EM algorithm
E-step: Q(θ | θ⁰) = Σ_π P(π | s, θ⁰) log P(s, π | θ)
P(s, π | θ) = a_0,π(1) · Π_i=1..T a_π(i),π(i+1) · e_π(i)(s_i) = Π_k=0..N Π_l=1..N a_kl^A_kl(π) · Π_k=1..N Π_c∈C e_k(c)^E_k(c,π)
A_kl = Σ_π P(π | s, θ⁰) A_kl(π);  E_k(c) = Σ_π P(π | s, θ⁰) E_k(c, π)
(the expected values over all the current paths)
87. Baum-Welch implementation of the EM algorithm
M-step: maximise Q(θ | θ⁰) over θ, with the constraints Σ_l a_kl = 1 and Σ_c e_k(c) = 1 for any state k.
By means of the Lagrange multipliers technique we can solve the system, obtaining:
a_kl = A_kl / Σ_l' A_kl';  e_k(c) = E_k(c) / Σ_c' E_k(c')
88. Baum-Welch implementation of the EM algorithm
How to compute the expected numbers of transitions and emissions over all the paths, using the Forward and Backward matrices:
F_k(i) = P(s1 s2 s3 ... s_i, π(i) = k);  B_k(i) = P(s_i+1 s_i+2 s_i+3 ... sT | π(i) = k)
A_kl = Σ_i P(π(i) = k, π(i+1) = l | s, θ) = Σ_i F_k(i) · a_kl · e_l(s_i+1) · B_l(i+1) / P(s)
E_k(c) = Σ_i:s_i=c P(π(i) = k | s, θ) = Σ_i:s_i=c F_k(i) · B_k(i) / P(s)
89. Baum-Welch implementation of the EM algorithm
Algorithm:
1. Start with random parameters.
2. Compute the Forward and Backward matrices on the known sequences.
3. Compute A_kl and E_k(c), the expected numbers of transitions and emissions.
4. Update: a_kl ← A_kl / Σ_l' A_kl';  e_k(c) ← E_k(c) / Σ_c' E_k(c').
5. Has P(s | M) increased? If yes, go back to step 2; if no, end.
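A compact sketch of one Baum-Welch iteration on a single sequence (same model conventions as the earlier sketches; the re-estimation of the BEGIN/END parameters is omitted for brevity):

def baum_welch_step(s, a0, a, e, states=('Y', 'N')):
    T = len(s)
    # Forward and Backward matrices
    F = [{k: 0.0 for k in states} for _ in range(T)]
    B = [{k: 0.0 for k in states} for _ in range(T)]
    for k in states:
        F[0][k] = a0[k] * e[k][s[0]]
        B[T - 1][k] = a[k]['END']
    for i in range(1, T):
        for l in states:
            F[i][l] = e[l][s[i]] * sum(F[i - 1][k] * a[k][l] for k in states)
    for i in range(T - 2, -1, -1):
        for l in states:
            B[i][l] = sum(a[l][k] * e[k][s[i + 1]] * B[i + 1][k] for k in states)
    Ps = sum(F[T - 1][k] * a[k]['END'] for k in states)
    # E-step: expected transition and emission counts
    A = {k: {l: sum(F[i][k] * a[k][l] * e[l][s[i + 1]] * B[i + 1][l]
                    for i in range(T - 1)) / Ps for l in states} for k in states}
    E = {k: {c: sum(F[i][k] * B[i][k] for i in range(T) if s[i] == c) / Ps
             for c in set(s)} for k in states}
    # M-step: normalise the expected counts
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in E[k]} for k in states}
    return new_a, new_e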
90. Profile HMMs
91. Profile HMMs
92. How to align?
[Diagram: a linear chain of match states M0 → M1 → M2 → M3 → M4 → M5]
Each state represents a position in the alignment:
A C G G T A → M0 M1 M2 M3 M4 M5
A C G A T C → M0 M1 M2 M3 M4 M5
A T G T T C → M0 M1 M2 M3 M4 M5
Each position has a peculiar composition.
93. Given a set of sequences...
A C G G T A
A C G A T C
A T G T T C
...we can train a model (M0 → M1 → M2 → M3 → M4 → M5), estimating the emission probabilities:
    M0   M1    M2   M3    M4   M5
A   1    0     0    0.33  0    0.33
C   0    0.66  0    0     0    0.66
G   0    0     1    0.33  0    0
T   0    0.33  0    0.33  1    0
94. Given a trained model...
    M0   M1    M2   M3    M4   M5
A   1    0     0    0.33  0    0.33
C   0    0.66  0    0     0    0.66
G   0    0     1    0.33  0    0
T   0    0.33  0    0.33  1    0
...we can align a new sequence, A C G A T C, computing the emission probability:
P(s | M) = 1 · 0.66 · 1 · 0.33 · 1 · 0.66
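The same computation as a sketch in code (emission table transcribed from the slide; all transitions have probability 1 in this gapless toy model):

profile = {'A': [1, 0,    0, 0.33, 0, 0.33],
           'C': [0, 0.66, 0, 0,    0, 0.66],
           'G': [0, 0,    1, 0.33, 0, 0],
           'T': [0, 0.33, 0, 0.33, 1, 0]}

def profile_prob(seq):
    p = 1.0
    for i, c in enumerate(seq):
        p *= profile[c][i]   # e_{Mi}(c)
    return p

print(profile_prob("ACGATC"))  # 1 * 0.66 * 1 * 0.33 * 1 * 0.66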
95. And for the sequence AGATC?
(same emission table as above)
A G A T C → M0 M2 M3 M4 M5: position M1 is skipped.
We need a way to introduce gaps.
96. Silent states
Red transitions allow gaps: of the order of N² additional transitions. To reduce the number of parameters we can use states that don't emit any character (silent states): 4N - 8 transitions.
97. Profile HMMs
[Diagram: match states (M), insert states (I), delete states (D, silent)]
A C G G T A → M0 M1 M2 M3 M4 M5
A C G C A G T C → M0 I0 I0 M1 M2 M3 M4 M5 (insertions in I0)
A G A T C → M0 D1 M2 M3 M4 M5 (deletion through the silent state D1)
98. Example of alignment
- Sequence 1: A S T R A L
  Viterbi path: M0 M1 M2 M3 M4 M5
- Sequence 2: A S T A I L
  Viterbi path: M0 M1 M2 D3 M4 I4 M5
- Sequence 3: A R T I
  Viterbi path: M0 M1 M2 D3 D4 M5
99. Example of alignment
-log P(s | M) is an alignment score.
100. Searching for a structural/functional pattern in protein sequences
Zn-binding loop:
C H C I C R I C
C H C L C K I C
C H C I C S L C
D H C L C T I C
C H C I D S I C
C H C L C K I C
Cysteines can be replaced by an Aspartic acid, but only ONCE in each sequence.
101. Searching for a structural/functional pattern in protein sequences
..ALCPCHCLCRICPLIY.. obtains a higher probability than ..WERWDHCIDSICLKDE.. because M0 and M4 have a low emission probability for Aspartic acid, and in the second sequence we multiply them twice.
102. Profile HMMs
- HMMs for alignments
- Example on globins
103. Structural alignment of globins
104. Structural alignment of globins
Bashford D, Chothia C, Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196:199-216
105. Alignment of globins reconstructed with profile HMMs
Krogh A, Brown M, Mian IS, Sjölander K, Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235:1501-1531
106. Discrimination power of profile HMMs
Krogh A, Brown M, Mian IS, Sjölander K, Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235:1501-1531
107. Profile HMMs
- HMMs for alignments
- Example on globins
- Other applications
108. Finding a domain
[Diagram: Begin → profile HMM specific for the considered domain → End, with flanking states allowing the domain to occur anywhere in the sequence]
109. Clustering subfamilies
[Diagram: BEGIN → {HMM 1, HMM 2, HMM 3, ..., HMM n} → END]
Each sequence s contributes to the update of HMM i with a weight equal to P(s | Mi).
110. Profile HMMs
- HMMs for alignments
- Example on globins
- Other applications
- Available codes and servers
111. HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763
112. HMMER
hmmbuild: takes the aligned sequences of a protein family, checks for redundancy, and sets the emission and transition probabilities of a HMM → trained profile-HMM.
hmmcalibrate: takes a trained HMM, generates a great number of random sequences, scores them, and fits the Extreme Value Distribution to the computed scores → HMM calibrated with accurate E-value statistics.
113. HMMER
hmmalign: set of sequences + HMM → alignment of all the sequences to the model.
hmmsearch: set of sequences + HMM → list of sequences that match the HMM (sorted by E-value).
hmmpfam: one sequence + set of HMMs → list of HMMs that match the sequence.
114. MSF format: globins50.msf
115. hmmbuild globin.hmm globins50.msf
hmmbuild: alignment of a protein family → trained profile-HMM. All the transition and emission parameters are estimated by means of the Expectation-Maximisation algorithm on the aligned sequences.
In principle we could also use NON-aligned sequences to train the model. Nevertheless, it is more efficient to build the starting alignment using, for example, CLUSTALW.
116. hmmcalibrate --num N --histfile globin.histo globin.hmm
A number N (default 5000) of random sequences are generated and scored with the model; the score is log P(s|M)/P(s|N).
[Histogram: score distribution of the random sequences, with the range for true globin sequences marked above it]
E-value(S) = expected number of random sequences with a score > S.
hmmcalibrate → HMM calibrated with the accurate E-value statistics.
117. Trained model (globin.hmm)
HMMER2.0 [2.3.2]
NAME  globins50
LENG  143
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   /home/gigi/bin/hmmbuild globin.hmm globins50.msf
COM   /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
NSEQ  50
DATE  Sun May 29 19:03:18 2005
CKSUM 9858
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD   -38.893742 0.243153
HMM     A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -450 -1900
  1   591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14
  -   -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
  -   -23 -6528 -7571 -894 -1115 -701 -1378 -450
  2   -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15
  -   -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
  -   -23 -6528 -7571 -894 -1115 -701 -1378
  3   -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16
  -   -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
  -   -23 -6528 -7571 -894 -1115 -701 -1378
...
118. Trained model (globin.hmm): null model
(Same model file as in slide 117; the NULT and NULE lines define the null model.)
Scores are stored as INT[1000 · log2(prob/null_prob)]; null_prob = 1 for transitions.
119. Trained model (globin.hmm): null model
(Same model file as in slide 117.)
Scores are stored as INT[1000 · log2(prob/null_prob)]; for emissions, null_prob is the natural abundance (NULE line).
HMMER2.0 2.3.2 NAME globins50 LENG 143 ALPH
Amino RF no CS no MAP yes COM
/home/gigi/bin/hmmbuild globin.hmm
globins50.msf COM /home/gigi/bin/hmmcalibrate
--histfile globin.histo globin.hmm NSEQ 50 DATE
Sun May 29 190318 2005 CKSUM 9858 XT -8455
-4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455 NULE 595 -1558 85
338 -294 453 -1158 197 249 902
-1085 -142 -21 -313 45 531 201
384 -1998 -644 EVD -38.893742
0.243153 HMM A C D E F
G H I K L M N
P Q R S T V W
Y m-gtm m-gti m-gtd i-gtm
i-gti d-gtm d-gtd b-gtm m-gte -450
-1900 1 591 -1587 159 1351
-1874 -201 151 -1600 998 -1591 -693
389 -1272 595 42 -31 27 -693
-1797 -1134 14 - -149 -500 233
43 -381 399 106 -626 210 -466
-720 275 394 45 96 359 117
-369 -294 -249 - -23 -6528 -7571
-894 -1115 -701 -1378 -450 2
-926 -2616 2221 2269 -2845 -1178 -325
-2678 -300 -2596 -1810 220 -1592 939
-974 -671 -939 -2204 -2785 -1925 15
- -149 -500 233 43 -381 399
106 -626 210 -466 -720 275 394
45 96 359 117 -369 -294 -249
- -23 -6528 -7571 -894 -1115 -701
-1378 3 -638 -1715 -680
497 -2043 -1540 23 -1671 2380 -1641
-840 -222 -1595 437 1040 -564 -523
-1363 2124 -1313 16 - -149 -500
233 43 -381 399 106 -626 210
-466 -720 275 394 45 96 359
117 -369 -294 -249 - -23 -6528
-7571 -894 -1115 -701 -1378
Transitions
121. Trained model (globin.hmm)
(Same model file as in slide 117; highlighted: the emission scores.)
122. Trained model (globin.hmm)
(Same model file as in slide 117; highlighted: the emission scores.)
123. Trained model (globin.hmm)
(Same model file as in slide 117; highlighted: the emission scores.)
124. hmmemit -n N globin.hmm
The parameters of the trained profile-HMM are used to generate N new sequences.
125. hmmsearch globin.hmm Artemia.fa > Artemia.globin
Set of sequences + trained profile-HMM → list of sequences that match the HMM (sorted by E-value).
126. Search results (Artemia.globin)
Sequence  Description  Score   E-value   N
--------  -----------  -----   -------   ---
S13421    S13421       474.3   1.7e-143  9

Parsed for domains:
Sequence  Domain  seq-f  seq-t     hmm-f  hmm-t  score  E-value
--------  ------  -----  -----     -----  -----  -----  -------
S13421    7/9       932   1075 ..      1    143   76.9  7.3e-24
S13421    2/9       153    293 ..      1    143   63.7  6.8e-20
S13421    3/9       307    450 ..      1    143   59.8  9.8e-19
S13421    8/9      1089   1234 ..      1    143   57.6  4.5e-18
S13421    9/9      1248   1390 ..      1    143   52.3  1.8e-16
S13421    1/9         1    143 .       1    143   51.2  4e-16
S13421    4/9       464    607 ..      1    143   46.7  8.6e-15
S13421    6/9       775    918 ..      1    143   42.2  2e-13
S13421    5/9       623    762 ..      1    143   23.9  6.6e-08

Alignments of top-scoring domains:
S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24
  consensus:   ->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss
  match:         e a vk  w  v      vG   l      P      FpkF d
  S13421  932   REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL 978
  consensus:   adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr
  match:        d       v  h   V t l  ld       l   Le  H    lr
  S13421  979   -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR 1026
  consensus:   vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk
  match:        v   fk   vl  l    lg  f                    k
  S13421  1027  VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ 1075

(Callouts on the slide: N = number of domains; domains are sorted by E-value; seq-f/seq-t = start/end in the sequence; the top row is the consensus sequence, the bottom row the matched sequence.)
127. hmmalign globin.hmm globins630.fa
Set of sequences + trained profile-HMM → alignment of all the sequences to the model (lower-case and dotted columns: insertions; '-': gaps):
BAHG_VITSP  QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR  TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX  NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO  TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH  -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI  GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE  ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR  QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH  DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA  KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE  QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV
128. HMMER applications: PFAM http://www.sanger.ac.uk/Software/Pfam/
131. PFAM: Exercise
Generate with hmmemit a sequence from the globin model and search for it in the PFAM database.
132. PFAM: Exercise
Take from the SwissProt database the sequences CG301_HUMAN and Q9H5F4_HUMAN.
1) Search them in the PFAM database.
2) Launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?
133. SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjölander K, Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235:1501-1531
134. SAM applications: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
135. HMMPRO: http://www.netid.com/html/hmmpro.html (Pierre Baldi, Net-ID)
136. HMMs for Mapping problems
- Mapping problems in protein prediction
137. Secondary structure
Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
138. Topology of membrane proteins
Topography: position of the TransMembrane Segments along the sequence, e.g. ALALMLCMLTYRHKELKLKLKK (shown on the slide three times, with alternative segment assignments).
139. HMMs for Mapping problems
- Mapping problems in protein prediction
- Labelled HMMs
140. HMM for secondary structure prediction
Simplest model → introducing a grammar.
[Diagram: a simple model with states a1, a2, a3, b1, b2 and c]
141. HMM for secondary structure prediction
Labels: the states a1, a2 and a3 share the same label, and so do the states b1 and b2. Decoding the Viterbi path for emitting a sequence s gives a mapping between the sequence s and a sequence of labels y:
s (sequence): S A L K M N Y T R E I M V A S N Q
π (path):     c a1 a2 a3 a4 a4 a4 c c c c b1 b2 b2 b2 c c
Y(π) (labels): c a a a a a a c c c c b b b b c c
142. Computing P(s, y | M)
Only the paths whose labelling is y have to be considered in the sum. In the Forward and Backward algorithms this means setting
F_k(i) = 0 and B_k(i) = 0 if Y(k) ≠ y_i
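A sketch of the labelled Forward algorithm (same conventions as the earlier Forward sketch; the `label` dictionary, mapping each state to its label, is a hypothetical argument):

def labelled_forward(s, y, a0, a, e, label, states):
    # F_k(i) is forced to 0 whenever Y(k) != y_i
    F = {k: (a0[k] * e[k][s[0]] if label[k] == y[0] else 0.0) for k in states}
    for i in range(1, len(s)):
        F = {l: (e[l][s[i]] * sum(F[k] * a[k][l] for k in states)
                 if label[l] == y[i] else 0.0) for l in states}
    # Returns P(s, y | M)
    return sum(F[k] * a[k]['END'] for k in states)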
143. Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structures), we want to find the parameters of the model without knowing the generating paths:
θ_ML = argmax_θ P(s, y | θ, M)
The algorithm is the same as in the non-labelled case, provided that we use the Forward and Backward matrices defined in the last slide. This is a supervised learning of the mapping.
144. HMM