PROLOGUE: Pitfalls of standard alignments

Transcript and Presenter's Notes
1
PROLOGUE Pitfalls of standard alignments
2
Scoring a pairwise alignment
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP
Scoring matrix: Blosum62
3
Alignment of a family (globins)
Different positions are not equivalent
4
Sequence logos
http://weblogo.berkeley.edu/cache/file5h2DWc.png
The substitution score IN A FAMILY should depend
on the position (the same holds for gaps).
For modelling families we need more flexible tools.
5
Probabilistic Models for Biological Sequences
  • What are they?

6
Probabilistic models for sequences
  • Generative definition
  • Objects producing different outcomes (sequences)
    with different probabilities
  • The probability distribution over the sequence
    space determines the model specificity

[Plot: probability distribution of model M over the sequence space]
M generates s_i with probability P(s_i | M); e.g. M
is the representation of the family of globins.
7
Probabilistic models for sequences
  • We don't need a generator of new biological
    sequences
  • the generative definition is useful as an operative
    definition
  • Associative definition
  • Objects that, given an outcome (sequence),
    compute a probability value

[Plot: probability distribution of model M over the sequence space]
Associates the probability P(s_i | M) to s_i; e.g. M
is the representation of the family of globins.
8
Probabilistic models for sequences
Most useful probabilistic models are trainable
systems: the probability density function over
the sequence space is estimated from known
examples by means of a learning algorithm.

Known examples → pdf estimate (generalisation)
[Plots: probability over the sequence space, before and after training]
e.g. writing a generic representation of the
sequences of globins starting from a set of known
globins.
9
Probabilistic Models for Biological Sequences
  • What are they?
  • Why use them?

10
Modelling a protein family
Probabilistic model
Given a protein class (e.g. globins), a
probabilistic model trained on this family can
compute a probability value for each new sequence.
This value measures the similarity between the
new sequence and the family described by the model.
11
Probabilistic Models for Biological Sequences
  • What are they?
  • Why use them?
  • Which probabilities do they compute?

12
P( s | M ) or P( M | s ) ?
A model M associates to a sequence s the
probability P( s | M ). This probability
answers the question: "Which is the probability
for a model describing the globins to generate
the sequence s?" The question we want to answer
is: "Given a sequence s, is it a globin?" We
need to compute P( M | s ) !!
13
Bayes' Theorem
Joint probability: P(X,Y) = P(X | Y) · P(Y) = P(Y | X) · P(X)
So:
P(Y | X) = P(X | Y) · P(Y) / P(X)
Applied to sequences and models:
P(M | s) = P(s | M) · P(M) / P(s)
P(M) and P(s) are the a priori probabilities;
in P(M | s) the sequence s is the evidence and the model M is the conclusion,
in P(s | M) the model M is the evidence and the sequence s is the conclusion.
14
The a priori probabilities
P(M | s) = P(s | M) · P(M) / P(s)
P(M) is the probability of the model (i.e. of the
class described by the model) BEFORE we know the
sequence; it can be estimated as the abundance of
the class.
P(s) is the probability of the sequence in the
sequence space. It cannot be reliably estimated!!
15
Comparison between models
We can overcome the problem by comparing the
probability of generating s from different models:

P(M1 | s) / P(M2 | s) = [ P(s | M1) / P(s | M2) ] · [ P(M1) / P(M2) ]

P(M1) / P(M2) is the ratio between the abundance of the classes,
and the problematic P(s) cancels out.
16
Null model
Otherwise we can score a sequence for a model M by
comparing it to a Null Model: a model that
generates ALL the possible sequences with
probabilities depending ONLY on the statistical
amino acid abundance:

S(M, s) = log [ P(s | M) / P(s | N) ]

[Histogram: distribution of S(M, s) for sequences belonging and
NOT belonging to model M]
In this case we need a threshold and a statistic
for evaluating the significance (E-value, P-value).
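As an illustration of this log-odds score (not taken from the slides), here is a minimal Python sketch in which both M and the null model N are simple position-independent residue distributions; the frequency values are placeholders.

```python
import math

# Illustrative, position-independent residue probabilities for a family model M
# and a null model N (background composition); the numbers are placeholders.
p_model = {'A': 0.15, 'L': 0.30, 'K': 0.25, 'I': 0.30}
p_null  = {'A': 0.25, 'L': 0.25, 'K': 0.25, 'I': 0.25}

def log_odds_score(seq, p_m, p_n):
    """S(M, s) = log P(s|M) - log P(s|N), accumulated residue by residue."""
    return sum(math.log(p_m[c]) - math.log(p_n[c]) for c in seq)

print(log_odds_score("ALKALI", p_model, p_null))   # positive values favour M over N
```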
17
The simplest probabilistic models: Markov Models
  • Definition

18
Markov Models: Example (Weather)
Register the weather conditions day by day; as a
first hypothesis, the weather condition on a day
depends ONLY on the weather conditions of the
day before. Define the conditional
probabilities P(C|C), P(C|R), ... P(R|C), ...

[State diagram with states C, R, F, S]
C = Clouds, R = Rain, F = Fog, S = Sun
The probability of the 5-day registration
CRRCS is P(CRRCS) = P(C) · P(R|C) · P(R|R) · P(C|R) · P(S|C)
19
Markov Model
A stochastic generator of sequences in which the
probability of the state at position i depends ONLY
on the state at position i-1. Given an alphabet
C = {c1, c2, c3, ..., cN}, a Markov model is
described by N(N+2) parameters a_rt, a_BEGIN,t, a_r,END with r, t ∈ C:
  a_rq = P( s_i = q | s_i-1 = r )
  a_BEGIN,q = P( s_1 = q )
  a_r,END = P( s_T = END | s_T-1 = r )

[State diagram: BEGIN, c1, c2, c3, c4, ..., cN, END]
Constraints:
  Σ_t a_rt + a_r,END = 1  ∀ r
  Σ_t a_BEGIN,t = 1
20
Markov Models
Given the sequence s = s1 s2 s3 s4 ... sT with s_i ∈ C = {c1, c2, c3, ..., cN}:
P( s | M ) = P( s1 ) · Π_i=2..T P( s_i | s_i-1 )

P(ALKALI) = a_BEGIN,A · a_A,L · a_L,K · a_K,A · a_A,L · a_L,I · a_I,END
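A minimal Python sketch of this product, assuming the BEGIN, transition and END probabilities are stored in dictionaries; the toy two-letter alphabet and its numbers are made up, not the slide's parameters.

```python
# First-order Markov model over a toy two-letter alphabet; all numbers are
# illustrative placeholders, not parameters taken from the slides.
a_begin = {'A': 0.6, 'L': 0.4}                    # a_{BEGIN,q} = P(s1 = q)
a_trans = {('A', 'A'): 0.2, ('A', 'L'): 0.7,      # a_{rq} = P(s_i = q | s_{i-1} = r)
           ('L', 'A'): 0.5, ('L', 'L'): 0.4}
a_end   = {'A': 0.1, 'L': 0.1}                    # a_{r,END}

def markov_probability(seq):
    """P(s|M) = a_{BEGIN,s1} * prod_i a_{s_{i-1},s_i} * a_{s_T,END}."""
    p = a_begin[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= a_trans[(prev, cur)]
    return p * a_end[seq[-1]]

print(markov_probability("ALLA"))
```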
21
Markov Models Exercise
1) Fill in the undefined values of the transition
probabilities.
22
Markov Models Exercise
2) Which model better describes the weather in
summer? Which one describes the weather in
winter?
23
Markov Models Exercise
Winter
3) Given the sequence CSSSCFS, which model
gives the higher probability? Consider the
starting probabilities P(X | BEGIN) = 0.25.
Summer
24
Markov Models Exercise
Winter
P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5
P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4
4) Can we conclude that the observation sequence
refers to a summer week?
Summer
25
Markov Models Exercise
Winter
P(Seq | Winter) = 1.2 × 10^-5     P(Seq | Summer) = 6.4 × 10^-4

Summer
To answer question 4 we have to compare
P(Seq | Summer) · P(Summer)   with   P(Seq | Winter) · P(Winter)
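The arithmetic of this exercise can be checked with a few lines of Python; the per-step factors are those quoted above, while the priors P(Winter) and P(Summer) are illustrative placeholders.

```python
from functools import reduce

# Per-step factors quoted on the slide for the observation CSSSCFS
winter_factors = [0.25, 0.1, 0.2, 0.2, 0.3, 0.2, 0.2]
summer_factors = [0.25, 0.4, 0.8, 0.8, 0.1, 0.1, 1.0]

p_seq_winter = reduce(lambda x, y: x * y, winter_factors)   # 1.2e-5
p_seq_summer = reduce(lambda x, y: x * y, summer_factors)   # 6.4e-4

# Question 4 also needs the priors; equal priors are assumed here only as an example.
p_winter, p_summer = 0.5, 0.5
print(p_seq_winter, p_seq_summer)
print("posterior odds (summer : winter) =",
      (p_seq_summer * p_summer) / (p_seq_winter * p_winter))
```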
26
Simple Markov Model for DNA sequences
DNA: C = {Adenine, Cytosine, Guanine, Thymine}
16 transition probabilities (12 of which are
independent), 4 BEGIN probabilities, 4 END
probabilities.
The parameters of the model are different in
different zones of DNA. They describe the overall
composition and the dinucleotide (couple) recurrences.
27
Example of Markov Models: GpC Islands
GATGCGTCGC
CTACGCAGCG
GpC Islands vs. Non-GpC Islands
In the Markov Model of GpC Islands a_GC is higher
than in the Markov Model of Non-GpC Islands.
Given a sequence s we can evaluate the ratio of the probabilities
computed with the two models.
28
The simplest probabilistic models: Markov Models
  • Definition
  • Training

29
Probabilistic training of a parametric method
Generally speaking, a parametric model M aims to
reproduce a set of known data.
[Diagram: model M with parameters θ produces modelled data, to be compared with the real data (D)]
How to compare them?
30
Training of Markov Models
Let θ_M be the set of parameters of model
M. During the training phase, the θ_M parameters are
estimated from the set of known data D.
Maximum Likelihood Estimation (ML):
  θ^ML = argmax_θ P( D | M, θ )
It can be proved that the ML estimates are the
frequencies of occurrence as counted in the data
set D.
Maximum A Posteriori Estimation (MAP):
  θ^MAP = argmax_θ P( θ | M, D ) = argmax_θ P( D | M, θ ) · P(θ)
31
Example (coin tossing)
Given N tosses of a coin (our data D), the
outcomes are h heads and t tails (N = h + t). ASSUME
the model P(D | M) = p^h (1 - p)^t. Computing the
maximum of the likelihood P(D | M) (setting the derivative
h/p - t/(1 - p) of the log-likelihood to zero),
we obtain that our estimate of p is
p = h / (h + t) = h / N
32
Example (error measure)
Suppose you think that your data are affected by
a Gaussian error, so that they are distributed
according to F(x_i) = A · exp[ -(x_i - m)^2 / (2 s^2) ], with
A = 1 / ( s · sqrt(2π) ). If your measures are independent,
the data likelihood is
P(Data | model) = Π_i F(x_i)
Find m and s that maximise P(Data | model).
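A small numerical sketch of this maximisation (the data values are made up; the closed-form ML answer is the sample mean and the sample standard deviation):

```python
import math

data = [1.2, 0.9, 1.5, 1.1, 0.7, 1.3]   # made-up measurements

def neg_log_likelihood(m, s, xs):
    """-log P(Data | m, s) for independent Gaussian errors."""
    return sum(0.5 * ((x - m) / s) ** 2 + math.log(s * math.sqrt(2 * math.pi))
               for x in xs)

# Maximum-likelihood solution in closed form: sample mean and (biased) std
m_ml = sum(data) / len(data)
s_ml = math.sqrt(sum((x - m_ml) ** 2 for x in data) / len(data))
print(m_ml, s_ml, neg_log_likelihood(m_ml, s_ml, data))
```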
33
Maximum Likelihood training: proof
Given a sequence s contained in D, s = s1 s2 s3 s4 ... sT,
we can count the number of transitions between
any two states j and k: n_jk
(where states 0 and N+1 are BEGIN and END).
The ML estimate is a_jk = n_jk / Σ_l n_jl.
Normalisation constraints are taken into account
using the Lagrange multipliers λ_k.
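In code, this ML training is just counting and normalising; a minimal sketch on made-up DNA sequences (BEGIN and END transitions are omitted for brevity):

```python
from collections import defaultdict

# Toy training set; in practice these would be the known sequences in D.
training = ["ACGTACGT", "ACGGCGTA", "ATGCGCGC"]

counts = defaultdict(lambda: defaultdict(int))
for seq in training:
    for prev, cur in zip(seq, seq[1:]):
        counts[prev][cur] += 1          # n_jk: transitions from j to k

# Maximum-likelihood estimate: a_jk = n_jk / sum_l n_jl
a = {j: {k: n / sum(ks.values()) for k, n in ks.items()}
     for j, ks in counts.items()}
print(a["C"])
```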
34
Hidden Markov Models
  • Preliminary examples

35
Loaded dice
We have 99 regular dice (R) and 1 loaded die (L).

       P(1)   P(2)   P(3)   P(4)   P(5)   P(6)
  R    1/6    1/6    1/6    1/6    1/6    1/6
  L    1/10   1/10   1/10   1/10   1/10   1/2

Given a sequence
4156266656321636543662152611536264162364261664616263
we don't know the sequence of dice that generated it:
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
36
Loaded dice
Hypothesis: we choose a different die for each
roll. Two stochastic processes give origin to the
sequence of observations:
  1) choosing the die (R or L);
  2) rolling the die.
The sequence of dice is hidden. The first
process is assumed to be Markovian (in this case
a 0-order MM). The outcome of the second process
depends only on the state reached in the first
process (that is, the chosen die).
37
Casino
Model: each state (R and L) generates a
character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the
state. The transition probabilities describe a
Markov model that generates a state path: the
hidden sequence (π). The observation sequence
(s) is generated by two concomitant stochastic
processes.
[State diagram: R and L, self-transitions 0.99, cross-transitions 0.01]
38
The observation sequence (s) is generated by two
concomitant stochastic processes.
s: 4156266656321636543662152611
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRR
[State diagram: R and L, self-transitions 0.99, cross-transitions 0.01]
Choose the state R: probability 0.99.
Choose the symbol 1: probability 1/6 (given R).
s: 4156266656321636543662152611
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRR
39
The observation sequence (s) is generated by two
concomitant stochastic processes.
s: 415626665632163654366215261
π: RRRRRLRLRRRRRRRLRRRRRRRRRRR
[State diagram: R and L, self-transitions 0.99, cross-transitions 0.01]
Choose the state L: probability 0.01 (coming from R).
Choose the symbol 5: probability 1/10 (given L).
s: 41562666563216365436621526115
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRRL
40
Loaded dice
Model: each state (R and L) generates a
character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the
state. The transition probabilities describe a
Markov model that generates a state path: the
hidden sequence (π). The observation sequence
(s) is generated by two concomitant stochastic
processes.
41
Some not-so-serious examples
1) DEMOGRAPHY. Observable: number of births
and deaths in a year in a village. Hidden
variable: economic conditions (as a first
approximation we can consider the success in
business as a random variable and, by
consequence, the wealth as a Markov variable)
--> can we deduce the economic conditions of a
village during a century by means of the register
of births and deaths?
2) THE METEOROPATHIC TEACHER. Observable: average
of the marks that a meteoropathic teacher gives to their
students during a day. Hidden variable:
weather conditions
--> can we deduce the
weather conditions during a year by means of the
class register?
42
To be more serious
1) SECONDARY STRUCTURE. Observable: protein
sequence. Hidden variable: secondary
structure
--> can we deduce (predict) the
secondary structure of a protein given its amino
acid sequence?
2) ALIGNMENT. Observable: protein sequence.
Hidden variable: position
of each residue along the alignment of a protein
family
--> can we align a protein to a
family, starting from its amino acid sequence?
43
Hidden Markov Models
  • Preliminary examples
  • Formal definition

44
Formal definition of Hidden Markov Models
  • An HMM is a stochastic generator of sequences
    characterised by:
  • N states
  • A set of transition probabilities between two
    states, a_kj
  • a_kj = P( π(i) = j | π(i-1) = k )
  • A set of starting probabilities a_0k
  • a_0k = P( π(1) = k )
  • A set of ending probabilities a_k0
  • a_k0 = P( π(i) = END | π(i-1) = k )
  • An alphabet C with M characters
  • A set of emission probabilities for each state,
    e_k(c)
  • e_k(c) = P( s_i = c | π(i) = k )
  • Constraints:
  • Σ_k a_0k = 1
  • a_k0 + Σ_j a_kj = 1  ∀ k
  • Σ_c∈C e_k(c) = 1  ∀ k

s = sequence, π = path through the states
45
Generating a sequence with an HMM

46
GpC Island, simple model
  • s = AGCGCGTAATCTG
  • π = YYYYYYYNNNNNN
  • P( s, π | M ) can be easily computed
47
P( s, π | M ) can be easily computed
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   0.1 × 0.4 × 0.4 × 0.4 × 0.4 × 0.4 × 0.1 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25 × 0.25
Transition: 0.2 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.2 × 0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.1
Multiplying all the probabilities gives the
probability of having the sequence AND the path
through the states.
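A minimal Python sketch of this product, using the toy emission and transition values that appear on these slides (Y = island state, N = non-island state; the BEGIN and END transitions are a_0k and a_k0):

```python
# Toy GpC-island HMM with the numbers shown on the slides.
emission = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
            'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a_begin = {'Y': 0.2, 'N': 0.8}
a_trans = {('Y', 'Y'): 0.7, ('Y', 'N'): 0.2, ('N', 'Y'): 0.1, ('N', 'N'): 0.8}
a_end   = {'Y': 0.1, 'N': 0.1}

def joint_probability(s, path):
    """P(s, pi | M): product of transition and emission probabilities."""
    p = a_begin[path[0]] * emission[path[0]][s[0]]
    for i in range(1, len(s)):
        p *= a_trans[(path[i - 1], path[i])] * emission[path[i]][s[i]]
    return p * a_end[path[-1]]

print(joint_probability("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))
```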
48
Evaluation of the joint probability of the
sequence and the path
49
Hidden Markov Models
  • Preliminary examples
  • Formal definition
  • Three questions

50
GpC Island, simple model
  • s = AGCGCGTAATCTG
  • π = ?????????????
  • P( s, π | M ) can be easily computed
  • How to evaluate P( s | M )?
51
How to evaluate P( s | M )?
P( s | M ) = Σ_π P( s, π | M )
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
...
2^13 different paths. Summing over all the
paths gives the probability of having the
sequence.
52
Resumé: GpC Island, simple model
  • s = AGCGCGTAATCTG
  • π = ?????????????
  • P( s, π | M ) can be easily computed
  • How to evaluate P( s | M )?
  • Can we show the hidden path?

53
Can we show the hidden path?
π* = argmax_π P( π | s, M ) = argmax_π P( π, s | M )
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
...
2^13 different paths. Viterbi path: the path
that gives the best joint probability.
54
Can we show the hidden path?
A posteriori decoding: for each position i choose
the state
π*(i) = argmax_k P( π(i) = k | s, M )
The contribution to this
probability derives from all the paths that go
through the state k at position i. The a
posteriori path can be a non-sense path (it may
not be a legitimate path if some transitions are
not permitted in the model).
55
GpC Island, simple model
  • s = AGCGCGTAATCTG
  • π = YYYYYYYNNNNNN
  • P( s, π | M ) can be easily computed
  • How to evaluate P( s | M )?
  • Can we show the hidden path?
  • Can we evaluate the parameters starting from
    known examples?

56
Can we evaluate the parameters starting from
known examples?
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   e_Y(A) × e_Y(G) × e_Y(C) × e_Y(G) × e_Y(C) × e_Y(G) × e_Y(T) × e_N(A) × e_N(A) × e_N(T) × e_N(C) × e_N(T) × e_N(G)
Transition: a_0Y × a_YY × a_YY × a_YY × a_YY × a_YY × a_YY × a_YN × a_NN × a_NN × a_NN × a_NN × a_NN × a_N0
How to find the parameters e and a that maximise
this probability? What if we don't know the path?
57
Hidden Markov Models: Algorithms
  • Resumé
  • Evaluating P(s | M): Forward Algorithm

58
Computing P( s, π | M ) for each path is a
redundant operation:
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
Emission:   0.1 × 0.4 × 0.4 × 0.4 × 0.4 × 0.4 × 0.1 × 0.1 × 0.1 × 0.1 × 0.4 × 0.1 × 0.4
Transition: 0.2 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.1
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y N
Emission:   0.1 × 0.4 × 0.4 × 0.4 × 0.4 × 0.4 × 0.1 × 0.1 × 0.1 × 0.1 × 0.4 × 0.1 × 0.25
Transition: 0.2 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.2 × 0.1
59
Computing P( s, π | M ) for each path is a
redundant operation.
If we compute the common part only once we save
2(T-1) operations.
60
Summing over all the possible paths
s: A G   π: Y Y   Emission: 0.1 × 0.4    Transition: 0.2 × 0.7   →  0.0056
s: A G   π: N Y   Emission: 0.25 × 0.4   Transition: 0.8 × 0.1   →  0.008
s: A G   π: Y N   Emission: 0.1 × 0.25   Transition: 0.2 × 0.2   →  0.001
s: A G   π: N N   Emission: 0.25 × 0.25  Transition: 0.8 × 0.8   →  0.04
61
Summing over all the possible paths
Summing over the state at the first position gives, for the first two characters (A G):
F_Y(2) = 0.0056 + 0.008 = 0.0136
F_N(2) = 0.001 + 0.04 = 0.041
Extending to the third character (C):
s: A G C   π: X Y Y and X N Y   →   ( 0.0136 × 0.7 + 0.041 × 0.1 ) × 0.4
s: A G C   π: X Y N and X N N   →   ( 0.0136 × 0.2 + 0.041 × 0.8 ) × 0.25
62
Summing over all the possible paths
Iterating until the last position of the sequence:
s: A G C G C G T A A T C T G   π: X X X X X X X X X X X X Y   →   F_Y(T) × 0.1 (a_Y0)
s: A G C G C G T A A T C T G   π: X X X X X X X X X X X X N   →   F_N(T) × 0.1 (a_N0)
Summing the two terms gives P(s | M).
63
Summing over all the possible paths
If we know the probabilities of emitting the
first two characters of the sequence, ending the path
in states R and L respectively,
F_R(2) ≡ P(s1, s2, π(2) = R | M)   and   F_L(2) ≡ P(s1, s2, π(2) = L | M),
then we can compute
P(s1, s2, s3, π(3) = R | M) = [ F_R(2) · a_RR + F_L(2) · a_LR ] · e_R(s3)
64
Forward Algorithm
On the basis of the preceding observations, the
computation of P(s | M) can be decomposed into
simpler problems. For each state k and each
position i in the sequence, we compute
  F_k(i) = P( s1 s2 s3 ... s_i, π(i) = k | M )
Initialisation:
  F_BEGIN(0) = 1,   F_i(0) = 0  ∀ i ≠ BEGIN
Recurrence:
  F_l(i+1) = P( s1 s2 ... s_i s_i+1, π(i+1) = l )
           = Σ_k P( s1 s2 ... s_i, π(i) = k ) · a_kl · e_l(s_i+1)
           = e_l(s_i+1) · Σ_k F_k(i) · a_kl
Termination:
  P( s ) = P( s1 s2 s3 ... sT, π(T+1) = END )
         = Σ_k P( s1 s2 ... sT, π(T) = k ) · a_k0
         = Σ_k F_k(T) · a_k0
(The conditioning on M will be understood.)
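A compact Python transcription of this recursion for the toy two-state island model used above (same parameters as in the joint-probability sketch; real implementations usually work with logarithms or scaling to avoid underflow):

```python
emission = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
            'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a_begin = {'Y': 0.2, 'N': 0.8}
a_trans = {('Y', 'Y'): 0.7, ('Y', 'N'): 0.2, ('N', 'Y'): 0.1, ('N', 'N'): 0.8}
a_end = {'Y': 0.1, 'N': 0.1}
states = ['Y', 'N']

def forward(s):
    """Return P(s|M) and the forward matrix (F[i] holds F_k(i+1))."""
    F = [{k: a_begin[k] * emission[k][s[0]] for k in states}]       # F_k(1)
    for i in range(1, len(s)):
        F.append({l: emission[l][s[i]] *
                     sum(F[i - 1][k] * a_trans[(k, l)] for k in states)
                  for l in states})
    p = sum(F[-1][k] * a_end[k] for k in states)                    # termination
    return p, F

p, F = forward("AGCGCGTAATCTG")
print(p)      # P(s | M)
print(F[1])   # F_Y(2) = 0.0136 and F_N(2) = 0.041, as on the slides
```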
65
Forward Algorithm
[Trellis diagram: states (BEGIN, A, B, ..., S, END) versus positions 0, 1, ..., T, T+1.
The iteration fills the matrix column by column, e.g. F_B(2) = e_B(s2) · Σ_i F_i(1) · a_iB;
the last column yields P(s | M).]
66
Forward algorithm: computational complexity
  • Naïve method
  • P( s | M ) = Σ_π P( s, π | M )
  • There are N^T possible paths.
  • Each path requires about 2·T operations.
  • The time for the computation is O( T · N^T )

s: A G C G C G T A A T C T G   π: Y Y Y Y Y Y Y Y Y Y Y Y Y
Emission:   0.1 × 0.4 × 0.4 × 0.4 × 0.4 × 0.4 × 0.1 × 0.1 × 0.1 × 0.1 × 0.4 × 0.1 × 0.4
Transition: 0.2 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.1
s: A G C G C G T A A T C T G   π: Y Y Y Y Y Y Y Y Y Y Y Y N
Emission:   0.1 × 0.4 × 0.4 × 0.4 × 0.4 × 0.4 × 0.1 × 0.1 × 0.1 × 0.1 × 0.4 × 0.1 × 0.25
Transition: 0.2 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.7 × 0.2 × 0.1
67
Forward algorithm: computational complexity
  • Forward algorithm
  • T positions, N values for each position
  • Each element requires about 2·N products and 1
    sum
  • The time for the computation is O( T · N^2 )

s: A G C   ending in Y:  F_Y(3) = 0.4 × ( 0.0136 × 0.7 + 0.041 × 0.1 ) = 0.005448
s: A G C   ending in N:  F_N(3) = 0.25 × ( 0.0136 × 0.2 + 0.041 × 0.8 ) = 0.00888
68
Forward algorithm: computational complexity
Naïve method: O( T · N^T )
Forward algorithm: O( T · N^2 )
69
Hidden Markov Models: Algorithms
  • Resumé
  • Evaluating P(s | M): Forward Algorithm
  • Evaluating P(s | M): Backward Algorithm

70
Backward Algorithm
Similar to the Forward algorithm, it computes
P( s | M ) reconstructing the sequence from the
end. For each state k and each position i in the
sequence, we compute
  B_k(i) = P( s_i+1 s_i+2 s_i+3 ... sT | π(i) = k )
Initialisation:
  B_k(T) = P( π(T+1) = END | π(T) = k ) = a_k0
Recurrence:
  B_l(i-1) = P( s_i s_i+1 ... sT | π(i-1) = l )
           = Σ_k P( s_i+1 s_i+2 ... sT | π(i) = k ) · a_lk · e_k(s_i)
           = Σ_k B_k(i) · e_k(s_i) · a_lk
Termination:
  P( s ) = P( s1 s2 s3 ... sT | π(0) = BEGIN )
         = Σ_k P( s2 ... sT | π(1) = k ) · a_0k · e_k(s1)
         = Σ_k B_k(1) · a_0k · e_k(s1)
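The mirror-image recursion in Python, again for the toy island model (parameters repeated so the sketch is self-contained); a useful check is that the termination returns the same P(s | M) as the forward algorithm:

```python
emission = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
            'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a_begin = {'Y': 0.2, 'N': 0.8}
a_trans = {('Y', 'Y'): 0.7, ('Y', 'N'): 0.2, ('N', 'Y'): 0.1, ('N', 'N'): 0.8}
a_end = {'Y': 0.1, 'N': 0.1}
states = ['Y', 'N']

def backward(s):
    """Return P(s|M) and the backward matrix (B[i] holds B_k(i+1))."""
    T = len(s)
    B = [dict() for _ in range(T)]
    B[T - 1] = {k: a_end[k] for k in states}           # B_k(T) = a_k0
    for i in range(T - 1, 0, -1):                       # fill backwards
        B[i - 1] = {l: sum(B[i][k] * emission[k][s[i]] * a_trans[(l, k)]
                           for k in states)
                    for l in states}
    p = sum(B[0][k] * a_begin[k] * emission[k][s[0]] for k in states)
    return p, B

p, B = backward("AGCGCGTAATCTG")
print(p)   # should coincide with the forward value of P(s | M)
```

Once both matrices are available, the a-posteriori decoding of slide 54 is immediate: P(π(i) = k | s) = F_k(i) · B_k(i) / P(s).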
71
Backward Algorithm
[Trellis diagram: states (BEGIN, A, B, ..., S, END) versus positions 0, 1, 2, ..., T-1, T, T+1.
The iteration fills the matrix backwards starting from the last column (B_k(T) = a_k0);
the first column yields P(s | M).]
72
Hidden Markov Models: Algorithms
  • Resumé
  • Evaluating P(s | M): Forward Algorithm
  • Evaluating P(s | M): Backward Algorithm
  • Showing the path: Viterbi decoding

73
Finding the best path
s: A G   π: Y Y   Emission: 0.1 × 0.4    Transition: 0.2 × 0.7   →  0.0056
s: A G   π: N Y   Emission: 0.25 × 0.4   Transition: 0.8 × 0.1   →  0.008
s: A G   π: Y N   Emission: 0.1 × 0.25   Transition: 0.2 × 0.2   →  0.001
s: A G   π: N N   Emission: 0.25 × 0.25  Transition: 0.8 × 0.8   →  0.04
74
Finding the best path
s: A G C   π: N Y Y   →  0.008 × 0.7 × 0.4 = 0.00224
s: A G C   π: N N Y   →  0.04 × 0.1 × 0.4 = 0.0016     (best ending in Y: 0.00224)
s: A G C   π: N Y N   →  0.008 × 0.2 × 0.25 = 0.0004
s: A G C   π: N N N   →  0.04 × 0.8 × 0.25 = 0.008     (best ending in N: 0.008)
75
Finding the best path
Iterating until the last position of the sequence:
s: A G C G C G T A A T C T G   π: N Y Y Y Y Y Y N N N Y Y Y   →   V_Y(T) × 0.1 (a_Y0)
s: A G C G C G T A A T C T G   π: N N N N N N N N N N N N N   →   V_N(T) × 0.1 (a_N0)
Choose the maximum.
76
Viterbi Algorithm
π* = argmax_π P( π, s | M )
The computation of P(s, π | M) can be decomposed into
simpler problems. Let V_k(i) be the probability of the
most probable path generating the subsequence
s1 s2 s3 ... s_i and ending in the state k at iteration i.
Initialisation:
  V_BEGIN(0) = 1,   V_i(0) = 0  ∀ i ≠ BEGIN
Recurrence:
  V_l(i+1) = e_l(s_i+1) · Max_k ( V_k(i) · a_kl )
  ptr_i(l) = argmax_k ( V_k(i) · a_kl )
Termination:
  P( s, π* ) = Max_k ( V_k(T) · a_k0 )
  π*(T) = argmax_k ( V_k(T) · a_k0 )
Traceback:
  π*(i-1) = ptr_i( π*(i) )
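A direct Python transcription of this recursion for the toy island model (probabilities are multiplied directly here; production code would use log probabilities to avoid underflow):

```python
emission = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
            'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a_begin = {'Y': 0.2, 'N': 0.8}
a_trans = {('Y', 'Y'): 0.7, ('Y', 'N'): 0.2, ('N', 'Y'): 0.1, ('N', 'N'): 0.8}
a_end = {'Y': 0.1, 'N': 0.1}
states = ['Y', 'N']

def viterbi(s):
    """Return the most probable path and its joint probability P(s, pi* | M)."""
    V = [{k: a_begin[k] * emission[k][s[0]] for k in states}]
    ptr = []                                     # back-pointers, one dict per step
    for i in range(1, len(s)):
        row, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[i - 1][k] * a_trans[(k, l)])
            row[l] = emission[l][s[i]] * V[i - 1][best] * a_trans[(best, l)]
            back[l] = best
        V.append(row)
        ptr.append(back)
    last = max(states, key=lambda k: V[-1][k] * a_end[k])
    best_p = V[-1][last] * a_end[last]
    path = [last]
    for back in reversed(ptr):                   # traceback
        path.append(back[path[-1]])
    return ''.join(reversed(path)), best_p

print(viterbi("AGCGCGTAATCTG"))
```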
77
Viterbi Algorithm
[Trellis diagram: states versus positions 0, 1, ..., T, T+1; the iteration computes
V_B(2) = e_B(s2) · max_i ( V_i(1) · a_iB ), stores the pointer ptr_2(B), and the
termination takes the MAX over the last column, giving P(s, π* | M).]
78
Viterbi Algorithm
Viterbi path
Different paths can have the same probability
79
Hidden Markov Models: Algorithms
  • Resumé
  • Evaluating P(s | M): Forward Algorithm
  • Evaluating P(s | M): Backward Algorithm
  • Showing the path: Viterbi decoding
  • Showing the path: A posteriori decoding
  • Training a model: EM algorithm

80
If we know the path generating the training
sequence
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   e_Y(A) × e_Y(G) × e_Y(C) × e_Y(G) × e_Y(C) × e_Y(G) × e_Y(T) × e_N(A) × e_N(A) × e_N(T) × e_N(C) × e_N(T) × e_N(G)
Transition: a_0Y × a_YY × a_YY × a_YY × a_YY × a_YY × a_YY × a_YN × a_NN × a_NN × a_NN × a_NN × a_NN × a_N0
Just count! Example:
a_YY = n_YY / (n_YY + n_YN) = 6/7
e_Y(A) = n_Y(A) / [ n_Y(A) + n_Y(C) + n_Y(G) + n_Y(T) ] = 1/7
81
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood
parameters when the paths generating the training
sequences are unknown:
θ^ML = argmax_θ P( s | θ, M )
Given a model with parameters θ0, the EM
algorithm finds new parameters θ that increase
the likelihood of the model:
P( s | θ ) > P( s | θ0 )
82
Expectation-Maximisation algorithm
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
...
Given a path π we can count the number of
transitions between states k and l, A_k,l(π), and the
number of emissions of character c from state k,
E_k(c,π).
We can compute the expected values over all the paths:
A_k,l = Σ_π P(π | s, θ0) · A_k,l(π)        E_k(c) = Σ_π P(π | s, θ0) · E_k(c,π)
The updated parameters are
a_k,l = A_k,l / Σ_l' A_k,l'        e_k(c) = E_k(c) / Σ_c' E_k(c')
83
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood
parameters when the paths generating the training
sequences are unknown:
θ^ML = argmax_θ P( s | θ, M )
Given a model with parameters θ0, the EM
algorithm finds new parameters θ that increase
the likelihood of the model:
P( s | θ ) > P( s | θ0 ), or equivalently log P( s | θ ) > log P( s | θ0 )
84
Expectation-Maximisation algorithm
log P( s | θ ) = log P( s, π | θ ) - log P( π | s, θ )
Multiplying by P(π | s, θ0) and summing over
all the possible paths:
log P( s | θ ) = Σ_π P(π | s, θ0) · log P( s, π | θ ) - Σ_π P(π | s, θ0) · log P( π | s, θ )
The first term, Q(θ | θ0), is the expectation value of log P(s, π | θ)
over all the current paths. One can show that
log P( s | θ ) - log P( s | θ0 ) ≥ Q(θ | θ0) - Q(θ0 | θ0),
so choosing θ such that Q(θ | θ0) - Q(θ0 | θ0) ≥ 0 guarantees that the
likelihood does not decrease.
85
Expectation-Maximisation algorithm
The EM algorithm is an iterative process. Each
iteration performs two steps:
E-step: evaluation of Q(θ | θ0) = Σ_π P(π | s, θ0) · log P( s, π | θ )
M-step: maximisation of Q(θ | θ0) over all θ.
It does NOT assure convergence to the GLOBAL Maximum
Likelihood.
86
Baum-Welch implementation of the EM algorithm
E-step: Q( θ | θ0 ) = Σ_π P(π | s, θ0) · log P( s, π | θ )
P( s, π | θ ) = a_0,π(1) · Π_i=1..T a_π(i),π(i+1) · e_π(i)(s_i)
             = Π_k,l a_k,l^A_k,l(π) · Π_k Π_c∈C e_k(c)^E_k(c,π)
A_k,l = Σ_π P(π | s, θ0) · A_k,l(π)        E_k(c) = Σ_π P(π | s, θ0) · E_k(c,π)
Expected values over all the actual paths
87
Baum-Welch implementation of the EM algorithm
M-step: maximise Q( θ | θ0 ) under the normalisation
constraints Σ_l a_k,l = 1 for any state k and Σ_c e_k(c) = 1 for any character c.
By means of the Lagrange multipliers technique we
can solve the system, obtaining
a_k,l = A_k,l / Σ_l' A_k,l'        e_k(c) = E_k(c) / Σ_c' E_k(c')
88
Baum-Welch implementation of the EM algorithm
How to compute the expected numbers of transitions
and emissions over all the paths:
F_k(i) = P( s1 s2 s3 ... s_i, π(i) = k )
B_k(i) = P( s_i+1 s_i+2 s_i+3 ... sT | π(i) = k )
A_k,l = Σ_i P( π(i) = k, π(i+1) = l | s, θ ) = Σ_i F_k(i) · a_k,l · e_l(s_i+1) · B_l(i+1) / P(s)
E_k(c) = Σ_i: s_i = c P( π(i) = k | s, θ ) = Σ_i: s_i = c F_k(i) · B_k(i) / P(s)
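A sketch of how these expected counts can be accumulated in Python. It reuses the toy model dictionaries and the forward() and backward() sketches given after slides 64 and 70, handles a single training sequence, and omits the BEGIN/END counts for brevity:

```python
# Builds on the toy model and the forward()/backward() sketches defined earlier.
def expected_counts(s):
    """E-step for one sequence: expected transition and emission counts."""
    p_s, F = forward(s)
    _, B = backward(s)
    A = {pair: 0.0 for pair in a_trans}                       # expected A_{k,l}
    E = {k: {c: 0.0 for c in 'ACGT'} for k in states}         # expected E_k(c)
    for i in range(len(s)):
        for k in states:
            E[k][s[i]] += F[i][k] * B[i][k] / p_s
            if i + 1 < len(s):
                for l in states:
                    A[(k, l)] += (F[i][k] * a_trans[(k, l)] *
                                  emission[l][s[i + 1]] * B[i + 1][l] / p_s)
    return A, E

A, E = expected_counts("AGCGCGTAATCTG")
# M-step (BEGIN/END ignored): a_kl = A_kl / sum_l' A_kl',  e_k(c) = E_k(c) / sum_c' E_k(c')
```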
89
Baum-Welch implementation of the EM algorithm
Algorithm:
  1. Start with random parameters.
  2. Compute the Forward and Backward matrices on the known sequences.
  3. Compute A_k,l and E_k(c), the expected numbers of transitions and emissions.
  4. Update a_k,l from A_k,l and e_k(c) from E_k(c) (after normalisation).
  5. Has P(s | M) increased? If yes, go back to step 2; if no, end.
90
Profile HMMs
  • HMMs for alignments

91
Profile HMMs
  • HMMs for alignments

92
How to align?
[Model: a chain of match states M0 → M1 → M2 → M3 → M4 → M5]
Each state represents a position in the alignment.
A C G G T A   ↔   M0 M1 M2 M3 M4 M5
A C G A T C   ↔   M0 M1 M2 M3 M4 M5
A T G T T C   ↔   M0 M1 M2 M3 M4 M5
Each position has a peculiar composition.
93
Given a set of sequences..
A C G G T A
A C G A T C
A T G T T C
..we can train a model..
[Model: M0 → M1 → M2 → M3 → M4 → M5]
..estimating the emission probabilities:
      M0    M1    M2    M3    M4    M5
  A   1     0     0     0.33  0     0.33
  C   0     0.66  0     0     0     0.66
  G   0     0     1     0.33  0     0
  T   0     0.33  0     0.33  1     0
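The emission table above is just column-by-column counting over the three training sequences; a minimal Python sketch (no pseudocounts, no insert or delete states):

```python
from collections import Counter

alignment = ["ACGGTA", "ACGATC", "ATGTTC"]   # the three training sequences

columns = list(zip(*alignment))              # one tuple of residues per position
emissions = []
for col in columns:
    counts = Counter(col)
    emissions.append({c: counts[c] / len(col) for c in "ACGT"})

print(emissions[3])   # position M3: A, G and T each with probability 0.33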
94
Given a trained model..
[Model: M0 → M1 → M2 → M3 → M4 → M5]
      M0    M1    M2    M3    M4    M5
  A   1     0     0     0.33  0     0.33
  C   0     0.66  0     0     0     0.66
  G   0     0     1     0.33  0     0
  T   0     0.33  0     0.33  1     0
..we can align a new sequence..
A C G A T C
..computing the emission probability
P(s | M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66
95
And for the sequence AGATC?
[Model: M0 → M1 → M2 → M3 → M4 → M5]
      M0    M1    M2    M3    M4    M5
  A   1     0     0     0.33  0     0.33
  C   0     0.66  0     0     0     0.66
  G   0     0     1     0.33  0     0
  T   0     0.33  0     0.33  1     0
A G A T C   ↔   M0 M2 M3 M4 M5   (M1 is skipped)
We need a way to introduce gaps.
96
Silent states
Red transitions allow gaps: (N-1)! transitions.
To reduce the number of parameters we can use
states that don't emit any character: 4N-8
transitions.
97
Profile HMMs
Match states, insert states and delete states.
A C G G T A        ↔   M0 M1 M2 M3 M4 M5
A C G C A G T C    ↔   M0 I0 I0 M1 M2 M3 M4 M5
A G A T C          ↔   M0 D1 M2 M3 M4 M5
98
Example of alignment
  • Sequence 1
  • A S T R A L
  • Viterbi path
  • M0 M1 M2 M3 M4 M5
  • A S T R A L
  • Sequence 2
  • A S T A I L
  • Viterbi path
  • M0 M1 M2 D3 M4 I4 M5
  • A S T A I L
  • Sequence 3
  • A R T I
  • Viterbi path
  • M0 M1 M2 D3 D4 M5
  • A R T I

99
Example of alignment
-log P(s | M) is an alignment score
100
Searching for a structural/functional pattern in
protein sequence
Zn binding loop
C H C I C R I C
C H C L C K I C
C H C I C S L C
D H C L C T I C
C H C I D S I C
C H C L C K I C
Cysteines can be replaced by an Aspartic Acid,
but only ONCE in each sequence.
101
Searching for a structural/functional pattern in
protein sequences
..ALCPCHCLCRICPLIY..
obtains a higher probability than
..WERWDHCIDSICLKDE..
..because M0 and M4 have a low emission
probability for Aspartic Acid and in the second
sequence that low value would be multiplied in twice.
102
Profile HMMs
  • HMMs for alignments
  • Example on globins

103
Structural alignment of globins
104
Structural alignment of globins
Bashford D, Chothia C & Lesk AM (1987)
Determinants of a protein fold: unique features
of the globin amino acid sequences. J. Mol. Biol. 196,
199-216
105
Alignment of globins reconstructed with profile
HMMs
Krogh A, Brown M, Mian IS, Sjölander K & Haussler
D (1994) Hidden Markov Models in computational
biology: applications to protein modelling.
J. Mol. Biol. 235, 1501-1531
106
Discrimination power of profile HMMs
Krogh A, Brown M, Mian IS, Sjölander K & Haussler
D (1994) Hidden Markov Models in computational
biology: applications to protein modelling.
J. Mol. Biol. 235, 1501-1531
107
Profile HMMs
  • HMMs for alignments
  • Example on globins
  • Other applications

108
Finding a domain
Profile HMM specific for the considered domain
Begin
End
109
Clustering subfamilies
[Diagram: BEGIN branches to HMM 1, HMM 2, HMM 3, ..., HMM n, which join at END]
Each sequence s contributes to update HMM i with
a weight equal to P( s | M_i )
110
Profile HMMs
  • HMMs for alignments
  • Example on globins
  • Other applications
  • Available codes and servers

111
HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models.
Bioinformatics 14:755-763
112
HMMER
Alignment of a protein family
hmmbuild: takes the aligned sequences, checks for
redundancy and sets the emission and the
transition probabilities of an HMM
Trained profile-HMM
hmmcalibrate: takes a trained HMM, generates a great number of
random sequences, scores them and fits the Extreme
Value Distribution to the computed scores
HMM calibrated with accurate E-value
statistics
113
HMMER
hmmalign:  set of sequences + HMM  →  alignment of all the sequences to the model
hmmsearch: set of sequences + HMM  →  list of sequences that match the HMM (sorted by E-value)
hmmpfam:   sequence + set of HMMs  →  list of HMMs that match the sequence
114
MSF Format globins50.msf
115
hmmbuild globin.hmm globins50.msf
Alignment of a protein family
hmmbuild
All the transition and emission parameters are
estimated by means of the Expectation-Maximisation
algorithm on the aligned sequences.
Trained profile-HMM
In principle we could also use NON-aligned
sequences to train the model. Nevertheless it is
more efficient to build the starting alignment
using, for example, CLUSTALW.
116
hmmcalibrate --num N --histfile globin.histo globin.hmm
Trained profile-HMM
A number N of random sequences (default 5000) is
generated and scored with the model.
hmmcalibrate
HMM calibrated with accurate E-value statistics
[Histogram of the scores log P(s|M)/P(s|N): distribution for the random
sequences versus the range observed for globin sequences]
E-value(S) = expected number of random sequences
with a score > S
117
Trained model (globin.hmm)
HMMER2.0 2.3.2 NAME globins50 LENG 143 ALPH
Amino RF no CS no MAP yes COM
/home/gigi/bin/hmmbuild globin.hmm
globins50.msf COM /home/gigi/bin/hmmcalibrate
--histfile globin.histo globin.hmm NSEQ 50 DATE
Sun May 29 19:03:18 2005 CKSUM 9858 XT -8455
-4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455 NULE 595 -1558 85
338 -294 453 -1158 197 249 902
-1085 -142 -21 -313 45 531 201
384 -1998 -644 EVD -38.893742
0.243153 HMM A C D E F
G H I K L M N
P Q R S T V W
Y m->m m->i m->d i->m
i->i d->m d->d b->m m->e -450
-1900 1 591 -1587 159 1351
-1874 -201 151 -1600 998 -1591 -693
389 -1272 595 42 -31 27 -693
-1797 -1134 14 - -149 -500 233
43 -381 399 106 -626 210 -466
-720 275 394 45 96 359 117
-369 -294 -249 - -23 -6528 -7571
-894 -1115 -701 -1378 -450 2
-926 -2616 2221 2269 -2845 -1178 -325
-2678 -300 -2596 -1810 220 -1592 939
-974 -671 -939 -2204 -2785 -1925 15
- -149 -500 233 43 -381 399
106 -626 210 -466 -720 275 394
45 96 359 117 -369 -294 -249
- -23 -6528 -7571 -894 -1115 -701
-1378 3 -638 -1715 -680
497 -2043 -1540 23 -1671 2380 -1641
-840 -222 -1595 437 1040 -564 -523
-1363 2124 -1313 16 - -149 -500
233 43 -381 399 106 -626 210
-466 -720 275 394 45 96 359
117 -369 -294 -249 - -23 -6528
-7571 -894 -1115 -701 -1378
118
Trained model (globin.hmm): null model
[Same trained HMM file as on the previous slide]
Score = INT[ 1000 × log2(prob / null_prob) ]; null_prob = 1 for transitions
119
Trained model (globin.hmm): null model
[Same trained HMM file as above]
Score = INT[ 1000 × log2(prob / null_prob) ]; null_prob = natural abundance for emissions
120
Trained model (globin.hmm)
[Same trained HMM file as above]
Transitions
121
Trained model (globin.hmm)
[Same trained HMM file as above]
Emissions
122
Trained model (globin.hmm)
[Same trained HMM file as above]
Emissions
123
Trained model (globin.hmm)
[Same trained HMM file as above]
Emissions
124
hmmemit -n N globin.hmm
Trained profile-HMM
The parameters of the model are used to generate
new sequences
hmmemit
Sequences generated by the model
125
hmmsearch globin.hmm Artemia.fa > Artemia.globin
Set of sequences
Trained profile-HMM
hmmsearch
List of sequences that match the HMM (sorted by
E-value)
126
Search results (Artemia.globin)
Sequence   Description     Score    E-value    N
--------   -----------     -----    -------    -
S13421     S13421          474.3    1.7e-143   9

Parsed for domains:
Sequence   Domain   seq-f   seq-t      hmm-f   hmm-t   score   E-value
--------   ------   -----   -----      -----   -----   -----   -------
S13421     7/9        932    1075 ..       1     143    76.9   7.3e-24
S13421     2/9        153     293 ..       1     143    63.7   6.8e-20
S13421     3/9        307     450 ..       1     143    59.8   9.8e-19
S13421     8/9       1089    1234 ..       1     143    57.6   4.5e-18
S13421     9/9       1248    1390 ..       1     143    52.3   1.8e-16
S13421     1/9          1     143 .        1     143    51.2   4e-16
S13421     4/9        464     607 ..       1     143    46.7   8.6e-15
S13421     6/9        775     918 ..       1     143    42.2   2e-13
S13421     5/9        623     762 ..       1     143    23.9   6.6e-08

Alignments of top-scoring domains:
S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24
->eekalvksvwgkveknveevGaeaLerllvvyPetk
ryFpkFkdLss e a vk w v
vG l P FpkF d S13421
932 REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFS
DVPL 978 adavkgsakvkahgkkVlt
algdavkkldd...lkgalakLselHaqklr
d v h V tl ld l Le H
lr S13421 979 -DKLEDTPAVGKHSISVTTKLDELIQTL
DEpanLALLARQLGEDHIV-LR 1026
vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYklt
v fk vl l lg f
k S13421 1027
VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ
1075
[Slide callouts: number of domains, domains sorted by E-value, start and end
positions, consensus sequence, target sequence]
127
hmmalign globin.hmm globins630.fa
Set of sequences + trained profile-HMM
hmmalign
Alignment of all the sequences to the model
[Callouts: insertions, gaps]

BAHG_VITSP  QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR  TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX  NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO  TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH  -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI  GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE  ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR  QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH  DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA  KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE  QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV
128
HMMER applications: PFAM http://www.sanger.ac.uk/Software/Pfam/
131
PFAM Exercise
Generate with hmmemit a sequence from the globin
model and search for it in the PFAM database.
132
PFAM Exercise
Search the SwissProt database for the
sequences CG301_HUMAN and Q9H5F4_HUMAN.
1) Search them in the PFAM database.
2) Launch PSI-BLAST searches. Is it possible to annotate the
sequences by means of the BLAST results?
133
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjölander K & Haussler
D (1994) Hidden Markov Models in computational
biology: applications to protein modelling.
J. Mol. Biol. 235, 1501-1531
134
SAM applications: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
135
HMMPRO: http://www.netid.com/html/hmmpro.html (Pierre Baldi, Net-ID)
136
HMMs for Mapping problems
  • Mapping problems in protein prediction

137
Secondary structure
Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
138
Topology of membrane proteins
Topography:
position of the trans-membrane segments along the
sequence
ALALMLCMLTYRHKELKLKLKK ALALMLCMLTYRHKELKLKLKK
ALALMLCMLTYRHKELKLKLKK
139
HMMs for Mapping problems
  • Mapping problems in protein prediction
  • Labelled HMMs

140
HMM for secondary structure prediction
Simplest model vs. introducing a grammar
[State diagrams: a simplest model and a grammar model with states a1, a2, a3, b1, b2, c]
141
HMM for secondary structure prediction
Labels: the states a1, a2 and a3 share the same
label, and so do states b1 and b2. Decoding the
Viterbi path for emitting a sequence s gives a
mapping between the sequence s and a sequence of
labels y:
s:    S A L K M N Y T R E I M V A S N Q    (sequence)
π:    c a1 a2 a3 a4 a4 a4 c c c c b1 b2 b2 b2 c c    (path)
Y(π): c a a a a a a c c c c b b b b c c    (labels)
[State diagram as on the previous slide]
142
Computing P(s, y | M)
Only the paths whose labelling is y have to be
considered in the sum. In the Forward and Backward
algorithms this means setting F_k(i) = 0 and B_k(i) = 0
if Y(k) ≠ y_i (see the sketch below).
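In code the constraint is a one-line change to the forward recursion: any state whose label disagrees with the known label at that position gets probability zero. A Python sketch for the toy island model, where each state happens to be its own label (in real models several states share a label):

```python
emission = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
            'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a_begin = {'Y': 0.2, 'N': 0.8}
a_trans = {('Y', 'Y'): 0.7, ('Y', 'N'): 0.2, ('N', 'Y'): 0.1, ('N', 'N'): 0.8}
a_end = {'Y': 0.1, 'N': 0.1}
states = ['Y', 'N']
label = {'Y': 'Y', 'N': 'N'}        # mapping from state to label

def forward_labelled(s, y):
    """P(s, y | M): forward algorithm restricted to label-consistent states."""
    F = [{k: a_begin[k] * emission[k][s[0]] if label[k] == y[0] else 0.0
          for k in states}]
    for i in range(1, len(s)):
        F.append({l: (emission[l][s[i]] *
                      sum(F[i - 1][k] * a_trans[(k, l)] for k in states))
                     if label[l] == y[i] else 0.0
                  for l in states})
    return sum(F[-1][k] * a_end[k] for k in states)

print(forward_labelled("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))
```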
143
Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g.
amino acid sequences and their native secondary
structure) we want to find the parameters of the
model, without knowing the generating paths:
θ^ML = argmax_θ P( s, y | θ, M )
The algorithm is the same as in the non-labelled case if we use
the Forward and Backward matrices defined in the
last slide.
Supervised learning of the mapping
144
HMM