Title: Variants of HMMs
1. Variants of HMMs

2. Higher-order HMMs
- How do we model "memory" longer than one time point?
  - First order:  P(π_{i+1} = l | π_i = k) = a_kl
  - Second order: P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl
- A second order HMM with K states is equivalent to a first order HMM with K² states
(Diagram: for the two-state example with states H and T, the equivalent first order HMM has the four pair states HH, HT, TH, TT, connected by transitions a_HHT, a_HTH, a_HTT, a_THH, a_THT, a_TTH; e.g., the probability of moving into state HT depends on whether the previous state was H or T.)
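As a sketch of this equivalence, the second order transitions a_jkl can be laid out as first order transitions between pair states; the probabilities below are illustrative, not taken from the slides:

```python
# Embed a second order HMM over {H, T} into a first order HMM over pairs.
# a2[(j, k)][l] = P(next = l | previous = j, current = k); illustrative numbers.
states = ["H", "T"]
a2 = {
    ("H", "H"): {"H": 0.6, "T": 0.4},
    ("H", "T"): {"H": 0.3, "T": 0.7},
    ("T", "H"): {"H": 0.5, "T": 0.5},
    ("T", "T"): {"H": 0.2, "T": 0.8},
}

# First order transitions over the K^2 pair states:
# pair (j, k) moves to pair (k, l) with probability a2[(j, k)][l].
A = {}
for (j, k), row in a2.items():
    for l, p in row.items():
        A[((j, k), (k, l))] = p

# Each pair state's outgoing probabilities still sum to 1.
row_sums = {pair: sum(A.get((pair, (pair[1], l)), 0.0) for l in states)
            for pair in a2}
```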
3. Modeling the Duration of States

(Diagram: two states X and Y; X loops to itself with probability p and moves to Y with probability 1-p; Y loops with probability q and leaves with probability 1-q.)

- Length distribution of region X:
  - E[l_X] = 1/(1-p)
  - Geometric distribution, with mean 1/(1-p)
- This is a significant disadvantage of HMMs
- Several solutions exist for modeling different length distributions
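A quick simulation illustrates the geometric duration of a self-looping state; p = 0.9 is an arbitrary illustrative value:

```python
import random

# Time spent in a state with self-loop probability p is geometric:
# P(l = k) = p^(k-1) (1-p), with mean 1/(1-p).
def state_duration(p, rng):
    d = 1
    while rng.random() < p:  # stay in the state with probability p
        d += 1
    return d

rng = random.Random(42)
p = 0.9
durations = [state_duration(p, rng) for _ in range(100_000)]
mean = sum(durations) / len(durations)  # should be close to 1/(1-p) = 10
```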
4. Solution 1: Chain several states

(Diagram: several copies of state X chained in a row, the last of which loops to itself with probability p before moving on to Y.)

- Disadvantage: Still very inflexible
  - l_X = C + geometric with mean 1/(1-p)
5. Solution 2: Negative binomial distribution

(Diagram: n copies of state X in a row, each looping to itself with probability p and moving to the next state with probability 1-p, followed by Y.)

- Duration in X: m turns, where
  - During the first m-1 turns, exactly n-1 arrows to the next state are followed
  - During the m-th turn, an arrow to the next state is followed
- P(l_X = m) = C(m-1, n-1) (1-p)^(n-1) p^((m-1)-(n-1)) (1-p) = C(m-1, n-1) (1-p)^n p^(m-n)
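The formula can be checked numerically; n = 3 and p = 0.6 are illustrative values:

```python
from math import comb

# P(l_X = m) for a chain of n states, each with self-loop probability p.
def duration_pmf(m, n, p):
    if m < n:
        return 0.0
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.6
probs = [duration_pmf(m, n, p) for m in range(1, 200)]
total = sum(probs)                                            # ≈ 1
mean = sum(m * duration_pmf(m, n, p) for m in range(1, 200))  # ≈ n/(1-p) = 7.5
```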
6. Example: genes in prokaryotes
- EasyGene:
  - Prokaryotic gene-finder (Larsen TS, Krogh A)
  - Negative binomial with n = 3
7. Solution 3: Duration modeling
- Upon entering a state X:
  - Choose a duration d, according to a probability distribution
  - Generate d letters according to the emission probabilities
  - Take a transition to the next state according to the transition probabilities
- Disadvantage: increase in complexity
  - Time: O(D²) -- Why?
  - Space: O(D)
  - where D = maximum duration of a state
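The generative steps can be sketched for a single hypothetical state with an explicit duration distribution (all numbers illustrative):

```python
import random

# Explicit duration modeling: pick d from a duration distribution,
# then emit d letters from the state's emission distribution.
def emit_state(duration_pmf, emissions, rng):
    durations, dprobs = zip(*duration_pmf.items())
    d = rng.choices(durations, weights=dprobs)[0]
    letters, eprobs = zip(*emissions.items())
    return "".join(rng.choices(letters, weights=eprobs, k=d))

rng = random.Random(0)
seq = emit_state({3: 0.2, 4: 0.5, 5: 0.3}, {"A": 0.7, "T": 0.3}, rng)
```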
8. Connection Between Alignment and HMMs

9. A state model for alignment

(Diagram: three states, M (1,1), I (1,0), and J (0,1).)

- Alignments correspond 1-to-1 with sequences of states M, I, J:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
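The 1-to-1 correspondence can be made concrete by reading the state off each alignment column; here x is taken to be the bottom sequence of the example:

```python
# Map each alignment column to a state: M = both letters aligned,
# I = letter in x only (gap in y), J = letter in y only (gap in x).
def states(y_row, x_row):
    out = []
    for yc, xc in zip(y_row, x_row):
        if yc != "-" and xc != "-":
            out.append("M")
        elif xc != "-":
            out.append("I")
        else:
            out.append("J")
    return "".join(out)

y_row = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
x_row = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
path = states(y_row, x_row)
# path == "IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII"
```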
10. Let's score the transitions

(Diagram: the same three states; transitions into M score s(xi, yj), transitions from M into I or J score -d, and transitions within or between I and J score -e.)

- Alignments correspond 1-to-1 with sequences of states M, I, J
11. How do we find the optimal alignment according to this model?
- Dynamic Programming
- M(i, j): optimal alignment of x1…xi to y1…yj ending in M
- I(i, j): optimal alignment of x1…xi to y1…yj ending in I
- J(i, j): optimal alignment of x1…xi to y1…yj ending in J
- The score is additive, therefore we can apply the DP recurrence formulas
12. Needleman-Wunsch with affine gaps -- state version
- Initialization:
  - M(0, 0) = 0;  M(i, 0) = M(0, j) = -∞, for i, j > 0
  - I(i, 0) = -d - (i-1)×e;  J(0, j) = -d - (j-1)×e
- Iteration:
  - M(i, j) = s(xi, yj) + max { M(i-1, j-1), I(i-1, j-1), J(i-1, j-1) }
  - I(i, j) = max { -d + M(i-1, j), -e + I(i-1, j) }
  - J(i, j) = max { -d + M(i, j-1), -e + J(i, j-1) }
- Termination: score = max { M(m, n), I(m, n), J(m, n) }
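The three-matrix recurrence can be sketched directly in code (scores only, no traceback); the match/mismatch values and gap penalties are illustrative:

```python
# Affine-gap global alignment score via the M / I / J matrices.
# d = gap open penalty, e = gap extend penalty (both subtracted).
NEG_INF = float("-inf")

def affine_align_score(x, y, match=1.0, mismatch=-1.0, d=3.0, e=1.0):
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e  # x aligned against leading gaps in y
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e  # y aligned against leading gaps in x
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i-1][j-1], I[i-1][j-1], J[i-1][j-1])
            I[i][j] = max(-d + M[i-1][j], -e + I[i-1][j])
            J[i][j] = max(-d + M[i][j-1], -e + J[i][j-1])
    return max(M[m][n], I[m][n], J[m][n])
```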
13. Probabilistic interpretation of an alignment
- An alignment is a hypothesis that the two sequences are related by evolution
- Goals:
  - Produce the most likely alignment
  - Assess the likelihood that the sequences are indeed related
14A Pair HMM for alignments
BEGIN
?
?
?
1 2? ?
M
I
J
?
?
?
END
15A Pair HMM for unaligned sequences
Model R
1 - ?
1 - ?
J P(yj)
I P(xi)
END
BEGIN
END BEGIN
1 - ?
1 - ?
?
?
- P(x, y R) ?(1 ?)m P(x1)P(xm) ?(1 ?)n
P(y1)P(yn) -
- ?2(1 ?)mn ?i P(xi) ?j P(yj)
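The random-model likelihood factorizes, so its log is a simple sum; the emission table and η below are illustrative:

```python
from math import log

# log P(x, y | R) = 2 log(eta) + (m+n) log(1-eta) + sum of log emission probs.
def log_p_random(x, y, p, eta):
    m, n = len(x), len(y)
    ll = 2 * log(eta) + (m + n) * log(1 - eta)
    ll += sum(log(p[c]) for c in x) + sum(log(p[c]) for c in y)
    return ll

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ll = log_p_random("ACGT", "AG", p, eta=0.01)
```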
16. To compare ALIGNMENT vs. RANDOM hypothesis
- Every pair of letters contributes:
  - (1-2δ-τ) P(xi, yj) when matched
  - δ P(xi) P(yj) when gapped
  - (1-η)² P(xi) P(yj) in the random model
- Focus on the comparison of P(xi, yj) vs. P(xi) P(yj)

(Diagrams: the alignment pair HMM, with M emitting P(xi, yj) and I, J emitting P(xi) and P(yj), shown alongside the random model R.)
17. To compare ALIGNMENT vs. RANDOM hypothesis
- Idea:
  - We will divide the alignment score by the random score, and take logarithms
- Let:
  - s(xi, yj) = log [ P(xi, yj) / (P(xi) P(yj)) ] + log [ (1-2δ-τ) / (1-η)² ]
  - d = -log [ δ (1-ε-τ) / ((1-η)(1-2δ-τ)) ]
  - e = -log [ ε / (1-η) ]
- Every letter b in the random model contributes (1-η) P(b)
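Plugging illustrative parameter values into these definitions gives concrete, positive gap penalties:

```python
from math import log

# Gap penalties implied by the pair HMM parameters; values are illustrative.
delta, epsilon, tau, eta = 0.1, 0.3, 0.001, 0.001

d = -log(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
e = -log(epsilon / (1 - eta))
# For these values d > e > 0: opening a gap costs more than extending one.
```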
18. The meaning of alignment scores
- Because δ and ε are small, and τ and η are very small:
  - s(xi, yj) = log [ P(xi, yj) / (P(xi) P(yj)) ] + log [ (1-2δ-τ) / (1-η)² ] ≈ log [ P(xi, yj) / (P(xi) P(yj)) ] + log(1-2δ)
  - d = -log [ δ (1-ε-τ) / ((1-η)(1-2δ-τ)) ] ≈ -log [ δ (1-ε) / (1-2δ) ]
  - e = -log [ ε / (1-η) ] ≈ -log ε
19. The meaning of alignment scores
- The Viterbi algorithm for Pair HMMs corresponds exactly to the Needleman-Wunsch algorithm with affine gaps
- However, now we need to score alignments with parameters that add up to probability distributions:
  - ε ≈ 1/(mean length of the next gap)
  - δ ≈ 1/(mean arrival time of the next gap)
  - affine gaps decouple arrival time from length
  - τ ≈ 1/(mean length of the aligned sequences) (set to ≈ 0)
  - η ≈ 1/(mean length of the unaligned sequences) (set to ≈ 0)
20. The meaning of alignment scores
- Match/mismatch scores:
  - s(a, b) ≈ log [ P(a, b) / (P(a) P(b)) ]  (let's ignore log(1-2δ) for the moment, and assume no gaps)
- Example:
  - Say DNA regions between human and mouse have an average conservation of 50%
  - Then P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8 (so they sum to ½)
  - P(A,C) = P(A,G) = … = P(T,G) = 1/24 (12 mismatch pairs, so they sum to ½)
  - Say P(A) = P(C) = P(G) = P(T) = ¼
  - Then, s(match) = log [ (1/8) / (¼ × ¼) ] = log 2 = 1, for a match
  - s(mismatch) = log [ (1/24) / (¼ × ¼) ] = log(16/24) = -0.585
  - Cutoff similarity that scores 0: s × 1 - (1-s) × 0.585 = 0, giving s ≈ 0.37
  - According to this model, a 37%-conserved sequence with no gaps would score 0 on average
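The arithmetic in this example can be reproduced directly (log base 2, i.e. scores in bits):

```python
from math import log2

# Log-odds match/mismatch scores for 50% average conservation
# with uniform background P(a) = 1/4, as in the example above.
p_match_pair = 1 / 8      # each of the 4 identical pairs
p_mismatch_pair = 1 / 24  # each of the 12 mismatching pairs
background = 1 / 4

s_match = log2(p_match_pair / (background * background))       # = 1.0
s_mismatch = log2(p_mismatch_pair / (background * background)) # ≈ -0.585

# Conservation level s at which the expected score is zero:
# s * s_match + (1 - s) * s_mismatch = 0
cutoff = -s_mismatch / (s_match - s_mismatch)  # ≈ 0.37
```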
21. Substitution matrices
- A more meaningful way to assign match/mismatch scores
- For protein sequences, different substitutions have dramatically different frequencies!
- PAM matrices:
  - Start from a curated set of very similar protein sequences
  - Construct ancestral sequences (using parsimony)
  - Calculate A_ab: frequency of letters a and b interchanging
  - Calculate B_ab = P(b | a) = A_ab / (Σ_c A_ac)
  - Adjust matrix B so that Σ_{a,b} q_a q_b B_ab = 0.01 → PAM(1)
  - Let PAM(N) = PAM(1)^N -- Common: PAM(250)
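Extrapolation by matrix power can be sketched with a toy 2-letter alphabet; the matrix below is illustrative, not a real PAM(1):

```python
import numpy as np

# PAM(N) = PAM(1)^N; rows are conditional distributions P(b | a).
pam1 = np.array([[0.99, 0.01],
                 [0.02, 0.98]])

pam250 = np.linalg.matrix_power(pam1, 250)
# Rows of PAM(N) remain probability distributions for any N.
```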
22. Substitution matrices
- BLOSUM matrices:
  - Start from the BLOCKS database (curated, gap-free alignments)
  - Cluster sequences according to > X% identity
  - Calculate A_ab: # of aligned a-b pairs between distinct clusters, correcting by 1/(mn), where m, n are the two cluster sizes
  - Estimate:
    - P(a) = (Σ_b A_ab) / (Σ_cd A_cd);  P(a, b) = A_ab / (Σ_cd A_cd)
23. BLOSUM matrices

(Matrices shown: BLOSUM 50 and BLOSUM 62; the two are scaled differently.)