Title: Variants of HMMs
1. Variants of HMMs

2. Higher-order HMMs
- How do we model "memory" longer than one time point?
  - First order:  P(π_{i+1} = l | π_i = k) = a_kl
  - Second order: P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl
- A second order HMM with K states is equivalent to a first order HMM with K² states
(Diagram: for the two-state example with states H and T, the equivalent first order HMM has the four pair states HH, HT, TH, TT, connected by transitions a_HHT, a_HTH, a_HTT, a_THH, a_THT, a_TTH; e.g., the probability of moving into state HT depends on whether the previous state was H or T.)
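As a sketch of this equivalence, the second order transitions a_jkl can be laid out as first order transitions between pair states; the probabilities below are illustrative, not taken from the slides:

```python
# Embed a second order HMM over {H, T} into a first order HMM over pairs.
# a2[(j, k)][l] = P(next = l | previous = j, current = k); illustrative numbers.
states = ["H", "T"]
a2 = {
    ("H", "H"): {"H": 0.6, "T": 0.4},
    ("H", "T"): {"H": 0.3, "T": 0.7},
    ("T", "H"): {"H": 0.5, "T": 0.5},
    ("T", "T"): {"H": 0.2, "T": 0.8},
}

# First order transitions over the K^2 pair states:
# pair (j, k) moves to pair (k, l) with probability a2[(j, k)][l].
A = {}
for (j, k), row in a2.items():
    for l, p in row.items():
        A[((j, k), (k, l))] = p

# Each pair state's outgoing probabilities still sum to 1.
row_sums = {pair: sum(A.get((pair, (pair[1], l)), 0.0) for l in states)
            for pair in a2}
```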
3. Modeling the Duration of States

(Diagram: two states X and Y; X loops to itself with probability p and moves to Y with probability 1-p; Y loops with probability q and leaves with probability 1-q.)

- Length distribution of region X:
  - E[l_X] = 1/(1-p)
  - Geometric distribution, with mean 1/(1-p)
- This is a significant disadvantage of HMMs
- Several solutions exist for modeling different length distributions
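A quick simulation illustrates the geometric duration of a self-looping state; p = 0.9 is an arbitrary illustrative value:

```python
import random

# Time spent in a state with self-loop probability p is geometric:
# P(l = k) = p^(k-1) (1-p), with mean 1/(1-p).
def state_duration(p, rng):
    d = 1
    while rng.random() < p:  # stay in the state with probability p
        d += 1
    return d

rng = random.Random(42)
p = 0.9
durations = [state_duration(p, rng) for _ in range(100_000)]
mean = sum(durations) / len(durations)  # should be close to 1/(1-p) = 10
```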
4. Solution 1: Chain several states

(Diagram: several copies of state X chained in a row, the last of which loops to itself with probability p before moving on to Y.)

- Disadvantage: Still very inflexible
  - l_X = C + geometric with mean 1/(1-p)
5. Solution 2: Negative binomial distribution

(Diagram: n copies of state X in a row, each looping to itself with probability p and moving to the next state with probability 1-p, followed by Y.)

- Duration in X: m turns, where
  - During the first m-1 turns, exactly n-1 arrows to the next state are followed
  - During the m-th turn, an arrow to the next state is followed
- P(l_X = m) = C(m-1, n-1) (1-p)^(n-1) p^((m-1)-(n-1)) (1-p) = C(m-1, n-1) (1-p)^n p^(m-n)
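The formula can be checked numerically; n = 3 and p = 0.6 are illustrative values:

```python
from math import comb

# P(l_X = m) for a chain of n states, each with self-loop probability p.
def duration_pmf(m, n, p):
    if m < n:
        return 0.0
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.6
probs = [duration_pmf(m, n, p) for m in range(1, 200)]
total = sum(probs)                                            # ≈ 1
mean = sum(m * duration_pmf(m, n, p) for m in range(1, 200))  # ≈ n/(1-p) = 7.5
```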
6. Example: genes in prokaryotes
- EasyGene:
  - Prokaryotic gene-finder (Larsen TS, Krogh A)
  - Negative binomial with n = 3
7. Solution 3: Duration modeling
- Upon entering a state X:
  - Choose a duration d, according to a probability distribution
  - Generate d letters according to the emission probabilities
  - Take a transition to the next state according to the transition probabilities
- Disadvantage: increase in complexity
  - Time: O(D²) -- Why?
  - Space: O(D)
  - where D = maximum duration of a state
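The generative steps can be sketched for a single hypothetical state with an explicit duration distribution (all numbers illustrative):

```python
import random

# Explicit duration modeling: pick d from a duration distribution,
# then emit d letters from the state's emission distribution.
def emit_state(duration_pmf, emissions, rng):
    durations, dprobs = zip(*duration_pmf.items())
    d = rng.choices(durations, weights=dprobs)[0]
    letters, eprobs = zip(*emissions.items())
    return "".join(rng.choices(letters, weights=eprobs, k=d))

rng = random.Random(0)
seq = emit_state({3: 0.2, 4: 0.5, 5: 0.3}, {"A": 0.7, "T": 0.3}, rng)
```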
8. Connection Between Alignment and HMMs

9. A state model for alignment

(Diagram: three states, M (1,1), I (1,0), and J (0,1).)

- Alignments correspond 1-to-1 with sequences of states M, I, J:

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC
IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
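The 1-to-1 correspondence can be made concrete by reading the state off each alignment column; here x is taken to be the bottom sequence of the example:

```python
# Map each alignment column to a state: M = both letters aligned,
# I = letter in x only (gap in y), J = letter in y only (gap in x).
def states(y_row, x_row):
    out = []
    for yc, xc in zip(y_row, x_row):
        if yc != "-" and xc != "-":
            out.append("M")
        elif xc != "-":
            out.append("I")
        else:
            out.append("J")
    return "".join(out)

y_row = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
x_row = "TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC"
path = states(y_row, x_row)
# path == "IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII"
```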
10. Let's score the transitions

(Diagram: the same three states; transitions into M score s(xi, yj), transitions from M into I or J score -d, and transitions within or between I and J score -e.)

- Alignments correspond 1-to-1 with sequences of states M, I, J
11. How do we find the optimal alignment according to this model?
- Dynamic Programming
- M(i, j): optimal alignment of x1…xi to y1…yj ending in M
- I(i, j): optimal alignment of x1…xi to y1…yj ending in I
- J(i, j): optimal alignment of x1…xi to y1…yj ending in J
- The score is additive, therefore we can apply the DP recurrence formulas
12. Needleman-Wunsch with affine gaps -- state version
- Initialization:
  - M(0, 0) = 0;  M(i, 0) = M(0, j) = -∞, for i, j > 0
  - I(i, 0) = -d - (i-1)×e;  J(0, j) = -d - (j-1)×e
- Iteration:
  - M(i, j) = s(xi, yj) + max { M(i-1, j-1), I(i-1, j-1), J(i-1, j-1) }
  - I(i, j) = max { -d + M(i-1, j), -e + I(i-1, j) }
  - J(i, j) = max { -d + M(i, j-1), -e + J(i, j-1) }
- Termination: score = max { M(m, n), I(m, n), J(m, n) }
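The three-matrix recurrence can be sketched directly in code (scores only, no traceback); the match/mismatch values and gap penalties are illustrative:

```python
# Affine-gap global alignment score via the M / I / J matrices.
# d = gap open penalty, e = gap extend penalty (both subtracted).
NEG_INF = float("-inf")

def affine_align_score(x, y, match=1.0, mismatch=-1.0, d=3.0, e=1.0):
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        I[i][0] = -d - (i - 1) * e  # x aligned against leading gaps in y
    for j in range(1, n + 1):
        J[0][j] = -d - (j - 1) * e  # y aligned against leading gaps in x
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = s + max(M[i-1][j-1], I[i-1][j-1], J[i-1][j-1])
            I[i][j] = max(-d + M[i-1][j], -e + I[i-1][j])
            J[i][j] = max(-d + M[i][j-1], -e + J[i][j-1])
    return max(M[m][n], I[m][n], J[m][n])
```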
13. Probabilistic interpretation of an alignment
- An alignment is a hypothesis that the two sequences are related by evolution
- Goals:
  - Produce the most likely alignment
  - Assess the likelihood that the sequences are indeed related
14A Pair HMM for alignments
BEGIN
?
?
?
1 2? ?
M
I
J
?
?
?
END
15A Pair HMM for unaligned sequences
Model R
1 - ?
1 - ?
J P(yj)
I P(xi)
END
BEGIN
END BEGIN
1 - ?
1 - ?
?
?
- P(x, y R) ?(1 ?)m P(x1)P(xm) ?(1 ?)n
P(y1)P(yn) -
- ?2(1 ?)mn ?i P(xi) ?j P(yj)
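The random-model likelihood factorizes, so its log is a simple sum; the emission table and η below are illustrative:

```python
from math import log

# log P(x, y | R) = 2 log(eta) + (m+n) log(1-eta) + sum of log emission probs.
def log_p_random(x, y, p, eta):
    m, n = len(x), len(y)
    ll = 2 * log(eta) + (m + n) * log(1 - eta)
    ll += sum(log(p[c]) for c in x) + sum(log(p[c]) for c in y)
    return ll

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ll = log_p_random("ACGT", "AG", p, eta=0.01)
```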
16. To compare ALIGNMENT vs. RANDOM hypothesis
- Every pair of letters contributes:
  - (1-2δ-τ) P(xi, yj) when matched
  - δ P(xi) P(yj) when gapped
  - (1-η)² P(xi) P(yj) in the random model
- Focus on the comparison of P(xi, yj) vs. P(xi) P(yj)

(Diagrams: the alignment pair HMM, with M emitting P(xi, yj) and I, J emitting P(xi) and P(yj), shown alongside the random model R.)
17. To compare ALIGNMENT vs. RANDOM hypothesis
- Idea:
  - We will divide the alignment score by the random score, and take logarithms
- Let:
  - s(xi, yj) = log [ P(xi, yj) / (P(xi) P(yj)) ] + log [ (1-2δ-τ) / (1-η)² ]
  - d = -log [ δ (1-ε-τ) / ((1-η)(1-2δ-τ)) ]
  - e = -log [ ε / (1-η) ]
- Every letter b in the random model contributes (1-η) P(b)
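Plugging illustrative parameter values into these definitions gives concrete, positive gap penalties:

```python
from math import log

# Gap penalties implied by the pair HMM parameters; values are illustrative.
delta, epsilon, tau, eta = 0.1, 0.3, 0.001, 0.001

d = -log(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
e = -log(epsilon / (1 - eta))
# For these values d > e > 0: opening a gap costs more than extending one.
```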
18. The meaning of alignment scores
- Because δ and ε are small, and τ and η are very small:
  - s(xi, yj) = log [ P(xi, yj) / (P(xi) P(yj)) ] + log [ (1-2δ-τ) / (1-η)² ] ≈ log [ P(xi, yj) / (P(xi) P(yj)) ] + log(1-2δ)
  - d = -log [ δ (1-ε-τ) / ((1-η)(1-2δ-τ)) ] ≈ -log [ δ (1-ε) / (1-2δ) ]
  - e = -log [ ε / (1-η) ] ≈ -log ε
19. The meaning of alignment scores
- The Viterbi algorithm for Pair HMMs corresponds exactly to the Needleman-Wunsch algorithm with affine gaps
- However, now we need to score alignments with parameters that add up to probability distributions:
  - ε ≈ 1/(mean length of the next gap)
  - δ ≈ 1/(mean arrival time of the next gap)
  - affine gaps decouple arrival time from length
  - τ ≈ 1/(mean length of the aligned sequences) (set to ≈ 0)
  - η ≈ 1/(mean length of the unaligned sequences) (set to ≈ 0)
20. The meaning of alignment scores
- Match/mismatch scores:
  - s(a, b) ≈ log [ P(a, b) / (P(a) P(b)) ]  (let's ignore log(1-2δ) for the moment, and assume no gaps)
- Example:
  - Say DNA regions between human and mouse have an average conservation of 50%
  - Then P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8 (so they sum to ½)
  - P(A,C) = P(A,G) = … = P(T,G) = 1/24 (12 mismatch pairs, so they sum to ½)
  - Say P(A) = P(C) = P(G) = P(T) = ¼
  - Then, s(match) = log [ (1/8) / (¼ × ¼) ] = log 2 = 1, for a match
  - s(mismatch) = log [ (1/24) / (¼ × ¼) ] = log(16/24) = -0.585
  - Cutoff similarity that scores 0: s × 1 - (1-s) × 0.585 = 0, giving s ≈ 0.37
  - According to this model, a 37%-conserved sequence with no gaps would score 0 on average
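The arithmetic in this example can be reproduced directly (log base 2, i.e. scores in bits):

```python
from math import log2

# Log-odds match/mismatch scores for 50% average conservation
# with uniform background P(a) = 1/4, as in the example above.
p_match_pair = 1 / 8      # each of the 4 identical pairs
p_mismatch_pair = 1 / 24  # each of the 12 mismatching pairs
background = 1 / 4

s_match = log2(p_match_pair / (background * background))       # = 1.0
s_mismatch = log2(p_mismatch_pair / (background * background)) # ≈ -0.585

# Conservation level s at which the expected score is zero:
# s * s_match + (1 - s) * s_mismatch = 0
cutoff = -s_mismatch / (s_match - s_mismatch)  # ≈ 0.37
```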
21. Substitution matrices
- A more meaningful way to assign match/mismatch scores
- For protein sequences, different substitutions have dramatically different frequencies!
- PAM matrices:
  - Start from a curated set of very similar protein sequences
  - Construct ancestral sequences (using parsimony)
  - Calculate A_ab: frequency of letters a and b interchanging
  - Calculate B_ab = P(b | a) = A_ab / (Σ_c A_ac)
  - Adjust matrix B so that Σ_{a,b} q_a q_b B_ab = 0.01 → PAM(1)
  - Let PAM(N) = PAM(1)^N -- Common: PAM(250)
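Extrapolation by matrix power can be sketched with a toy 2-letter alphabet; the matrix below is illustrative, not a real PAM(1):

```python
import numpy as np

# PAM(N) = PAM(1)^N; rows are conditional distributions P(b | a).
pam1 = np.array([[0.99, 0.01],
                 [0.02, 0.98]])

pam250 = np.linalg.matrix_power(pam1, 250)
# Rows of PAM(N) remain probability distributions for any N.
```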
22. Substitution matrices
- BLOSUM matrices:
  - Start from the BLOCKS database (curated, gap-free alignments)
  - Cluster sequences according to > X% identity
  - Calculate A_ab: # of aligned a-b pairs between distinct clusters, correcting by 1/(mn), where m, n are the two cluster sizes
  - Estimate:
    - P(a) = (Σ_b A_ab) / (Σ_cd A_cd);  P(a, b) = A_ab / (Σ_cd A_cd)
23. BLOSUM matrices

(Matrices shown: BLOSUM 50 and BLOSUM 62; the two are scaled differently.)