A Hidden Markov Model for Progressive Multiple Alignment

Slides: 22

Provided by: sow6

Learn more at: https://www.eecis.udel.edu

1
A Hidden Markov Model for Progressive Multiple
Alignment

Ari Löytynoja and Michel C. Milinkovitch
Appeared in Bioinformatics, Vol. 19, No. 12, 2003
Presented by Sowmya Venkateswaran
April 20, 2006

2
Outline

Motivations
Drawbacks of existing methods
System and Methods
Substitution Model
Hidden Markov Model
Pairwise Alignment using Viterbi Algorithm
Posterior Probability
Multiple Alignment
Results
Discussion

3
Motivation

Progressive alignment techniques are used for
Multiple Sequence Alignment
Used to deduce the phylogeny.
Identify protein families.
Probabilistic methods can be used to estimate the
reliability of global/local alignments.

4
Drawbacks of existing Systems

Iterative application of global/local pairwise
sequence alignment algorithms does not guarantee
a globally optimal alignment.
The best-scoring alignment may not correspond to
the true alignment. Hence the reliability of a
score/alignment needs to be inferred.

5
System and Methods

The idea is to provide a probabilistic framework
for a guide tree and define a vector of
probabilities at each character site.
The guide tree is constructed by Neighbor-Joining
clustering after producing a distance matrix. It
can also be imported from CLUSTALW.
At each internal node, a probabilistic alignment
is performed. Pointers from parent to child sites
are stored, together with a vector of
probabilities of the different character states
(A/C/T/G/- for nucleotides, or the 20 amino acids
plus a gap).

6
Substitution Model

Consider two sequences $x_{1..n}$ and $y_{1..m}$ whose
alignment we would like to find; their parent
in the guide tree is $z_{1..l}$.
$p_a(x_i)$ is the probability that site $x_i$ contains
character a.
At a terminal node, $p_a(x_i) = 1$ if character a
appears at the site, else it is 0.
At internal nodes, different characters have
different probabilities summing to 1.
If the observed character is ambiguous, the
probability is shared among the possible characters.

7
Emission Probabilities

$p_{x_i,y_j}$ represents the probability that $x_i$ and $y_j$
are aligned.
$p_{x_i,y_j} = p_{z_k}(x_i,y_j) = \sum_a p_{z_k=a}(x_i,y_j)$
$p_{z_k=a}(x_i,y_j) = q_a \sum_b s_{ab}\,p_b(x_i) \sum_b s_{ab}\,p_b(y_j)$
$q_a$ is the character background probability.
$s_{ab}$, the probability of aligning characters a and b,
is calculated with the Jukes-Cantor model:
$s_{ab} = \frac{1}{n} + \frac{n-1}{n}\, e^{-\frac{n}{n-1} v}$ when $a = b$
$s_{ab} = \frac{1}{n} - \frac{1}{n}\, e^{-\frac{n}{n-1} v}$ when $a \neq b$
n is the size of the alphabet,
v is the NJ-estimated branch length.
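
The match emission above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `jukes_cantor` and `match_emission` are hypothetical, the alphabet is assumed to be the four nucleotides without a gap state, and a single branch length v is used for both children, as on the slide.

```python
import math

def jukes_cantor(n, v):
    """Return (s_same, s_diff): s_ab for a == b and for a != b."""
    decay = math.exp(-n / (n - 1) * v)
    s_same = 1.0 / n + (n - 1) / n * decay
    s_diff = 1.0 / n - 1.0 / n * decay
    return s_same, s_diff

def match_emission(px, py, q, v):
    """p_{x_i,y_j} = sum_a q_a * (sum_b s_ab p_b(x_i)) * (sum_b s_ab p_b(y_j)).

    px, py: character probability vectors of the two child sites.
    q: background probabilities; v: branch length (assumed shared).
    """
    n = len(q)
    s_same, s_diff = jukes_cantor(n, v)
    total = 0.0
    for a in range(n):
        to_x = sum((s_same if a == b else s_diff) * px[b] for b in range(n))
        to_y = sum((s_same if a == b else s_diff) * py[b] for b in range(n))
        total += q[a] * to_x * to_y
    return total

# Two terminal sites that both show the first character (e.g. A).
q = [0.25, 0.25, 0.25, 0.25]
p_same = match_emission([1, 0, 0, 0], [1, 0, 0, 0], q, v=0.3)
p_diff = match_emission([1, 0, 0, 0], [0, 1, 0, 0], q, v=0.3)
```

With a positive branch length, identical characters give a higher emission probability than different ones, and each row of the substitution model sums to 1.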

8
Probabilities

To find $p_{x_i,-}$: the probability that $z_k$ evolved
to a character at one of the child sites and a
gap at the other child is
$p_{z_k=a}(x_i,-) = q_a \sum_b s_{ab}\,p_b(x_i)\, s_{a-}$
The same applies for $p_{-,y_j}$. $s_{a-}$ is computed
just like $s_{ab}$.
Any other model can be used to calculate $s_{ab}$
instead of the Jukes-Cantor model. For example, a
PAM (20 x 20) substitution matrix can be modified
to include gaps, giving a (21 x 21) matrix, and
the substitution probabilities can be derived
from that.

9
Hidden Markov Model
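States and emissions
M emits $p_{x_i,y_j}$, X emits $p_{x_i,-}$, Y emits $p_{-,y_j}$
Transitions
M to M: 1-2d
M to X: d, M to Y: d
X to X: e, X to M: 1-e
Y to Y: e, Y to M: 1-e
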
10
Hidden Markov Model

d: probability of moving to an insert state (gap
opening penalty); the lower the value, the higher
the penalty.
e: probability of staying in an insert state
(gap extension penalty); again, the lower the
value, the higher the extension penalty.
$p_{x_i,y_j}$, $p_{x_i,-}$, $p_{-,y_j}$: emission frequencies for
the match, insert X and insert Y states.
For testing purposes, d and e were estimated from
pairwise alignments of terminal sequences such
that $d = 1/(2(l_m+1))$ and $e = 1 - 1/(l_g+1)$, where
$l_m$ and $l_g$ are the mean lengths of match and gap
segments.
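
As a worked example of the estimates on this slide, the sketch below reads the formulas as d = 1/(2(l_m + 1)) and e = 1 - 1/(l_g + 1), which is the reading that keeps d a valid probability. The function name `gap_parameters` and the input lengths are illustrative assumptions, not values from the paper.

```python
def gap_parameters(mean_match_len, mean_gap_len):
    """Estimate gap opening (d) and extension (e) probabilities
    from the mean lengths of match and gap segments."""
    d = 1.0 / (2.0 * (mean_match_len + 1.0))
    e = 1.0 - 1.0 / (mean_gap_len + 1.0)
    return d, e

# Hypothetical segment lengths: matches average 24 sites, gaps 3 sites.
d, e = gap_parameters(24, 3)  # d = 0.02, e = 0.75
```

Longer match segments push d down (gaps open rarely), and longer gap segments push e toward 1 (gaps extend readily).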

11
Pairwise Alignment

In this probabilistic model, the best alignment
between 2 sequences corresponds to the Viterbi
path through the HMM.
Since there are three states in the model, and
each state needs a 2-D table, we have three 2-D
tables: vM for the match state, and vX and vY for
the gap states.
A move within the M, X or Y table produces an
additional match or extends an existing gap. A
move between the M table and either the X or Y
table closes or opens a gap.

12
Viterbi Recursion

Initialization
$v(0,0) = 1$, $v(i,-1) = v(-1,j) = 0$
Recursion
$v^M(i,j) = p_{x_i,y_j} \max\{(1-2d)\,v^M(i-1,j-1),\ (1-e)\,v^X(i-1,j-1),\ (1-e)\,v^Y(i-1,j-1)\}$
$v^X(i,j) = p_{x_i,-} \max\{d\,v^M(i-1,j),\ e\,v^X(i-1,j)\}$
$v^Y(i,j) = p_{-,y_j} \max\{d\,v^M(i,j-1),\ e\,v^Y(i,j-1)\}$
Termination
$v^E = \max(v^M(n,m),\ v^X(n,m),\ v^Y(n,m))$
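
The recursion can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the emission probabilities are passed in as precomputed 1-indexed tables (`pm[i][j]`, `px[i]`, `py[j]`, with index 0 unused), all emissions are strictly positive, the start cell is treated as a match state, and the traceback stores only the best predecessor rather than relative probabilities. The name `viterbi` and the numbers in the example are hypothetical.

```python
def viterbi(pm, px, py, d, e):
    """Return (best path probability, list of states 'M'/'X'/'Y')."""
    n, m = len(px) - 1, len(py) - 1
    vM = [[0.0] * (m + 1) for _ in range(n + 1)]
    vX = [[0.0] * (m + 1) for _ in range(n + 1)]
    vY = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = {}  # (state, i, j) -> state of the best predecessor cell
    vM[0][0] = 1.0  # assumption: v(0,0) = 1 applies to the match table
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                best, prev = max(((1 - 2 * d) * vM[i - 1][j - 1], "M"),
                                 ((1 - e) * vX[i - 1][j - 1], "X"),
                                 ((1 - e) * vY[i - 1][j - 1], "Y"))
                vM[i][j] = pm[i][j] * best
                back["M", i, j] = prev
            if i > 0:
                best, prev = max((d * vM[i - 1][j], "M"),
                                 (e * vX[i - 1][j], "X"))
                vX[i][j] = px[i] * best
                back["X", i, j] = prev
            if j > 0:
                best, prev = max((d * vM[i][j - 1], "M"),
                                 (e * vY[i][j - 1], "Y"))
                vY[i][j] = py[j] * best
                back["Y", i, j] = prev
    # Termination: pick the best of the three tables at (n, m).
    score, state = max((vM[n][m], "M"), (vX[n][m], "X"), (vY[n][m], "Y"))
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append(state)
        prev = back[state, i, j]
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "X":
            i -= 1
        else:
            j -= 1
        state = prev
    path.reverse()
    return score, path

# Two sequences of length 2 whose sites match best along the diagonal.
pm = [[0, 0, 0], [0, 0.2, 0.01], [0, 0.01, 0.2]]
px = [0, 0.05, 0.05]
py = [0, 0.05, 0.05]
score, path = viterbi(pm, px, py, d=0.05, e=0.5)
```

For long sequences the products underflow, so a practical version would work with log probabilities and replace the products with sums.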

13
Viterbi traceback

At each cell, the relative probabilities of the
different moves entering the cell are stored. For
example,
$p_{M-M} = (1-2d)\,v^M(i-1,j-1) / N(i,j)$
where N(i,j) is the normalizing constant,
given by
$N(i,j) = (1-2d)\,v^M(i-1,j-1) + (1-e)\,[v^X(i-1,j-1) + v^Y(i-1,j-1)]$
The above is calculated for each of the three
tables.
A traceback algorithm is used to find the best
path: a match step creates pointers from the
parent site to both child sites, and a gap step
creates a pointer to one child site and a gap for
the second child site.

14
Posterior Probabilities-Forward algorithm

Forward algorithm: the sum of the probabilities of
all paths entering a given cell from the start
position.
Initialization
$f(0,0) = 1$, $f(i,-1) = f(-1,j) = 0$
Recursion
$i = 0,\dots,n$; $j = 0,\dots,m$, except $(0,0)$
$f^M(i,j) = p_{x_i,y_j}\,[(1-2d)\,f^M(i-1,j-1) + (1-e)\,(f^X(i-1,j-1) + f^Y(i-1,j-1))]$
$f^X(i,j) = p_{x_i,-}\,[d\,f^M(i-1,j) + e\,f^X(i-1,j)]$
$f^Y(i,j) = p_{-,y_j}\,[d\,f^M(i,j-1) + e\,f^Y(i,j-1)]$
Termination
$f^E = f^M(n,m) + f^X(n,m) + f^Y(n,m)$
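
A minimal Python sketch of the forward recursion, under the same illustrative conventions as before: emissions are supplied as 1-indexed tables with index 0 unused, and the start cell is assumed to behave like a match state. The function name `forward` and the example values are hypothetical.

```python
def forward(pm, px, py, d, e):
    """Return the forward tables fM, fX, fY and the total probability fE."""
    n, m = len(px) - 1, len(py) - 1
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0  # assumption: f(0,0) = 1 applies to the match table
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = pm[i][j] * (
                    (1 - 2 * d) * fM[i - 1][j - 1]
                    + (1 - e) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = px[i] * (d * fM[i - 1][j] + e * fX[i - 1][j])
            if j > 0:
                fY[i][j] = py[j] * (d * fM[i][j - 1] + e * fY[i][j - 1])
    return fM, fX, fY, fM[n][m] + fX[n][m] + fY[n][m]

# x has two sites, y has one: either x1 or x2 is matched to y1.
pm = [[0, 0], [0, 0.2], [0, 0.1]]
px = [0, 0.05, 0.05]
py = [0, 0.05]
*_, total = forward(pm, px, py, d=0.05, e=0.5)
```

In this example only two paths have non-zero probability (match then gap, or gap then match), because the model has no direct transition between the two insert states; `total` is the sum of their probabilities.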

15
Backward algorithm

Sum of the probabilities of all possible alignments
between subsequences $x_{i..n}$ and $y_{j..m}$.
Initialization
$b(n,m) = 1$, $b(i,m+1) = b(n+1,j) = 0$
Recursion
$i = n,\dots,1$; $j = m,\dots,1$, except $(n,m)$
$b^M(i,j) = (1-2d)\,p_{x_{i+1},y_{j+1}}\,b^M(i+1,j+1) + d\,[p_{x_{i+1},-}\,b^X(i+1,j) + p_{-,y_{j+1}}\,b^Y(i,j+1)]$
$b^X(i,j) = (1-e)\,p_{x_{i+1},y_{j+1}}\,b^M(i+1,j+1) + e\,p_{x_{i+1},-}\,b^X(i+1,j)$
$b^Y(i,j) = (1-e)\,p_{x_{i+1},y_{j+1}}\,b^M(i+1,j+1) + e\,p_{-,y_{j+1}}\,b^Y(i,j+1)$

16
Reliability Check

Assumption: the posterior probability of the sites
on the alignment path is a valid estimator of the
local reliability of the alignment, since it gives
the proportion of the total probability
corresponding to all alignments passing through
the cell (i,j).
The posterior probability for a match is given by
$P(x_i \diamond y_j \mid x, y) = f^M(i,j)\, b^M(i,j) / f^E$
where $f^M$ and $b^M$ are the total probabilities of
all possible alignments between subsequences
$x_{1..i}$ and $y_{1..j}$, and $x_{i..n}$ and $y_{j..m}$, respectively.
Similar probabilities are calculated for the
Insert X and Insert Y states.
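
The match posterior can be sketched by combining a forward pass and a backward pass. The code below is an illustrative assumption-laden sketch, not the ProAlign implementation: emissions are 1-indexed tables with index 0 unused, the start cell is treated as a match state, b(n,m) = 1 is applied to all three backward tables, and the function name `match_posterior` is hypothetical. The forward pass is repeated inside the function so that the block runs on its own.

```python
def match_posterior(pm, px, py, d, e):
    """Return post[i][j] = fM(i,j) * bM(i,j) / fE for the match state."""
    n, m = len(px) - 1, len(py) - 1

    def table():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass.
    fM, fX, fY = table(), table(), table()
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = pm[i][j] * (
                    (1 - 2 * d) * fM[i - 1][j - 1]
                    + (1 - e) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = px[i] * (d * fM[i - 1][j] + e * fX[i - 1][j])
            if j > 0:
                fY[i][j] = py[j] * (d * fM[i][j - 1] + e * fY[i][j - 1])
    fE = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass; moves that would leave the table contribute 0.
    bM, bX, bY = table(), table(), table()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            to_m = pm[i + 1][j + 1] * bM[i + 1][j + 1] if i < n and j < m else 0.0
            to_x = px[i + 1] * bX[i + 1][j] if i < n else 0.0
            to_y = py[j + 1] * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * d) * to_m + d * (to_x + to_y)
            bX[i][j] = (1 - e) * to_m + e * to_x
            bY[i][j] = (1 - e) * to_m + e * to_y

    return [[fM[i][j] * bM[i][j] / fE for j in range(m + 1)]
            for i in range(n + 1)]

# x has two sites, y has one: which site of x is aligned to y1?
pm = [[0, 0], [0, 0.2], [0, 0.1]]
px = [0, 0.05, 0.05]
py = [0, 0.05]
post = match_posterior(pm, px, py, d=0.05, e=0.5)
```

In this small example `post[1][1]` and `post[2][1]` sum to 1, since every path with non-zero probability matches y1 to exactly one of the two sites of x.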

17
Multiple alignment

Each parent node site has a vector of
probabilities corresponding to each possible
character state (including the gap). For a match,
$p_a(z_k) = p_{z_k=a}(x_i,y_j) / \sum_b p_{z_k=b}(x_i,y_j)$
Pairwise alignment builds the tree progressively,
from the terminal nodes towards an arbitrary
root.
Once root node is defined, trace-back is done to
find multiple alignment of the nodes below since
each node stores pointers to the matching child
sites.
If a gap occurs in one of the internal nodes, a
gap character state is introduced in all of the
sequences of that sub-tree, and recursive call
will not proceed further in that branch.
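
For the match case on this slide, the parent site's probability vector is the per-character emission term normalized over all characters. The sketch below illustrates that normalization under simplifying assumptions: a four-letter nucleotide alphabet without the gap state, the Jukes-Cantor model, and one shared branch length v. The function name `parent_state_vector` is hypothetical.

```python
import math

def parent_state_vector(px, py, q, v):
    """p_a(z_k) for a match: q_a * (sum_b s_ab p_b(x_i)) * (sum_b s_ab p_b(y_j)),
    normalized so that the vector sums to 1."""
    n = len(q)
    decay = math.exp(-n / (n - 1) * v)
    s_same = 1.0 / n + (n - 1) / n * decay  # s_ab for a == b
    s_diff = 1.0 / n - 1.0 / n * decay      # s_ab for a != b
    terms = []
    for a in range(n):
        to_x = sum((s_same if a == b else s_diff) * px[b] for b in range(n))
        to_y = sum((s_same if a == b else s_diff) * py[b] for b in range(n))
        terms.append(q[a] * to_x * to_y)
    total = sum(terms)
    return [t / total for t in terms]

# Child sites show two different characters (say A and C).
pz = parent_state_vector([1, 0, 0, 0], [0, 1, 0, 0], [0.25] * 4, v=0.3)
```

Here the parent is equally likely to be either of the two observed characters, and less likely to be one of the unobserved ones. The resulting vector is what the next alignment up the guide tree would use as its site probabilities.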

18
Testing

Algorithms were tested on
(i) Simulated nucleotide sequences.
50 random data sets were generated using the
program Rose. A random root sequence (length 500)
was evolved on a random tree to yield sequences of
low (0.5 substitutions per site) and high (1.0)
divergence. Also, the insertion/deletion length
distribution was set to short or long.
(ii) Amino acid data sets from Ref1 of the
BAliBASE database. Ref1 contains alignments of
fewer than six equidistant sequences, i.e., the
percent identity between any two sequences is
within a specified range, with no large insertion
or deletion. The data sets were divided into three
groups based on length, and further into three
based on similarity.

19
Results of Simulation on Nucleotide Sequences
20
Type1 and Type 2 errors vs. minimum posterior
probability
21
Performance and Future Work

ProAlign performs better than ClustalW for the
nucleotide sequences, but not for amino acid
sequences with sequence identity below 25%.
A possible reason is that the model does not take
protein secondary structure into account, so the
HMM could be extended to model protein secondary
structure as well.
Minimum posterior probability correlates well
with correctness, so it can be used to detect and
remove unreliably aligned regions.