Title: Protein Multiple Alignment
1 Protein Multiple Alignment
2 Papers
- MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar
- ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
3 Outline
- Introduction
- Background
- MUSCLE
- ProbCons
- Conclusion
4 Introduction
- What is multiple protein alignment?
- Given N sequences of amino acids x1, x2, ..., xN
- Insert gaps in each of the xi so that
- All sequences have the same length
- The score of the global alignment is maximized
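As a concrete toy example of this definition, the sketch below scores a small gapped alignment with a sum-of-pairs scheme. The scoring values (+1 match, -1 mismatch, -1 gap vs. residue, 0 gap vs. gap) are illustrative only, not from either paper:

```python
from itertools import combinations

# Illustrative sum-of-pairs column score (hypothetical values):
# +1 match, -1 mismatch, -1 gap vs. residue, 0 gap vs. gap.
def sum_of_pairs(column):
    score = 0
    for a, b in combinations(column, 2):
        if a == '-' and b == '-':
            continue            # gap vs. gap: 0
        elif a == '-' or b == '-':
            score -= 1          # gap vs. residue
        elif a == b:
            score += 1          # match
        else:
            score -= 1          # mismatch
    return score

def msa_score(rows):
    # all rows must already be padded to the same length with gaps
    assert len({len(r) for r in rows}) == 1
    return sum(sum_of_pairs(col) for col in zip(*rows))

alignment = ["GKG-S", "GKGAS", "G-GAS"]
print(msa_score(alignment))   # 7
```

The exact scheme does not matter here; the point is that inserting gaps makes the rows equal-length so that every column can be scored.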
5 Introduction
- Motivation
- Phylogenetic tree estimation
- Secondary structure prediction
- Identification of critical regions
7 Background
8 Background
- Example: two DNA sequences to be aligned
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
9 Background
- Unfortunately, this can get very expensive
- Aligning N sequences of length L by dynamic programming requires a matrix of size L^N, where each cell in the matrix has 2^N - 1 neighbors
- This gives a total time complexity of O(2^N L^N)
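This exponential cost is why exact dynamic programming is only practical pairwise: for N = 2 the table is just L1 × L2 and the recurrence is the classic Needleman-Wunsch algorithm. A minimal sketch with illustrative scores (+1 match, -1 mismatch, -1 gap; not from either paper):

```python
# Needleman-Wunsch global alignment score for two sequences.
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                     # align prefix of x to gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap                     # align prefix of y to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F[n][m]

print(needleman_wunsch("AGTGCC", "AGTACC"))   # 4
```

For N sequences the same recurrence generalizes to an N-dimensional table with 2^N - 1 predecessors per cell, which is exactly the O(2^N L^N) blow-up above.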
11 MUSCLE
12 MUSCLE
- Basic Strategy: a progressive alignment is built, to which horizontal refinement is applied
- Three stages
- At the end of each stage, a multiple alignment is available and the algorithm can be terminated
13 Three Stages
- Draft Progressive
- Improved Progressive
- Refinement
14 Stage 1: Draft Progressive
- Similarity Measure
- Calculated using k-mer counting
- Example: in the sequence ACCATGCGAATGGTCCACAATG, the k-mer ATG occurs 3 times and CCA occurs 2 times
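The counts in the example can be reproduced directly. Note that MUSCLE's actual k-mer measure uses a compressed amino-acid alphabet and a normalized similarity score; this is only a sketch of the counting idea, and the normalization shown is an assumption:

```python
from collections import Counter

# Count every overlapping k-mer in a sequence.
def kmer_counts(seq, k=3):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical similarity: shared k-mers normalized by the
# number of k-mers in the shorter sequence.
def kmer_similarity(a, b, k=3):
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(ca[m], cb[m]) for m in ca)
    return shared / (min(len(a), len(b)) - k + 1)

seq = "ACCATGCGAATGGTCCACAATG"
counts = kmer_counts(seq)
print(counts["ATG"], counts["CCA"])   # 3 2, matching the slide
```

k-mer counting needs no alignment at all, which is why it is fast enough for the draft stage.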
15 Stage 1: Draft Progressive
- Distance estimate
- Based on the similarities, construct a triangular distance matrix
16 Stage 1: Draft Progressive
- Tree construction
- From the distance matrix we construct a tree
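Tree construction here means clustering the distance matrix into a binary guide tree (UPGMA is MUSCLE's default clustering method). A minimal UPGMA sketch over a hypothetical 3-sequence distance matrix:

```python
# Minimal UPGMA: repeatedly merge the closest pair of clusters,
# updating distances as size-weighted averages.
def upgma(dist, labels):
    # dist: {(i, j): d} with i < j indexing into labels
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (subtree, size)
    d = dict(dist)
    next_id = len(labels)
    while len(clusters) > 1:
        (i, j), _ = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        del d[(i, j)]
        for k in list(clusters):
            # UPGMA update: size-weighted average of the two old distances
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            d[(min(next_id, k), max(next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree

tree = upgma({(0, 1): 2.0, (0, 2): 6.0, (1, 2): 6.0}, ["x1", "x2", "x3"])
print(tree)   # ('x3', ('x1', 'x2'))
```

The nesting of the returned tuples gives the branching order that the progressive alignment then follows.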
17 Stage 1: Draft Progressive
18 Stage 1: Draft Progressive
- Progressive alignment
- A progressive alignment is built by following the
branching order of the tree. This yields a
multiple alignment of all input sequences at the
root.
19 Stage 2: Improved Progressive
- Attempts to improve the tree and uses it to build
a new progressive alignment. This stage may be
iterated.
20 Stage 2: Improved Progressive
- Similarity Measure
- Similarity is calculated for each pair of
sequences using fractional identity computed from
their mutual alignment in the current multiple
alignment
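Fractional identity can be sketched as follows; the exact handling of gap columns is an assumption and may differ from MUSCLE's definition:

```python
# Fractional identity between two rows of a multiple alignment:
# identical residue pairs divided by the number of columns where
# both rows carry a residue (gap columns are skipped).
def fractional_identity(row_x, row_y):
    pairs = [(a, b) for a, b in zip(row_x, row_y)
             if a != '-' and b != '-']
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

print(fractional_identity("AGTC-AGC", "AGTCCAGT"))   # 6/7
```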
21 Stage 2: Improved Progressive
- Tree construction
- A tree is constructed by computing a Kimura
distance matrix and applying a clustering method
to it
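The Kimura correction converts an observed fractional identity into an evolutionary distance; the formula below is the one the MUSCLE paper cites (for very distant pairs MUSCLE falls back to a lookup table, which is omitted here):

```python
import math

# Kimura distance: d = -ln(1 - D - D^2/5),
# where D is the observed fraction of differing positions.
def kimura_distance(fractional_id):
    D = 1.0 - fractional_id
    return -math.log(1.0 - D - D * D / 5.0)

print(round(kimura_distance(0.9), 4))   # 0.1076
```

The correction accounts for multiple substitutions at the same site, so distances grow faster than 1 - identity as sequences diverge.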
22 Stage 2: Improved Progressive
- Tree comparison
- The new tree is compared to the previous tree by
identifying the set of internal nodes for which
the branching order has changed
23 Stage 2: Improved Progressive
- Progressive alignment
- A new progressive alignment is built
24 Stage 3: Refinement
- Performs iterative refinement
25 Stage 3: Refinement
- Choice of bipartition
- An edge is removed from the tree, dividing the sequences into two disjoint subsets
(tree diagram: removing an edge splits the leaves X1-X5 into two groups)
26 Stage 3: Refinement
- Profile Extraction
- The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed
27 Stage 3: Refinement
- Re-alignment
- The two profiles are then realigned with each
other using profile-profile alignment.
28 Stage 3: Refinement
- Accept/Reject
- The score of the new alignment is computed; if it is higher than the score of the old alignment, the new alignment is retained, otherwise it is discarded
29 MUSCLE Review
- Performance
- For alignment of N sequences of length L
- Space complexity: O(N^2 + L^2)
- Time complexity: O(N^4 + NL^2)
- Time complexity without refinement: O(N^3 + NL^2)
31 Hidden Markov Models (HMMs)
(pair-HMM diagram: a match state M emits aligned character pairs from sequences X and Y, while insert states I and J emit characters from X alone or Y alone; example alignment AGCC-AGC over -GCCCAGT has state path IMMMJMMM)
32 Pairwise Alignment
- Viterbi Algorithm
- Picks the single alignment that is most likely to be the correct alignment
- However, the most likely alignment is not necessarily the most accurate
- Alternative: find the alignment of maximum expected accuracy
33 Lazy Teacher Analogy
- 10 students take a 10-question true-false quiz
- How do you make the answer key?
- Viterbi approach: use the answer sheet of the best student
- MEA approach: weighted majority vote
34 Viterbi vs. MEA
- Viterbi
- Picks the alignment with the highest chance of being completely correct
- Maximum Expected Accuracy
- Picks the alignment with the highest expected number of correct predictions
35 ProbCons
- Basic Strategy: uses a pair hidden Markov model (HMM) to compute the probability of an alignment
- Uses Maximum Expected Accuracy instead of the Viterbi alignment
- Five steps
36 Notation
- Given N sequences
- S = {s1, s2, ..., sN}
- a* is the optimal (true) alignment
37 ProbCons
- Step 1: Computation of posterior-probability matrices
- Step 2: Computation of expected accuracies
- Step 3: Probabilistic consistency transformation
- Step 4: Computation of guide tree
- Step 5: Progressive alignment
- Post-processing step: Iterative refinement
38 Step 1: Computation of posterior-probability matrices
- For every pair of sequences x, y ∈ S, compute the matrix P_xy
- P_xy(i, j) = P(x_i ∼ y_j ∈ a* | x, y), which is the probability that x_i and y_j are paired in a*
39 Step 2: Computation of expected accuracies
- For a pairwise alignment a between x and y, define the accuracy with respect to the true alignment a* as the number of residue pairs aligned in both a and a*, divided by the length of the shorter sequence: accuracy(a, a*) = |{pairs common to a and a*}| / min(|x|, |y|)
40 Step 2: Computation of expected accuracies (continued)
- The MEA alignment is found by finding the highest-summing path through the matrix M_xy
- M_xy(i, j) = P(x_i is aligned to y_j | x, y)
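The highest-summing path can be found with a Needleman-Wunsch-style DP over the posterior matrix, with gaps scored 0; the matrix values below are made up for illustration:

```python
# MEA alignment score: best path through the posterior matrix P,
# where a match at (i, j) earns P[i][j] and gaps earn 0.
def mea_score(P):
    n, m = len(P), len(P[0])
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + P[i - 1][j - 1],  # align i with j
                          F[i - 1][j],                        # gap, score 0
                          F[i][j - 1])                        # gap, score 0
    return F[n][m]

# Hypothetical posteriors for a 2 x 3 pair: the best path picks
# the two strong diagonal entries.
P = [[0.9, 0.05, 0.05],
     [0.1, 0.7, 0.2]]
print(round(mea_score(P), 2))   # 1.6
```

Because the cell values are posterior match probabilities, the optimal path maximizes the expected number of correctly aligned residue pairs rather than the probability of the whole alignment.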
41 Consistency
(diagram: if x_i aligns to z_k in a third sequence z, and z_k aligns to y_j, this is indirect evidence that x_i and y_j should align)
42 Step 3: Probabilistic consistency transformation
- Re-estimate the match quality scores P(x_i ∼ y_j ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to other sequences from S into the x-y comparison:
P(x_i ∼ y_j ∈ a* | x, y)  →  P(x_i ∼ y_j ∈ a* | x, y, z)
43 Step 3: Probabilistic consistency transformation (continued)
- In matrix form: P'_xy = (1/|S|) Σ_{z ∈ S} P_xz P_zy
44 Step 3: Probabilistic consistency transformation (continued)
- Since most of the values of P_xz and P_zy will be very small, ignore all entries whose value is smaller than some threshold w
- Use sparse matrix multiplication
- May be repeated
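A sketch of the transformation under an assumed sparse layout (each posterior matrix stored as a dict {(i, j): probability}; the data layout and function names are hypothetical, while ProbCons itself implements this as sparse matrix multiplication):

```python
from collections import defaultdict

# Drop entries below the threshold w to keep the matrices sparse.
def sparsify(P, w=0.01):
    return {ij: p for ij, p in P.items() if p >= w}

# P'_xy = (1/|S|) * sum over z of P_xz P_zy, where the z = x and
# z = y terms reduce to P_xy (P_xx and P_yy are identity matrices).
def consistency_transform(posteriors, seqs, x, y, w=0.01):
    new = defaultdict(float)
    for z in seqs:
        if z == x or z == y:
            contrib = posteriors[(x, y)]
        else:
            Pxz = sparsify(posteriors[(x, z)], w)
            Pzy = sparsify(posteriors[(z, y)], w)
            by_k = defaultdict(list)          # index P_zy rows by k
            for (k, j), q in Pzy.items():
                by_k[k].append((j, q))
            contrib = defaultdict(float)
            for (i, k), p in Pxz.items():     # sparse product P_xz P_zy
                for j, q in by_k[k]:
                    contrib[(i, j)] += p * q
        for ij, p in contrib.items():
            new[ij] += p
    return {ij: p / len(seqs) for ij, p in new.items()}

post = {('x', 'y'): {(0, 0): 0.9},
        ('x', 'z'): {(0, 0): 1.0},
        ('z', 'y'): {(0, 0): 1.0}}
print(consistency_transform(post, ['x', 'y', 'z'], 'x', 'y'))
```

Thresholding keeps the per-pair work proportional to the number of surviving entries rather than to the full matrix size, which is what makes repeating the transformation affordable.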
45 Step 4: Computation of guide tree
- Use E(x, y), the expected accuracy, as a measure of similarity
- Define the similarity of two clusters by the sum-of-pairs of the pairwise similarities
46 Step 5: Progressive alignment
- Align sequence groups hierarchically according to the order specified in the guide tree
- Alignments are scored using a sum-of-pairs scoring function
- Aligned residues are scored according to the match quality scores P(x_i ∼ y_j ∈ a* | x, y)
- Gap penalties are set to 0
47 Post-processing step: iterative refinement
- Much like in MUSCLE
- Randomly partition the alignment into two groups of sequences and realign
- May be repeated
48 ProbCons overview
- ProbCons demonstrated dramatic improvements in alignment accuracy
- Longer running time
- Doesn't use protein-specific alignment information, so it can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm
50 Conclusion
- MUSCLE demonstrated poor accuracy but a very short running time
- ProbCons demonstrated dramatic improvements in alignment accuracy; however, it is much slower than MUSCLE
51 Results
52 Reliability Scores
53 Questions?
54 References
- Robert C. Edgar. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
- Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou. ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment.
55 References (continued)
- Slides on Multiple Sequence Alignment, CS262
- Slides on Sequence similarity, CS273
- Slides on Protein Multiple Alignment, Marina
Sirota