Title: Protein Multiple Alignment
1 Protein Multiple Alignment
2 Papers
- MUSCLE: a multiple sequence alignment method with reduced time and space complexity, by Robert C. Edgar
- ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment, by Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou
3 Outline
- Introduction
- Background
- MUSCLE
- ProbCons
- Conclusion
4 Introduction
- What is multiple protein alignment?
- Given N sequences of amino acids x1, x2, ..., xN
- Insert gaps in each of the xi so that
- All sequences have the same length
- The score of the global alignment is maximized
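As a concrete toy example of this definition, the sketch below scores a small gapped alignment with a sum-of-pairs scheme. The scoring values (+1 match, -1 mismatch, -1 gap vs. residue, 0 gap vs. gap) are illustrative only, not from either paper:

```python
from itertools import combinations

# Illustrative sum-of-pairs column score (hypothetical values):
# +1 match, -1 mismatch, -1 gap vs. residue, 0 gap vs. gap.
def sum_of_pairs(column):
    score = 0
    for a, b in combinations(column, 2):
        if a == '-' and b == '-':
            continue            # gap vs. gap: 0
        elif a == '-' or b == '-':
            score -= 1          # gap vs. residue
        elif a == b:
            score += 1          # match
        else:
            score -= 1          # mismatch
    return score

def msa_score(rows):
    # all rows must already be padded to the same length with gaps
    assert len({len(r) for r in rows}) == 1
    return sum(sum_of_pairs(col) for col in zip(*rows))

alignment = ["GKG-S", "GKGAS", "G-GAS"]
print(msa_score(alignment))   # 7
```

The exact scheme does not matter here; the point is that inserting gaps makes the rows equal-length so that every column can be scored.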
5 Introduction
- Motivation
- Phylogenetic tree estimation
- Secondary structure prediction
- Identification of critical regions
7 Background
8 Background
- Example: two DNA sequences to be aligned
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
9 Background
- Unfortunately, this can get very expensive
- Aligning N sequences of length L by dynamic programming requires a matrix of size L^N, where each cell in the matrix has 2^N - 1 neighbors
- This gives a total time complexity of O(2^N L^N)
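This exponential cost is why exact dynamic programming is only practical pairwise: for N = 2 the table is just L1 × L2 and the recurrence is the classic Needleman-Wunsch algorithm. A minimal sketch with illustrative scores (+1 match, -1 mismatch, -1 gap; not from either paper):

```python
# Needleman-Wunsch global alignment score for two sequences.
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                     # align prefix of x to gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap                     # align prefix of y to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F[n][m]

print(needleman_wunsch("AGTGCC", "AGTACC"))   # 4
```

For N sequences the same recurrence generalizes to an N-dimensional table with 2^N - 1 predecessors per cell, which is exactly the O(2^N L^N) blow-up above.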
11 MUSCLE
12 MUSCLE
- Basic Strategy: a progressive alignment is built, to which horizontal refinement is applied
- Three stages
- At the end of each stage, a multiple alignment is available and the algorithm can be terminated
13 Three Stages
- Draft Progressive
- Improved Progressive
- Refinement
14 Stage 1: Draft Progressive
- Similarity Measure
- Calculated using k-mer counting
- Example: in the sequence ACCATGCGAATGGTCCACAATG, the k-mer ATG occurs 3 times and CCA occurs 2 times
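The counts in the example can be reproduced directly. Note that MUSCLE's actual k-mer measure uses a compressed amino-acid alphabet and a normalized similarity score; this is only a sketch of the counting idea, and the normalization shown is an assumption:

```python
from collections import Counter

# Count every overlapping k-mer in a sequence.
def kmer_counts(seq, k=3):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Hypothetical similarity: shared k-mers normalized by the
# number of k-mers in the shorter sequence.
def kmer_similarity(a, b, k=3):
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(ca[m], cb[m]) for m in ca)
    return shared / (min(len(a), len(b)) - k + 1)

seq = "ACCATGCGAATGGTCCACAATG"
counts = kmer_counts(seq)
print(counts["ATG"], counts["CCA"])   # 3 2, matching the slide
```

k-mer counting needs no alignment at all, which is why it is fast enough for the draft stage.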
15 Stage 1: Draft Progressive
- Distance estimate
- Based on the similarities, construct a triangular distance matrix
16 Stage 1: Draft Progressive
- Tree construction
- From the distance matrix we construct a tree
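Tree construction here means clustering the distance matrix into a binary guide tree (UPGMA is MUSCLE's default clustering method). A minimal UPGMA sketch over a hypothetical 3-sequence distance matrix:

```python
# Minimal UPGMA: repeatedly merge the closest pair of clusters,
# updating distances as size-weighted averages.
def upgma(dist, labels):
    # dist: {(i, j): d} with i < j indexing into labels
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (subtree, size)
    d = dict(dist)
    next_id = len(labels)
    while len(clusters) > 1:
        (i, j), _ = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        del d[(i, j)]
        for k in list(clusters):
            # UPGMA update: size-weighted average of the two old distances
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            d[(min(next_id, k), max(next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree

tree = upgma({(0, 1): 2.0, (0, 2): 6.0, (1, 2): 6.0}, ["x1", "x2", "x3"])
print(tree)   # ('x3', ('x1', 'x2'))
```

The nesting of the returned tuples gives the branching order that the progressive alignment then follows.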
17 Stage 1: Draft Progressive
18 Stage 1: Draft Progressive
- Progressive alignment
- A progressive alignment is built by following the
branching order of the tree. This yields a
multiple alignment of all input sequences at the
root.
19 Stage 2: Improved Progressive
- Attempts to improve the tree and uses it to build
a new progressive alignment. This stage may be
iterated.
20 Stage 2: Improved Progressive
- Similarity Measure
- Similarity is calculated for each pair of
sequences using fractional identity computed from
their mutual alignment in the current multiple
alignment
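Fractional identity can be sketched as follows; the exact handling of gap columns is an assumption and may differ from MUSCLE's definition:

```python
# Fractional identity between two rows of a multiple alignment:
# identical residue pairs divided by the number of columns where
# both rows carry a residue (gap columns are skipped).
def fractional_identity(row_x, row_y):
    pairs = [(a, b) for a, b in zip(row_x, row_y)
             if a != '-' and b != '-']
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

print(fractional_identity("AGTC-AGC", "AGTCCAGT"))   # 6/7
```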
21 Stage 2: Improved Progressive
- Tree construction
- A tree is constructed by computing a Kimura
distance matrix and applying a clustering method
to it
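The Kimura correction converts an observed fractional identity into an evolutionary distance; the formula below is the one the MUSCLE paper cites (for very distant pairs MUSCLE falls back to a lookup table, which is omitted here):

```python
import math

# Kimura distance: d = -ln(1 - D - D^2/5),
# where D is the observed fraction of differing positions.
def kimura_distance(fractional_id):
    D = 1.0 - fractional_id
    return -math.log(1.0 - D - D * D / 5.0)

print(round(kimura_distance(0.9), 4))   # 0.1076
```

The correction accounts for multiple substitutions at the same site, so distances grow faster than 1 - identity as sequences diverge.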
22 Stage 2: Improved Progressive
- Tree comparison
- The new tree is compared to the previous tree by
identifying the set of internal nodes for which
the branching order has changed
23 Stage 2: Improved Progressive
- Progressive alignment
- A new progressive alignment is built
24 Stage 3: Refinement
- Performs iterative refinement
25 Stage 3: Refinement
- Choice of bipartition
- An edge is removed from the tree, dividing the sequences into two disjoint subsets
(tree diagram: removing an edge splits the leaves X1-X5 into two groups)
26 Stage 3: Refinement
- Profile Extraction
- The multiple alignment of each subset is extracted from the current multiple alignment. Columns made up of indels only are removed
27 Stage 3: Refinement
- Re-alignment
- The two profiles are then realigned with each
other using profile-profile alignment.
28 Stage 3: Refinement
- Accept/Reject
- The score of the new alignment is computed; if it is higher than the score of the old alignment, the new alignment is retained, otherwise it is discarded
29 MUSCLE Review
- Performance
- For alignment of N sequences of length L
- Space complexity: O(N^2 + L^2)
- Time complexity: O(N^4 + NL^2)
- Time complexity without refinement: O(N^3 + NL^2)
31 Hidden Markov Models (HMMs)
(pair-HMM diagram: a match state M emits aligned character pairs from sequences X and Y, while insert states I and J emit characters from X alone or Y alone; example alignment AGCC-AGC over -GCCCAGT has state path IMMMJMMM)
32 Pairwise Alignment
- Viterbi Algorithm
- Picks the single alignment that is most likely to be the correct alignment
- However, the most likely alignment is not necessarily the most accurate
- Alternative: find the alignment of maximum expected accuracy
33 Lazy Teacher Analogy
- 10 students take a 10-question true-false quiz
- How do you make the answer key?
- Viterbi approach: use the answer sheet of the best student
- MEA approach: weighted majority vote
34 Viterbi vs. MEA
- Viterbi
- Picks the alignment with the highest chance of being completely correct
- Maximum Expected Accuracy
- Picks the alignment with the highest expected number of correct predictions
35 ProbCons
- Basic Strategy: uses a pair hidden Markov model (HMM) to compute the probability of an alignment
- Uses Maximum Expected Accuracy instead of the Viterbi alignment
- Five steps
36 Notation
- Given N sequences
- S = {s1, s2, ..., sN}
- a* is the optimal (true) alignment
37 ProbCons
- Step 1: Computation of posterior-probability matrices
- Step 2: Computation of expected accuracies
- Step 3: Probabilistic consistency transformation
- Step 4: Computation of guide tree
- Step 5: Progressive alignment
- Post-processing step: Iterative refinement
38 Step 1: Computation of posterior-probability matrices
- For every pair of sequences x, y ∈ S, compute the matrix P_xy
- P_xy(i, j) = P(x_i ∼ y_j ∈ a* | x, y), which is the probability that x_i and y_j are paired in a*
39 Step 2: Computation of expected accuracies
- For a pairwise alignment a between x and y, define the accuracy with respect to the true alignment a* as the number of residue pairs aligned in both a and a*, divided by the length of the shorter sequence: accuracy(a, a*) = |{pairs common to a and a*}| / min(|x|, |y|)
40 Step 2: Computation of expected accuracies (continued)
- The MEA alignment is found by finding the highest-summing path through the matrix M_xy
- M_xy(i, j) = P(x_i is aligned to y_j | x, y)
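The highest-summing path can be found with a Needleman-Wunsch-style DP over the posterior matrix, with gaps scored 0; the matrix values below are made up for illustration:

```python
# MEA alignment score: best path through the posterior matrix P,
# where a match at (i, j) earns P[i][j] and gaps earn 0.
def mea_score(P):
    n, m = len(P), len(P[0])
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + P[i - 1][j - 1],  # align i with j
                          F[i - 1][j],                        # gap, score 0
                          F[i][j - 1])                        # gap, score 0
    return F[n][m]

# Hypothetical posteriors for a 2 x 3 pair: the best path picks
# the two strong diagonal entries.
P = [[0.9, 0.05, 0.05],
     [0.1, 0.7, 0.2]]
print(round(mea_score(P), 2))   # 1.6
```

Because the cell values are posterior match probabilities, the optimal path maximizes the expected number of correctly aligned residue pairs rather than the probability of the whole alignment.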
41 Consistency
(diagram: if x_i aligns to z_k in a third sequence z, and z_k aligns to y_j, this is indirect evidence that x_i and y_j should align)
42 Step 3: Probabilistic consistency transformation
- Re-estimate the match quality scores P(x_i ∼ y_j ∈ a* | x, y) by applying the probabilistic consistency transformation, which incorporates the similarity of x and y to other sequences from S into the x-y comparison:
P(x_i ∼ y_j ∈ a* | x, y)  →  P(x_i ∼ y_j ∈ a* | x, y, z)
43 Step 3: Probabilistic consistency transformation (continued)
- In matrix form: P'_xy = (1/|S|) Σ_{z ∈ S} P_xz P_zy
44 Step 3: Probabilistic consistency transformation (continued)
- Since most of the values of P_xz and P_zy will be very small, ignore all entries whose value is smaller than some threshold w
- Use sparse matrix multiplication
- May be repeated
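A sketch of the transformation under an assumed sparse layout (each posterior matrix stored as a dict {(i, j): probability}; the data layout and function names are hypothetical, while ProbCons itself implements this as sparse matrix multiplication):

```python
from collections import defaultdict

# Drop entries below the threshold w to keep the matrices sparse.
def sparsify(P, w=0.01):
    return {ij: p for ij, p in P.items() if p >= w}

# P'_xy = (1/|S|) * sum over z of P_xz P_zy, where the z = x and
# z = y terms reduce to P_xy (P_xx and P_yy are identity matrices).
def consistency_transform(posteriors, seqs, x, y, w=0.01):
    new = defaultdict(float)
    for z in seqs:
        if z == x or z == y:
            contrib = posteriors[(x, y)]
        else:
            Pxz = sparsify(posteriors[(x, z)], w)
            Pzy = sparsify(posteriors[(z, y)], w)
            by_k = defaultdict(list)          # index P_zy rows by k
            for (k, j), q in Pzy.items():
                by_k[k].append((j, q))
            contrib = defaultdict(float)
            for (i, k), p in Pxz.items():     # sparse product P_xz P_zy
                for j, q in by_k[k]:
                    contrib[(i, j)] += p * q
        for ij, p in contrib.items():
            new[ij] += p
    return {ij: p / len(seqs) for ij, p in new.items()}

post = {('x', 'y'): {(0, 0): 0.9},
        ('x', 'z'): {(0, 0): 1.0},
        ('z', 'y'): {(0, 0): 1.0}}
print(consistency_transform(post, ['x', 'y', 'z'], 'x', 'y'))
```

Thresholding keeps the per-pair work proportional to the number of surviving entries rather than to the full matrix size, which is what makes repeating the transformation affordable.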
45 Step 4: Computation of guide tree
- Use E(x, y), the expected accuracy, as a measure of similarity
- Define the similarity of two clusters by the sum-of-pairs of the pairwise similarities
46 Step 5: Progressive alignment
- Align sequence groups hierarchically according to the order specified in the guide tree
- Alignments are scored using a sum-of-pairs scoring function
- Aligned residues are scored according to the match quality scores P(x_i ∼ y_j ∈ a* | x, y)
- Gap penalties are set to 0
47 Post-processing step: iterative refinement
- Much like in MUSCLE
- Randomly partition the alignment into two groups of sequences and realign
- May be repeated
48 ProbCons overview
- ProbCons demonstrated dramatic improvements in alignment accuracy
- Longer running time
- Doesn't use protein-specific alignment information, so it can be used to align DNA sequences with improved accuracy over the Needleman-Wunsch algorithm
50 Conclusion
- MUSCLE demonstrated poor accuracy but a very short running time
- ProbCons demonstrated dramatic improvements in alignment accuracy; however, it is much slower than MUSCLE
51 Results
52 Reliability Scores
53 Questions?
54 References
- Robert C. Edgar. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
- Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael Brudno, and Serafim Batzoglou. ProbCons: Probabilistic Consistency-based Multiple Sequence Alignment.
55 References (continued)
- Slides on Multiple Sequence Alignment, CS262
- Slides on Sequence similarity, CS273
- Slides on Protein Multiple Alignment, Marina
Sirota