Protein Multiple Alignment - PowerPoint PPT Presentation

1
Protein Multiple Alignment
  • by Konstantin Davydov

2
Papers
  • MUSCLE: a multiple sequence alignment method with
    reduced time and space complexity, by Robert C.
    Edgar
  • ProbCons: Probabilistic Consistency-based
    Multiple Sequence Alignment, by Chuong B. Do,
    Mahathi S. P. Mahabhashyam, Michael Brudno, and
    Serafim Batzoglou

3
Outline
  • Introduction
  • Background
  • MUSCLE
  • ProbCons
  • Conclusion

4
Introduction
  • What is multiple protein alignment?
  • Given N sequences of amino acids x1, x2, ..., xN
  • Insert gaps in each of the xi so that
  • All sequences have the same length
  • The score of the global map (alignment) is maximum

5
Introduction
  • Motivation
  • Phylogenetic tree estimation
  • Secondary structure prediction
  • Identification of critical regions

6
Outline
  • Introduction
  • Background
  • MUSCLE
  • ProbCons
  • Conclusion

7
Background
  • Aligning two sequences
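
Two sequences are usually aligned by dynamic programming. The following is a minimal sketch of Needleman-Wunsch global alignment scoring; the match, mismatch, and gap values are illustrative assumptions, not values taken from the slides.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of x and y.

    The scoring values are assumptions chosen for illustration.
    """
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align x[i-1] with y[j-1]
                score[i - 1][j] + gap,       # x[i-1] against a gap
                score[i][j - 1] + gap,       # y[j-1] against a gap
            )
    return score[-1][-1]
```

The table has (len(x) + 1) by (len(y) + 1) cells and each cell looks at three neighbors, which is the two-sequence case of the cost discussed on the next slides.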

8
Background
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
9
Background
  • Unfortunately, this can get very expensive
  • Aligning N sequences of length L requires a
    matrix of size L^N, where each square in the
    matrix has 2^N - 1 neighbors
  • This gives a total time complexity of
  • O(2^N L^N)

10
Outline
  • Introduction
  • Background
  • MUSCLE
  • ProbCons
  • Conclusion

11
MUSCLE
12
MUSCLE
  • Basic strategy: a progressive alignment is built,
    to which horizontal refinement is then applied
  • Three stages
  • At end of each stage, a multiple alignment is
    available and the algorithm can be terminated

13
Three Stages
  • Draft Progressive
  • Improved Progressive
  • Refinement

14
Stage 1: Draft Progressive
  • Similarity Measure
  • Calculated using k-mer counting.

Sequence: ACCATGCGAATGGTCCACAATG
k-mer   score
ATG     3
CCA     2
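
The counts in the example above can be reproduced with a few lines of Python. `kmer_counts` counts overlapping k-mers. `kmer_similarity` is one plausible way to turn the counts into a similarity (shared k-mers divided by the number of k-mers in the shorter sequence); the slides do not give the exact formula, so treat it as an assumption.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k):
    """Fraction of k-mers shared by x and y (assumed definition).

    Both sequences are assumed to be at least k residues long.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum(min(n, cy[kmer]) for kmer, n in cx.items())
    return shared / (min(len(x), len(y)) - k + 1)


counts = kmer_counts("ACCATGCGAATGGTCCACAATG", 3)
print(counts["ATG"], counts["CCA"])
```

Because no alignment is needed, this measure is cheap to compute for every pair of input sequences.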
15
Stage 1: Draft Progressive
  • Distance estimate
  • Based on the similarities, construct a triangular
    distance matrix.

16
Stage 1: Draft Progressive
  • Tree construction
  • From the distance matrix we construct a tree
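
The slide does not name the clustering method. The sketch below assumes UPGMA (average linkage) and represents the tree as nested tuples; both the representation and the function name are illustrative choices. For simplicity it takes a full symmetric matrix rather than a triangular one.

```python
def upgma(names, dist):
    """Build a tree from a symmetric distance matrix by average linkage.

    names: list of leaf labels; dist[i][j]: distance between names[i]
    and names[j]. Returns a nested tuple such as (("A", "B"), "C").
    """
    trees = list(names)            # subtree currently stored at each index
    sizes = [1] * len(names)       # number of leaves in each subtree
    d = [row[:] for row in dist]   # working copy of the distances
    active = list(range(len(names)))
    while len(active) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: d[p[0]][p[1]])
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for k in active:
            if k != i and k != j:
                new = (sizes[i] * d[i][k] + sizes[j] * d[j][k]) / (sizes[i] + sizes[j])
                d[i][k] = d[k][i] = new
        trees[i] = (trees[i], trees[j])
        sizes[i] += sizes[j]
        active.remove(j)
    return trees[active[0]]
```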

17
Stage 1: Draft Progressive

18
Stage 1: Draft Progressive
  • Progressive alignment
  • A progressive alignment is built by following the
    branching order of the tree. This yields a
    multiple alignment of all input sequences at the
    root.
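
Following the branching order amounts to a post-order traversal of the tree: the children of a node are aligned before the node itself. The sketch below only lists the merges in the order they would happen; the nested-tuple tree and the function name are assumptions, and each merge stands in for a profile-profile alignment that is not implemented here.

```python
def progressive_order(tree):
    """Return the merges of a progressive alignment in execution order.

    tree: a leaf label (str) or a pair (left, right) of subtrees.
    Each merge is a pair (leaves of left group, leaves of right group).
    """
    merges = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_leaves = visit(left)      # align the left subtree first
        right_leaves = visit(right)    # then the right subtree
        merges.append((tuple(left_leaves), tuple(right_leaves)))
        return left_leaves + right_leaves

    visit(tree)
    return merges


for left, right in progressive_order((("A", "B"), ("C", "D"))):
    print(left, "+", right)
```

The last merge is the one at the root, which is where the alignment of all input sequences appears.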

19
Stage 2: Improved Progressive
  • Attempts to improve the tree and uses it to build
    a new progressive alignment. This stage may be
    iterated.

20
Stage 2: Improved Progressive
  • Similarity Measure
  • Similarity is calculated for each pair of
    sequences using fractional identity computed from
    their mutual alignment in the current multiple
    alignment
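
A minimal sketch of fractional identity for two rows taken from the same multiple alignment. It assumes the denominator is the number of columns in which both sequences have a residue; other conventions exist and the slides do not say which one is used.

```python
def fractional_identity(row_x, row_y, gap="-"):
    """Fraction of identical residue pairs among columns with no gap.

    row_x and row_y are two rows of the same multiple alignment, so
    they have equal length. Returns 0.0 if no column has two residues.
    """
    aligned = [(a, b) for a, b in zip(row_x, row_y) if a != gap and b != gap]
    if not aligned:
        return 0.0
    return sum(a == b for a, b in aligned) / len(aligned)
```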

21
Stage 2: Improved Progressive
  • Tree construction
  • A tree is constructed by computing a Kimura
    distance matrix and applying a clustering method
    to it
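
The slide does not spell out the Kimura distance. The sketch below assumes the commonly cited Kimura correction for proteins, d = -ln(1 - D - D^2/5), where D is one minus the fractional identity. Raising an error for very divergent pairs is a choice made here, not something stated in the slides.

```python
import math


def kimura_distance(identity):
    """Kimura-corrected distance from fractional identity (0 to 1).

    Assumes d = -ln(1 - D - D^2 / 5) with D = 1 - identity. The
    formula is undefined once D reaches about 0.85.
    """
    d = 1.0 - identity
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0.0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -math.log(inner)
```

The correction makes the distance grow faster than the raw fraction of differences, which compensates for multiple substitutions at the same site.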

22
Stage 2: Improved Progressive
  • Tree comparison
  • The new tree is compared to the previous tree by
    identifying the set of internal nodes for which
    the branching order has changed

23
Stage 2: Improved Progressive
  • Progressive alignment
  • A new progressive alignment is built

24
Stage 3: Refinement
  • Performs iterative refinement

25
Stage 3: Refinement
  • Choice of bipartition
  • An edge is removed from the tree, dividing the
    sequences into two disjoint subsets

(Figure: tree with leaves X1, X2, X3, X4, X5)
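
Every edge of the tree defines one bipartition: the leaves on one side of the edge and all remaining leaves. The sketch below enumerates them for a rooted tree stored as nested tuples (an assumed representation); the tree shape in the usage line is made up for illustration and is not the one from the slide.

```python
def leaves(node):
    """Return the leaf labels under node, left to right."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def edge_bipartitions(tree):
    """List the distinct leaf bipartitions obtained by cutting one edge."""
    all_leaves = set(leaves(tree))
    seen = set()
    splits = []

    def visit(node):
        below = set(leaves(node))
        if below != all_leaves:  # the root has no edge above it
            key = frozenset([frozenset(below), frozenset(all_leaves - below)])
            if key not in seen:  # both root children give the same split
                seen.add(key)
                splits.append((sorted(below), sorted(all_leaves - below)))
        if not isinstance(node, str):
            visit(node[0])
            visit(node[1])

    visit(tree)
    return splits


for side_a, side_b in edge_bipartitions(((("X1", "X2"), "X3"), ("X4", "X5"))):
    print(side_a, "|", side_b)
```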
26
Stage 3: Refinement
  • Profile Extraction
  • The multiple alignment of each subset is
    extracted from current multiple alignment.
    Columns made up of indels only are removed
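
A minimal sketch of profile extraction, assuming the alignment is stored as a dictionary from sequence name to aligned row (an illustrative representation): keep the rows of one subset, then drop every column that contains only gaps.

```python
def extract_profile(alignment, subset, gap="-"):
    """Return the sub-alignment of subset with all-gap columns removed.

    alignment: dict mapping name -> aligned row (all rows equal length).
    subset: non-empty collection of names to keep.
    """
    rows = {name: alignment[name] for name in subset}
    length = len(next(iter(rows.values())))
    # Keep a column if at least one row of the subset has a residue there.
    keep = [c for c in range(length)
            if any(row[c] != gap for row in rows.values())]
    return {name: "".join(row[c] for c in keep) for name, row in rows.items()}
```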

27
Stage 3: Refinement
  • Re-alignment
  • The two profiles are then realigned with each
    other using profile-profile alignment.

28
Stage 3: Refinement
  • Accept/Reject
  • The score of the new alignment is computed. If it
    is higher than that of the old alignment, the new
    alignment is retained; otherwise it is discarded.
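
The accept/reject rule is a simple hill-climbing step. In the sketch below, `sp_identity_score` is a toy sum-of-pairs score (one point per pair of identical residues in a column) used only to make the example runnable; it is an assumption and not the scoring function of MUSCLE.

```python
from itertools import combinations


def sp_identity_score(rows, gap="-"):
    """Toy sum-of-pairs score: +1 per identical residue pair in a column."""
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            if a == b and a != gap:
                total += 1
    return total


def accept_if_better(current, candidate, score):
    """Keep the candidate only if it scores strictly higher."""
    return candidate if score(candidate) > score(current) else current


old = ["ACGT-", "A-CGT"]
new = ["ACGT", "ACGT"]
print(accept_if_better(old, new, sp_identity_score))
```

Because a candidate is kept only when the score strictly increases, the score of the current alignment never decreases across refinement iterations.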

29
MUSCLE Review
  • Performance
  • For alignment of N sequences of length L
  • Space complexity: O(N^2 + L^2)
  • Time complexity: O(N^4 + N L^2)
  • Time complexity without refinement: O(N^3 + N L^2)

30
Outline
  • Introduction
  • Background
  • MUSCLE
  • ProbCons
  • Conclusion

31
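Hidden Markov Models (HMMs)
(Figure: pair HMM with states M, I, and J. M emits an
aligned pair of residues, I emits a residue of x against
a gap, and J emits a residue of y against a gap.)
x:      AGCC-AGC
y:      -GCCCAGT
states: IMMMJMMM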
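
The state string in the example can be read directly off the two aligned rows. The sketch below assumes the labeling implied by that example: M for a residue pair, I for a residue of x against a gap, and J for a residue of y against a gap.

```python
def state_path(row_x, row_y, gap="-"):
    """Return the pair-HMM state sequence that emits a pairwise alignment."""
    states = []
    for a, b in zip(row_x, row_y):
        if a != gap and b != gap:
            states.append("M")   # both sequences emit a residue
        elif a != gap:
            states.append("I")   # only x emits a residue
        elif b != gap:
            states.append("J")   # only y emits a residue
        else:
            raise ValueError("a column cannot contain two gaps")
    return "".join(states)


print(state_path("AGCC-AGC", "-GCCCAGT"))
```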
32
Pairwise Alignment
  • Viterbi Algorithm
  • Picks the alignment that is most likely to be the
    optimal alignment
  • However, the most likely alignment is not
    necessarily the most accurate
  • Alternative: find the alignment of maximum
    expected accuracy (MEA)

33
Lazy Teacher Analogy
  • 10 students take a 10-question true-false quiz
  • How do you make the answer key?
  • Viterbi approach: use the answer sheet of the
    best student
  • MEA approach: weighted majority vote

34
Viterbi vs MEA
  • Viterbi
  • Picks the alignment with the highest chance of
    being completely correct
  • Maximum Expected Accuracy
  • Picks the alignment with the highest expected
    number of correct predictions

35
ProbCons
  • Basic strategy: uses hidden Markov models (HMMs)
    to predict the probability of an alignment.
  • Uses the maximum expected accuracy (MEA)
    alignment instead of the Viterbi alignment.
  • 5 steps

36
Notation
  • Given N sequences
  • S = {s1, s2, ..., sN}
  • a* is the optimal alignment

37
ProbCons
  • Step 1: Computation of posterior-probability
    matrices
  • Step 2: Computation of expected accuracies
  • Step 3: Probabilistic consistency transformation
  • Step 4: Computation of guide tree
  • Step 5: Progressive alignment
  • Post-processing step: Iterative refinement

38
Step 1: Computation of posterior-probability matrices
  • For every pair of sequences x, y ∈ S, compute the
    matrix Pxy
  • Pxy(i, j) = P(xi ~ yj ∈ a* | x, y), which is the
    probability that xi and yj are paired in a*

39
Step 2: Computation of expected accuracies
  • For a pairwise alignment a between x and y,
    define the accuracy as the number of correctly
    predicted matches divided by the length of the
    shorter sequence

40
Step 2: Computation of expected accuracies (continued)
  • The MEA alignment is found by finding the
    highest-summing path through the matrix
  • Mxy[i, j] = P(xi is aligned to yj | x, y)
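
Finding the highest-summing path is a dynamic program shaped like Needleman-Wunsch with no gap penalty. The sketch below assumes the posterior matrix is a list of lists indexed as post[i][j] and that the path sum is divided by the length of the shorter sequence to obtain the expected accuracy; the function name and return format are illustrative.

```python
def mea_alignment(post):
    """Return (expected accuracy, matched index pairs) for a posterior matrix.

    post[i][j]: probability that residue i of x is aligned to residue j
    of y. Gaps cost nothing, so only matched pairs add to the sum.
    """
    n, m = len(post), len(post[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + post[i - 1][j - 1],
                             best[i - 1][j],
                             best[i][j - 1])
    # Trace back, preferring gaps on ties so zero-probability pairs are skipped.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return best[n][m] / min(n, m), pairs


accuracy, pairs = mea_alignment([[0.9, 0.1, 0.0],
                                 [0.0, 0.2, 0.8]])
print(pairs)
```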

41
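Consistency
(Figure: sequences x, y, and z with residues xi, yj, and
zk; alignments of xi to zk and of zk to yj support
aligning xi to yj.)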
42
Step 3: Probabilistic consistency transformation
  • Re-estimate the match quality scores
    P(xi ~ yj ∈ a* | x, y) by applying the probabilistic
    consistency transformation, which incorporates the
    similarity of x and y to other sequences from S
    into the x-y comparison

P(xi ~ yj ∈ a* | x, y)
P(xi ~ yj ∈ a* | x, y, z)
43
Step 3: Probabilistic consistency transformation (continued)
44
Step 3: Probabilistic consistency transformation (continued)
  • Since most of the values of Pxz and Pzy will be
    very small, we ignore all the entries in which
    the value is smaller than some threshold w.
  • Use sparse matrix multiplication
  • May be repeated
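
A minimal pure-Python sketch of one round of the transformation, assuming it takes the form P'xy = (1/|S|) Σ over z in S of Pxz · Pzy, with Pxx taken to be the identity matrix and Pyx the transpose of Pxy. The data layout (a dictionary keyed by sequence-name pairs), the function names, and the default cutoff are illustrative assumptions.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sparse_product(a, b, cutoff):
    """Multiply a by b, skipping every entry smaller than cutoff."""
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, row in enumerate(a):
        for k, a_ik in enumerate(row):
            if a_ik < cutoff:
                continue
            for j, b_kj in enumerate(b[k]):
                if b_kj >= cutoff:
                    out[i][j] += a_ik * b_kj
    return out


def consistency_transform(post, lengths, cutoff=0.01):
    """Apply one round of the consistency transformation.

    post: dict (x, y) -> posterior matrix of size len(x) by len(y),
          stored once per unordered pair of sequence names.
    lengths: dict name -> sequence length (at least 1).
    """
    names = sorted(lengths)

    def get(x, y):
        if x == y:
            return identity(lengths[x])
        if (x, y) in post:
            return post[(x, y)]
        return transpose(post[(y, x)])

    new_post = {}
    for (x, y) in post:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            prod = sparse_product(get(x, z), get(z, y), cutoff)
            for i in range(lengths[x]):
                for j in range(lengths[y]):
                    total[i][j] += prod[i][j]
        new_post[(x, y)] = [[v / len(names) for v in row] for row in total]
    return new_post
```

Calling the function again on its own output corresponds to repeating the step.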

45
Step 4: Computation of guide tree
  • Use E(x,y) as a measure of similarity
  • Define similarity of two clusters by the
    sum-of-pairs

46
Step 5: Progressive alignment
  • Align sequence groups hierarchically according to
    the order specified in the guide tree.
  • Alignments are scored using a sum-of-pairs scoring
    function.
  • Aligned residues are scored according to the
    match quality scores P(xi ~ yj ∈ a* | x, y)
  • Gap penalties are set to 0.

47
Post-processing step: iterative refinement
  • Much like in MUSCLE
  • Randomly partition alignment into two groups of
    sequences and realign
  • May be repeated
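
A minimal sketch of the random partition, assuming sequence names are unique and that each sequence is assigned to either group with equal probability; the realignment of the two groups is not shown.

```python
import random


def random_bipartition(names, rng):
    """Split names into two non-empty groups at random.

    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    if len(names) < 2:
        raise ValueError("need at least two sequences to partition")
    while True:
        first = [name for name in names if rng.random() < 0.5]
        if 0 < len(first) < len(names):  # retry if one side is empty
            break
    second = [name for name in names if name not in first]
    return first, second


group_a, group_b = random_bipartition(["s1", "s2", "s3", "s4"], random.Random(0))
```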

48
ProbCons overview
  • ProbCons demonstrated dramatic improvements in
    alignment accuracy
  • Longer running time
  • Does not use protein-specific alignment
    information, so it can be used to align DNA
    sequences with improved accuracy over the
    Needleman-Wunsch algorithm.

49
Outline
  • Introduction
  • Background
  • MUSCLE
  • ProbCons
  • Conclusion

50
Conclusion
  • MUSCLE demonstrated poor accuracy, but a very
    short running time.
  • ProbCons demonstrated dramatic improvements in
    alignment accuracy; however, it is much slower
    than MUSCLE.

51
Results
52
Reliability Scores
53
Questions?
54
References
  • Robert C. Edgar.
    MUSCLE: a multiple sequence alignment method with
    reduced time and space complexity
  • Chuong B. Do, Mahathi S. P. Mahabhashyam, Michael
    Brudno, and Serafim Batzoglou.
    ProbCons: Probabilistic Consistency-based
    Multiple Sequence Alignment

55
References (continued)
  • Slides on Multiple Sequence Alignment, CS262
  • Slides on Sequence similarity, CS273
  • Slides on Protein Multiple Alignment, Marina
    Sirota