Title: Multiple Alignment
1. Multiple Alignment
2. Pairwise Dynamic Programming (DP)
Time O(L^2), memory O(L^2)
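A minimal sketch of the pairwise recurrence, to make the O(L^2) cost concrete: the table has (len(a)+1) x (len(b)+1) cells and each cell is filled in constant time. The function name `nw_score`, the linear gap penalty and the match/mismatch/gap values are illustrative assumptions, not taken from the slides.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming (linear gap cost, assumed)."""
    rows, cols = len(a) + 1, len(b) + 1
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,      # a[i-1] against a gap
                F[i][j - 1] + gap,      # b[j-1] against a gap
            )
    return F[-1][-1]


print(nw_score("ACGT", "AGT"))
```

If only the score is needed (not the alignment itself), keeping just the previous row would reduce memory to O(L) while time stays O(L^2).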
3. Three-sequence DP
Time O(L^3), memory O(L^3)
4. Multidimensional DP
- Time O(L^N), memory O(L^N)
- Generally impractical, e.g. for globins (99 aa)
Memory units: k = kilobytes, M = megabytes, G = gigabytes, T = terabytes, P = petabytes, E = exabytes
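To see how quickly the O(L^N) table outgrows memory, here is a rough back-of-the-envelope sketch. It assumes (L+1)^N cells and one byte per cell; both are simplifying assumptions of this example rather than figures from the slides, and the helper names `dp_cells` and `human_bytes` are hypothetical.

```python
def dp_cells(length, n_seqs):
    """Number of cells in an N-dimensional DP table for N sequences of equal length."""
    return (length + 1) ** n_seqs


def human_bytes(n_bytes):
    """Format a byte count with decimal (power-of-1000) unit names."""
    units = ["bytes", "kilobytes", "megabytes", "gigabytes",
             "terabytes", "petabytes", "exabytes"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Globin-length sequences (99 aa), assuming one byte per cell
for n in range(2, 10):
    print(n, "sequences:", human_bytes(dp_cells(99, n)))
```

Under these assumptions the table grows by a factor of 100 with every added sequence, so nine globin-length sequences already need about an exabyte.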
5. Progressive alignment
- Estimate a guide tree (slowest step - why?)
- Proceed up tree, building a profile for each internal node
- Align siblings, going from leaves to root
- Sequence-to-sequence (A-B, D-E)
- Sequence-to-profile (U-C)
- Profile-to-profile (V-W)
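The leaves-to-root order can be sketched as a post-order walk over the guide tree that classifies each merge by what its two children are. The tree topology below, ((A,B),C),(D,E), is inferred from the node labels on the slide (U = A+B, V = U+C, W = D+E) and is an assumption; `merge_order` is a hypothetical helper that only reports the order of steps and does no alignment.

```python
def merge_order(tree):
    """Return (leaves, steps) for a guide tree given as nested 2-tuples of names.

    Each step is (kind, left_leaves, right_leaves), listed in the order a
    progressive aligner would perform them (children before parents).
    """
    if isinstance(tree, str):
        return [tree], []          # a leaf: a single sequence, nothing to merge
    left, right = tree
    left_leaves, left_steps = merge_order(left)
    right_leaves, right_steps = merge_order(right)
    # A child with more than one leaf is already a profile
    is_profile = {len(left_leaves) > 1, len(right_leaves) > 1}
    if is_profile == {False}:
        kind = "sequence-to-sequence"
    elif is_profile == {True}:
        kind = "profile-to-profile"
    else:
        kind = "sequence-to-profile"
    step = (kind, left_leaves, right_leaves)
    return left_leaves + right_leaves, left_steps + right_steps + [step]


guide_tree = ((("A", "B"), "C"), ("D", "E"))   # assumed topology
leaves, steps = merge_order(guide_tree)
for kind, left, right in steps:
    print(kind, left, right)
```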
6. What's a profile?
[Figure: an alignment and the corresponding profile]
7. Profile, a.k.a. Position-specific Weight Matrix (PWM)
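As a minimal sketch of how an alignment turns into a profile, the function below computes per-column residue frequencies. The name `build_profile`, the DNA alphabet, the choice to skip gap characters and the optional pseudocount are all assumptions of this example; the slides do not say how gaps or small counts are handled.

```python
from collections import Counter


def build_profile(alignment, alphabet="ACGT", pseudocount=0.0):
    """Return one {residue: frequency} dict per alignment column."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for column in zip(*alignment):
        # Count residues only; gap characters such as '-' are skipped (assumption)
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        if total == 0:
            raise ValueError("column contains no residues")
        profile.append({x: (counts[x] + pseudocount) / total for x in alphabet})
    return profile


profile = build_profile(["ACGT", "ACGA", "AGGT"])
for column in profile:
    print(column)
```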
8. Sequence logos
Scale each column by its entropy (technically, by the difference between the maximum possible entropy and the column's entropy)
weblogo.berkeley.edu
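The column scaling can be sketched as follows: the total height of a column is the maximum possible entropy minus the column's entropy (in bits), and each letter gets a share of that height proportional to its frequency. This is a sketch of the usual logo convention without any small-sample correction; the function names are hypothetical.

```python
import math


def column_information(freqs, alphabet_size=None):
    """Maximum possible entropy minus observed entropy, in bits."""
    k = alphabet_size or len(freqs)
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(k) - entropy


def letter_heights(freqs, alphabet_size=None):
    """Height of each letter: its frequency times the column's information."""
    info = column_information(freqs, alphabet_size)
    return {x: p * info for x, p in freqs.items()}


conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
mixed = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
print(column_information(conserved), letter_heights(mixed))
```

A perfectly conserved DNA column reaches the maximum of 2 bits, while a uniform column has height zero.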
9. Sequence logos
Globin, B helix to D helix
10. Scoring schemes
- Scoring schemes up to now have been pairwise
- Several ways of scoring a multiple alignment column
- Entropy
- Sum-of-pairs
- Phylogenetic
- Position-specific
- Sequence-profile and profile-profile scoring
11. Entropy-like scores
- If n(x) is the number of times residue x occurs in the column, then p(x) = n(x) / N, where N is the number of rows
- Rewards homogeneous columns
- Assumes each row is an independent draw from the same probability distribution
- Equivalent to choosing p to maximize the column log-likelihood, sum_x n(x) log p(x), subject to sum_x p(x) = 1 (can maximize with Lagrange multipliers)
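The slide's own formula is not preserved here, so the following is a sketch of one standard entropy-like score that matches the bullets above: S = sum_x n(x) log p(x) with p(x) = n(x)/N, which equals minus N times the column entropy. The function name `entropy_score` and the use of natural logarithms are assumptions.

```python
import math
from collections import Counter


def entropy_score(column):
    """Sum over residues of n(x) * log p(x), with p(x) = n(x) / N.

    The score is 0 for a perfectly homogeneous column and becomes more
    negative as the column gets more mixed.
    """
    counts = Counter(column)
    n_total = len(column)
    return sum(n * math.log(n / n_total) for n in counts.values())


for col in ["AAAA", "AAAC", "AACC", "ACGT"]:
    print(col, round(entropy_score(col), 3))
```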
12. Sum-of-pairs score
- Column score: sum over all pairs of rows i < j of Q(x_i, x_j)
- i and j are row indices
- x_i is the residue in row i (similarly x_j)
- Q(a,b) is the pairwise substitution matrix
- Problem: overcounting
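A minimal sketch of the sum-of-pairs column score, summing Q over every unordered pair of rows. The toy match/mismatch matrix built by `toy_matrix` is a hypothetical stand-in for a real substitution matrix, and gaps are not handled.

```python
from itertools import combinations


def toy_matrix(alphabet, match=1, mismatch=-1):
    """Hypothetical substitution matrix: Q[a][b] = match if a == b else mismatch."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


def sum_of_pairs(column, Q):
    """Sum of Q(x_i, x_j) over all pairs of rows i < j in one column."""
    return sum(Q[a][b] for a, b in combinations(column, 2))


Q = toy_matrix("ACGT")
print(sum_of_pairs("AAA", Q), sum_of_pairs("AAC", Q))
```

A column with N rows contributes N(N-1)/2 terms, which is one way to see the overcounting: closely related rows supply the same evidence many times over.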
13. Probabilistic scoring
- Recall the pairwise substitution matrix is Q(a,b) = log q(a,b), where Q is an additive score and q is a multiplicative probability
- Strictly, q is usually not a probability per se, but is related to one, e.g. it might be a likelihood ratio:
q(a,b) = P(a,b) / (P(a) P(b)) = P(b|a) / P(b) = P(a|b) / P(a)
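The relationship between the additive score Q and the likelihood ratio q can be sketched by deriving a log-odds matrix from a joint distribution over aligned residue pairs. The two-letter joint distribution below is made-up illustrative data, and `log_odds_matrix` is a hypothetical helper name.

```python
import math


def log_odds_matrix(joint):
    """Q[(a, b)] = log( P(a,b) / (P(a) P(b)) ) from joint[(a, b)] = P(a, b)."""
    marg_first, marg_second = {}, {}
    for (a, b), p in joint.items():
        marg_first[a] = marg_first.get(a, 0.0) + p     # P(a), summing over b
        marg_second[b] = marg_second.get(b, 0.0) + p   # P(b), summing over a
    return {
        (a, b): math.log(p / (marg_first[a] * marg_second[b]))
        for (a, b), p in joint.items()
        if p > 0
    }


# Illustrative joint distribution over a two-letter alphabet (not real data)
joint = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
Q = log_odds_matrix(joint)
print(Q)
```

Pairs seen more often than expected under independence get a positive score, and pairs seen less often get a negative one.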
14. Phylogenetic score
(but... you don't actually know u,v,w,x. So what do you do? The probabilistic answer: sum them out)
...and then rearrange the sums optimally (pruning)...
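Summing out the unknown residues and rearranging the sums can be sketched as a recursive pruning pass over the tree, where each node passes up the probability of the leaves below it for every possible residue at that node. This assumes the unknowns are the residues at internal nodes; the single shared transition matrix `trans`, the two-letter alphabet and the three-leaf tree are simplifying assumptions of this example, not the slide's model.

```python
def pruning_likelihood(tree, column, alphabet, trans, prior):
    """Likelihood of one alignment column, summed over all internal-node residues.

    tree:   a leaf name (str) or a 2-tuple (left_subtree, right_subtree)
    column: dict mapping leaf name -> observed residue
    trans:  trans[a][b] = P(child = b | parent = a), same on every branch (assumed)
    prior:  prior[a] = P(root = a)
    """
    def partial(node):
        # Returns {a: P(observed leaves below node | residue at node = a)}
        if isinstance(node, str):
            return {a: 1.0 if a == column[node] else 0.0 for a in alphabet}
        left, right = (partial(child) for child in node)
        return {
            a: sum(trans[a][b] * left[b] for b in alphabet)
               * sum(trans[a][c] * right[c] for c in alphabet)
            for a in alphabet
        }

    root = partial(tree)
    return sum(prior[a] * root[a] for a in alphabet)


alphabet = "AB"
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
prior = {"A": 0.5, "B": 0.5}
tree = (("s1", "s2"), "s3")
print(pruning_likelihood(tree, {"s1": "A", "s2": "A", "s3": "A"},
                         alphabet, trans, prior))
```

Each internal node is visited once, so the cost grows linearly with the number of nodes rather than exponentially with the number of unknown residues.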
15. Position-specific score
- Score for aligning two (or more) residues does not depend (directly) on their values
- Instead, you specify particular scores for aligning each pair of positions
- These can be obtained by pre-processing the sequences (e.g. scores derived from posterior probabilities from a Pair HMM), or by other means
- e.g. T-COFFEE, PROBCONS
16. Profile-sequence
- Q(a,b) = log q(a,b) is the pairwise substitution matrix
- Aligning profile p(x) with residue y
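The slide's formula for this score is not preserved, so the sketch below shows two common choices rather than the author's definition: the expected additive score, sum_x p(x) Q(x,y), and the log of the expected likelihood ratio, log sum_x p(x) q(x,y). The function names and the toy matrix are hypothetical.

```python
import math


def expected_score(profile_col, y, Q):
    """Average of the additive scores Q(x, y), weighted by p(x)."""
    return sum(p * Q[x][y] for x, p in profile_col.items())


def log_expected_ratio(profile_col, y, Q):
    """Log of the p-weighted average of q(x, y) = exp(Q(x, y))."""
    return math.log(sum(p * math.exp(Q[x][y]) for x, p in profile_col.items()))


# Toy substitution matrix over a two-letter alphabet (illustrative values)
Q = {"A": {"A": 1.0, "C": -1.0}, "C": {"A": -1.0, "C": 1.0}}
sharp = {"A": 1.0, "C": 0.0}
flat = {"A": 0.5, "C": 0.5}
print(expected_score(flat, "A", Q), log_expected_ratio(flat, "A", Q))
```

The two forms agree when the profile column is a single residue; otherwise the log-of-average form is at least as large as the average-of-logs form.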
17. Profile-profile
- Q(a,b) = log q(a,b) is the pairwise substitution matrix
- Aligning profile p_1(x) with profile p_2(y)
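One plausible form of a profile-profile score, shown here as a sketch because the slide's formula is not preserved, is the expected substitution score when one residue is drawn from each profile column: sum over x and y of p_1(x) p_2(y) Q(x,y). The function name and the toy matrix are hypothetical.

```python
def profile_profile_score(col1, col2, Q):
    """Expected Q(x, y) with x drawn from col1 and y drawn from col2."""
    return sum(
        p1 * p2 * Q[x][y]
        for x, p1 in col1.items()
        for y, p2 in col2.items()
    )


# Toy substitution matrix over a two-letter alphabet (illustrative values)
Q = {"A": {"A": 1.0, "C": -1.0}, "C": {"A": -1.0, "C": 1.0}}
all_a = {"A": 1.0, "C": 0.0}
all_c = {"A": 0.0, "C": 1.0}
flat = {"A": 0.5, "C": 0.5}
print(profile_profile_score(all_a, all_a, Q),
      profile_profile_score(all_a, all_c, Q),
      profile_profile_score(flat, all_a, Q))
```

When one column is a single residue this reduces to the expected-score form of profile-sequence scoring, and when both are single residues it reduces to Q itself.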