Title: Multiple sequence alignment (based on Larry Hunter's slides)
1 Multiple sequence alignment (based on Larry Hunter's slides)
- Generalize our pairwise alignment of sequences to include more than two homologous proteins.
- Looking at more than two sequences gives us much more information.
- Which amino acids are required? Which are correlated?
- Evolutionary/phylogenetic relationships
2 Phylogenetic Trees
Analysis of 20 samples of cytochrome c protein sequences.
Numbers represent nucleotide substitutions in the gene for cytochrome c.
3 Sample MSA
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54

FOS_RAT     IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE   IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK   VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110

FOS_RAT     KTMSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_MOUSE   KTVSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_CHICK   KAP-GGRGQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 151
FOSB_MOUSE  AYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170
FOSB_HUMAN  GYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170
4 Optimal MSA
- Use dynamic programming?
- An optimal alignment algorithm exists, but it is O(2^n l^n), where n is the number of sequences and l is the length of the longest sequence.
- 10 sequences of length 100 take 2^10 × 100^10 ≈ 10^23 operations, around 1 million years at 3 GHz.
- Exponential algorithms strike again.
- So, approximation approaches?
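The running-time estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one operation per clock cycle at 3 GHz, which is an idealization rather than a measured figure.

```python
# Cost estimate for exact multiple alignment by dynamic programming.
n = 10          # number of sequences
length = 100    # length of the longest sequence
clock_hz = 3e9  # assumption: one operation per cycle at 3 GHz

operations = 2**n * length**n            # the O(2^n l^n) bound with constant 1
seconds = operations / clock_hz
years = seconds / (365.25 * 24 * 3600)   # seconds in a Julian year

print(f"operations: {operations:.3e}")
print(f"years:      {years:.3e}")
```

The result is about 10^23 operations and on the order of a million years, in line with the figures on the slide.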
5 Progressive MSA
- Start with pairwise alignments of closely related sequences, and then add more distantly related sequences one at a time.
- Requires information (assumptions) about the phylogenetic relationship a priori.
- Can be estimated from all pairwise comparisons.
- Give total MSA score based on sum of pairwise scores.
- Perhaps weighted to reduce the influence of very similar sequences.
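A sum-of-pairs score of the kind described above might be sketched as follows. The match/mismatch/gap values and the product-of-weights scheme are illustrative assumptions, not the scoring used by any particular program; a real tool would use a substitution matrix and a more careful gap model.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two rows of an MSA column by column (toy scoring values)."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue            # a gap shared by both rows costs nothing
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(msa, weights=None):
    """Total MSA score: weighted sum of all pairwise row scores."""
    if weights is None:
        weights = [1.0] * len(msa)
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        total += weights[i] * weights[j] * pair_score(msa[i], msa[j])
    return total

toy_msa = ["MMFSG", "MMYQG", "-MFQA"]   # hypothetical aligned rows
print(sum_of_pairs(toy_msa))
print(sum_of_pairs(toy_msa, weights=[0.5, 0.5, 1.0]))
```

Giving the first two rows a weight of 0.5 each reduces the contribution of that pair, which is one simple way to reduce the influence of very similar sequences.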
6 Gaps in Progressive MSAs
- How to score gaps?
- Want to align gaps with each other over all sequences. A gap in a pairwise alignment that matches a gap in another pairwise alignment should cost less than introducing a totally new gap.
- It is possible that a new gap could be made to match an older one by shifting around the original pairwise alignment, but at great computational cost.
- Change the gap penalty near conserved domains of various kinds (e.g. secondary structure, hydrophobic regions).
- CLUSTALW: http://www.ebi.ac.uk/clustalw
7 Greedy algorithms
- Progressive MSA programs make the best alignment of a new sequence with the existing ones they can at the time, and then never revisit the decision.
- Even if changing an old decision (e.g. the gaps) could increase the score, this approach doesn't.
- The approach is called greedy (because it takes the best first), and is a common way to cope with exponential problems.
8 Problems with progressive MSA
- Depends crucially on the quality of the pairwise alignments, particularly among the closest matches.
- No suitable resolution to the problem of gap penalties over multiple sequences.
- Works reasonably well for closely related sequences. Even then, manual adjustments are common.
9 Iterative MSA methods
- The idea here is to start with a reasonable approximation to the optimal MSA (e.g. by using a progressive method) and then tweak it to improve it.
- Various optimization techniques have been tried here (e.g. genetic algorithms and simulated annealing).
- Key is the scoring function for the whole MSA.
- Also, which steps to take that are likely to improve the score.
10 Block based methods
- Another approach is to start with short local alignments (sometimes called blocks) and then to reduce the problem to aligning the regions between the blocks.
- Divide and conquer is another common CS approach to exponential problems.
- How to find the blocks?
- DIALIGN (local alignment methods)
- DCA (divide and conquer alignments)
- Tmsa (identify patterns and use them to define blocks)
11 Databases of MSAs
- Once they have been calculated, they can be saved and shared.
- Pfam: database of protein families. Alignments of large numbers of homologous proteins.
- http://www.sanger.ac.uk/Software/Pfam/index.shtml
- TIGRFAMs: database of protein families curated for function, rather than homology
- http://www.tigr.org/TIGRFAMs/index.shtml
12 More web sites
- Web sites offer multiple approaches to MSA.
- Interfaces to multiple different programs
- http://searchlauncher.bcm.tmc.edu/multi-align
- http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli
- Main web-based MSA servers
- http://www.ebi.ac.uk/clustalw
- http://baboon.math.berkeley.edu/mavid/ (genomic seqs)
- See course website for many more listings
13 Protein motifs
- Recall that local alignments can identify similar regions in non-homologous proteins.
- These regions (sometimes called domains) often have shared structure and/or function.
- Example: zinc-finger DNA binding motif
14 Zinc-finger DNA binding motif
15 Protein motifs
- How to define them?
- Consensus sequence
- Regular expression
- Profile (probability for each amino acid at each position)
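The simplest of the three definitions, a consensus sequence, can be read off an alignment by taking the most common residue in each column. The sketch below uses a made-up three-row alignment; ties are broken arbitrarily, which a real tool would need to handle more carefully.

```python
from collections import Counter

def consensus(msa):
    """Return the most common residue in each column of an alignment."""
    columns = zip(*msa)   # transpose rows into columns
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

toy_msa = ["ACD", "ACE", "AED"]   # hypothetical aligned rows
print(consensus(toy_msa))         # prints ACD
```

A consensus keeps only one residue per position; regular expressions and profiles, the other two definitions, retain progressively more of the variation in each column.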
16 ProSite consensus sequences
17 Recognizing ProSite patterns
- L14 ribosome pattern: [GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV]
- Some matching sequences:
- GIIIACGHLIPQTNGACRTYILNDRVV
- GVLLWQPKHCSNAADGAWAWFAATAAVL
- ALIVEANIIILSISGRATTFHATSAVI
- ProSite patterns can be translated into regular expressions, although the bounded length patterns (e.g. [LIV](3,5)) are unwieldy to write down as regexps.
18 Example of ProSite
- [AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
- <A-x-[ST](2)-x(0,1)-V
This pattern, which must be in the N-terminal of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
- <{C}*>
This pattern describes all sequences which do not contain any Cysteines.
- IIRIFHLRNI
This pattern describes all sequences which contain the subsequence 'IIRIFHLRNI'.
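A translation from ProSite-style patterns to regular expressions might look like the sketch below. The function name `prosite_to_regex` is made up for illustration, and only a subset of the syntax is handled: single residues, `[...]` classes, `{...}` exclusions, `x`, repeat counts, and the `<` / `>` anchors. The `*` repeat is deliberately rejected. Python's `re` module supports bounded repetition directly as `{m,n}`.

```python
import re

def prosite_to_regex(pattern):
    """Translate a subset of ProSite pattern syntax into a Python regex."""
    pattern = pattern.rstrip(".")
    regex = ""
    if pattern.startswith("<"):          # '<' anchors at the N-terminus
        regex += "^"
        pattern = pattern[1:]
    anchored_end = pattern.endswith(">")  # '>' anchors at the C-terminus
    if anchored_end:
        pattern = pattern[:-1]
    for element in pattern.split("-"):
        # Separate the element from an optional repeat such as (3) or (9,10).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":                               # any residue
            regex += "."
        elif re.fullmatch(r"\{[A-Z]+\}", core):       # excluded residues
            regex += "[^" + core[1:-1] + "]"
        elif re.fullmatch(r"[A-Z]|\[[A-Z]+\]", core):  # residue or class
            regex += core
        else:
            raise ValueError(f"unsupported element {element!r}")
        if repeat:
            regex += "{" + repeat + "}"
    if anchored_end:
        regex += "$"
    return regex

l14 = "[GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV]"
l14_regex = prosite_to_regex(l14)
print(l14_regex)
for seq in ("GIIIACGHLIPQTNGACRTYILNDRVV",
            "GVLLWQPKHCSNAADGAWAWFAATAAVL",
            "ALIVEANIIILSISGRATTFHATSAVI"):
    print(seq, bool(re.search(l14_regex, seq)))
```

Each of the three sequences is matched in full by the translated expression; the second one needs the 10-residue form of the `x(9,10)` gap.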
19 Regular expressions
- Wide use in computer science; central to the Perl language (see also BioPerl).
- For proteins, a language like ProSite patterns is more intuitive, but often equivalent.
20 Profiles
- Rather than identifying only the consensus (i.e. most common) amino acid at a particular location, we can assign a probability to each amino acid in each position of the domain.
- Example:
     1    2    3
A   .1   .5   .25
C   .3   .1   .25
D   .2   .2   .25
E   .4   .2   .25
21 Applying a profile
- Calculate score (probability of match) for a profile at each position in a sequence by multiplying individual probabilities.
- Sliding window
- Can transform probability to significance given random distribution assumption
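The sliding-window calculation can be sketched with the four-letter example profile from the Profiles slide. The test sequence `EACDA` is made up, and the log-odds step assumes a uniform background over the four letters; it is a first step toward a significance measure, not a p-value.

```python
import math

# Example profile: probability of each residue at positions 1..3.
PROFILE = {
    "A": [0.1, 0.5, 0.25],
    "C": [0.3, 0.1, 0.25],
    "D": [0.2, 0.2, 0.25],
    "E": [0.4, 0.2, 0.25],
}

def window_scores(sequence, profile):
    """Probability of a match at every window position of the sequence."""
    width = len(next(iter(profile.values())))
    scores = []
    for start in range(len(sequence) - width + 1):
        p = 1.0
        for offset in range(width):
            p *= profile[sequence[start + offset]][offset]
        scores.append(p)
    return scores

def log_odds(p, width, background=0.25):
    """Log2 ratio of the match probability to a uniform background model."""
    return math.log2(p / background**width)

scores = window_scores("EACDA", PROFILE)   # hypothetical sequence
for start, p in enumerate(scores):
    print(start, round(p, 4), round(log_odds(p, 3), 2))
```

The first window (EAC) scores highest, 0.4 × 0.5 × 0.25 = 0.05, about 3.2 times more likely than under the uniform background.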
23 Using motifs
- Great for annotating a sequence with no strong homologs.
- InterPro is a uniform interface to many different motif methods and databases:
- ProSite
- PRINTS (fingerprints: multiple motifs)
- ProDom (like Pfam, but for domains)
- SMART (mobile domains)
24 InterPro example
25 InterPro example (cont'd)
- Then, match the pattern to a protein database
26 How do we create motifs?
- The general problem of inducing patterns from sequences is difficult.
- Classic formal-language result (Gold): context-free grammars cannot be induced from only positive examples.
- Many patterns are compatible with any MSA. How to decide which constituents are required?
- In the general case, we need positive examples (in the class) but also near misses: sequences that are similar but not members of the class.
- Not absolutely true for protein sequences.
27 Finding Consensus Sequences
- Based on local MSAs.
- ProSite consensus built from an MSA, guided by (Amos Bairoch's) biological intuition and tweaked by calculating sensitivity and specificity of the patterns over SwissProt.
- True (false) positives defined by Bairoch's understanding.
- Not an automatable procedure!
28 Creating profiles
- Given a local MSA, creating a profile is straightforward.
- Calculate the frequency of each amino acid at each position to create the profile.
- What to do about zero frequencies?
- Could be sampling errors, not real zero probabilities.
- Zero probabilities always make zero scores!
- Regularization
- Pseudocounts
- Dirichlet mixtures (blend in background frequencies)
29 Profile example
- MSA:
BBB
ABC
ABD
CBA
- Counts:
    1  2  3
A   2  0  1
B   1  4  1
C   1  0  1
D   0  0  1
- Counts after adding 1 pseudocount:
    1  2  3
A   3  1  2
B   2  5  2
C   2  1  2
D   1  1  2
- Profile from the raw counts:
     1     2     3
A   .5    0     .25
B   .25   1     .25
C   .25   0     .25
D   0     0     .25
- Profile from the pseudocounted counts:
     1     2     3
A   .375  .125  .25
B   .25   .625  .25
C   .25   .125  .25
D   .125  .125  .25
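The profile example can be reproduced with a short sketch. The function name `build_profile` and the uniform pseudocount (the same value added to every residue in every column) are assumptions for illustration; Dirichlet mixtures would replace the uniform value with background-informed ones.

```python
from collections import Counter

def build_profile(msa, alphabet, pseudocount=0.0):
    """Column-wise residue frequencies, with an optional uniform pseudocount."""
    length = len(msa[0])
    profile = {residue: [] for residue in alphabet}
    # Every column has one observation per row plus one pseudocount per letter.
    total = len(msa) + pseudocount * len(alphabet)
    for col in range(length):
        counts = Counter(seq[col] for seq in msa)
        for residue in alphabet:
            profile[residue].append((counts[residue] + pseudocount) / total)
    return profile

msa = ["BBB", "ABC", "ABD", "CBA"]
raw = build_profile(msa, "ABCD")
smoothed = build_profile(msa, "ABCD", pseudocount=1)
for residue in "ABCD":
    print(residue, raw[residue], smoothed[residue])
```

With the pseudocount, every column total becomes 8, so no residue keeps a zero probability and a single unseen residue no longer forces a window score of zero.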
30 Feature alphabets
- Amino acids can be grouped by their characteristics.
- Size, hydrophobicity, ionizability, etc.
- An amino acid is generally in more than one group.
- Can set different regularizers (pseudocounts) for each different feature.
- Most useful when there are multiple features (otherwise many amino acids get the same pseudocount).