Multiple sequence alignment, based on Larry Hunter's slides

1
Multiple sequence alignment (based on Larry
Hunter's slides)

Generalize our pairwise alignment of sequences to
include more than two homologous proteins.
Looking at more than two sequences gives us much
more information
Which amino acids are required? correlated?
Evolutionary/phylogenetic relationships

2
Phylogenetic Trees
Analysis of 20 samples of cytochrome c protein
sequences. Numbers represent nucleotide
substitutions in the gene for cytochrome c.

3
Sample MSA
FOS_RAT     MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_MOUSE   MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANF 60
FOS_CHICK   MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANF 60
FOSB_MOUSE  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54
FOSB_HUMAN  -MFQAFPGDYDS-GSRCSS-SPSAESQ--YLSSVDSFGSPPTAAASQE-CAGLGEMPGSF 54

FOS_RAT     IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVV 112
FOS_MOUSE   IPTVTAISTSPDLQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMV 112
FOS_CHICK   VPTVTAISTSPDLQWLVQPTLISSVAPSQ-------NRG-HPYGVPAPAPPAAYSRPAVL 112
FOSB_MOUSE  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTS----YSTPGLS 110
FOSB_HUMAN  VPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTS----YSTPGMS 110

FOS_RAT     KTMSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_MOUSE   KTVSGGRAQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 152
FOS_CHICK   KAP-GGRGQSIG--------------------RRGKVEQLSPEEEEKRRIRRERNKMAAA 151
FOSB_MOUSE  AYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170
FOSB_HUMAN  GYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAA 170

4
Optimal MSA

Use Dynamic Programming?
An optimal alignment algorithm exists, but it is
O(2^n l^n), where n is the number of sequences and l
is the length of the longest sequence.
10 sequences of length 100 take 2^10 x 100^10 ≈ 10^23
operations, around 1 million years at 3 GHz.
Exponential algorithms strike again.
So, approximation approaches?
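
The estimate on this slide can be checked with a few lines of Python. This is a rough sketch only: it assumes one elementary operation per clock cycle on a single 3 GHz core, which the slide does not state, and the function name is made up for illustration.

```python
def dp_msa_operations(n_sequences: int, length: int) -> int:
    """Operation count 2^n * l^n for exact dynamic-programming MSA."""
    return (2 ** n_sequences) * (length ** n_sequences)


# Assumption: one operation per cycle at 3 GHz (a simplification).
CLOCK_HZ = 3e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

ops = dp_msa_operations(10, 100)
years = ops / CLOCK_HZ / SECONDS_PER_YEAR

print(f"operations: {ops:.3e}")
print(f"years at 3 GHz: {years:.3e}")
```

Under these assumptions the result lands close to the slide's "around 1 million years".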

5
Progressive MSA

Start with pairwise alignments of closely related
sequences, and then add more distantly related
sequences one at a time.
Requires information (assumptions) about the
phylogenetic relationship a priori.
Can be estimated from all pairwise comparisons.
Give total MSA score based on sum of pairwise
scores
Perhaps weighted to reduce the influence of very
similar sequences.
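
A minimal sketch of a sum-of-pairs score with optional sequence weights follows. The match, mismatch, and gap values are placeholders rather than a real substitution matrix, and using the product of two sequence weights as the pair weight is an assumption made for illustration.

```python
from itertools import combinations


def pair_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Score one aligned column of a pairwise alignment (toy values)."""
    if a == "-" and b == "-":
        return 0          # gap aligned with gap costs nothing here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(msa, weights=None) -> float:
    """Sum the pairwise scores over all pairs of rows in the MSA."""
    weights = weights or [1.0] * len(msa)
    total = 0.0
    for (i, s), (j, t) in combinations(enumerate(msa), 2):
        pair_weight = weights[i] * weights[j]   # assumed weighting scheme
        total += pair_weight * sum(pair_score(a, b) for a, b in zip(s, t))
    return total


msa = ["ACGT", "ACGT", "A-GT"]
print(sum_of_pairs(msa))                    # unweighted
print(sum_of_pairs(msa, [0.5, 0.5, 1.0]))   # down-weight the two identical rows
```

Down-weighting the two identical rows keeps them from dominating the total, which is the point of the weighting mentioned on the slide.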

6
Gaps in Progressive MSAs

How to score gaps?
Want to align gaps with each other over all
sequences. A gap in a pairwise alignment that
matches a gap in another pairwise alignment
should cost less than introducing a totally new
gap.
Possible that a new gap could be made to match
an older one by shifting around the original
pairwise alignment, but at great computational
cost.
Change gap penalty near conserved domains of
various kinds (e.g. secondary structure,
hydrophobic regions)
CLUSTALW: http://www.ebi.ac.uk/clustalw

7
Greedy algorithms

Progressive MSA programs make the best alignment
of a new sequence with the existing ones they can
at the time, and then never revisit the decision.
Even if changing an old decision (e.g. the gaps)
could increase the score, this approach doesn't.
Approach is called greedy (because it takes the
best first), and is a common way to resolve
exponential problems.

8
Problems with progressive MSA

Depends crucially on the quality of the pairwise
alignments, particularly among the closest
matches.
No suitable resolution to the problem of gap
penalties over multiple sequences.
Works reasonably well for closely related
sequences. Even then, manual adjustments are
common.

9
Iterative MSA methods

The idea here is to start with a reasonable
approximation to the optimal MSA (e.g. by using a
progressive method) and then tweak it to
improve the score.
Various optimization techniques have been tried
here (e.g. GAs and simulated annealing).
Key is the scoring function for the whole MSA.
Also, what steps to take that are likely to
improve the score.

10
Block based methods

Another approach is to start with short local
alignments (sometimes called blocks) and then
reduce the problem to aligning the regions
between the blocks.
Divide and conquer is another common CS
approach to exponential problems.
How to find the blocks?
DIALIGN (local alignment methods)
DCA (divide and conquer alignments)
Tmsa (identify patterns and use them to define
blocks).

11
Databases of MSAs

Once they have been calculated, they can be saved
and shared
Pfam: database of protein families. Alignments of
large numbers of homologous proteins.
http://www.sanger.ac.uk/Software/Pfam/index.shtml
TIGRFAMs: database of protein families curated
for function, rather than homology.
http://www.tigr.org/TIGRFAMs/index.shtml

12
More web sites

Web sites offer multiple approaches to MSA.
Interfaces to multiple different programs
http://searchlauncher.bcm.tmc.edu/multi-align
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli
Main web-based MSA servers
http://www.ebi.ac.uk/clustalw
http://baboon.math.berkeley.edu/mavid/ (genomic
seqs)
See course website for many more listings

13
Protein motifs

Recall that local alignments can identify similar
regions in non-homologous proteins
These regions (sometimes called domains) often
have shared structure and/or function.
Example: zinc-finger DNA-binding motif

14
Zinc-finger DNA binding motif
15
Protein motifs

How to define them?
Consensus sequence
Regular expression
Profile (probability for each amino acid at each
position)

16
ProSite consensus sequences
17
Recognizing ProSite patterns

L14 ribosomal protein pattern:
[GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV]
Some matching sequences
GIIIACGHLIPQTNGACRTYILNDRVV
GVLLWQPKHCSNAADGAWAWFAATAAVL
ALIVEANIIILSISGRATTFHATSAVI
ProSite patterns can be translated into regular
expressions, although the bounded length patterns
(e.g. [LIV](3,5)) are unwieldy to write down as
regexps.
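
The translation can be sketched in Python as below. This is an illustrative converter, not ProSite's own tooling: `prosite_to_regex` is a made-up name, it covers only the elements used on these slides (residue classes, exclusions, `x`, repeat counts, and the `<` / `>` anchors), and it relies on Python's `re` supporting bounded quantifiers such as `{9,10}`.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple ProSite-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (9,10).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        # [ABC] classes and single letters are already valid regex syntax.
        quantifier = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + quantifier + suffix)
    return "".join(parts)


L14 = "[GA]-[LIV](3)-x(9,10)-[DNS]-G-x(4)-[FY]-x(2)-[NT]-x(2)-V-[LIV]"
l14_regex = re.compile(prosite_to_regex(L14))

for seq in ("GIIIACGHLIPQTNGACRTYILNDRVV",
            "GVLLWQPKHCSNAADGAWAWFAATAAVL",
            "ALIVEANIIILSISGRATTFHATSAVI"):
    print(seq, bool(l14_regex.search(seq)))
```

All three matching sequences listed on the slide are found by the generated expression.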

18
Example of ProSite

[AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}.

<A-x-[ST](2)-x(0,1)-V
This pattern, which must be in the N-terminal of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val.

<{C}*>
This pattern describes all sequences which do not contain any Cysteines.

IIRIFHLRNI
This pattern describes all sequences which contain the subsequence 'IIRIFHLRNI'.

19
Regular expressions

Widely used in computer science. A core feature of
the Perl language (see also BioPerl).
For proteins, a language like ProSite patterns is
more intuitive, but often equivalent.

20
Profiles

Rather than identifying only the consensus
(i.e. most common) amino acid at a particular
location, we can assign a probability to each
amino acid in each position of the domain.
Example

     1     2     3
A   .1    .5    .25
C   .3    .1    .25
D   .2    .2    .25
E   .4    .2    .25

21
Applying a profile

Calculate a score (probability of match) for a
profile at each position in a sequence by
multiplying the individual probabilities.
Slide a window the width of the profile along the
sequence.
The probability can be transformed into a
significance value under a random distribution
assumption.
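
A minimal sliding-window sketch, using the three-position example profile over the residues A, C, D, E from the Profiles slide. The function names, the zero score for residues missing from a column, and the uniform background of 0.25 per residue in `log_odds` are assumptions made for illustration.

```python
import math

# Example profile from the Profiles slide: one dict per position.
profile = [
    {"A": 0.1, "C": 0.3, "D": 0.2, "E": 0.4},
    {"A": 0.5, "C": 0.1, "D": 0.2, "E": 0.2},
    {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25},
]


def scan(sequence: str, profile) -> list:
    """Return the match probability of the profile at every window start."""
    width = len(profile)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Multiply the per-position probabilities; unknown residues score 0.
        scores.append(math.prod(col.get(res, 0.0)
                                for col, res in zip(profile, window)))
    return scores


def log_odds(p: float, width: int, background: float = 0.25) -> float:
    """Log2 ratio of profile probability to an assumed uniform background."""
    return math.log2(p / background ** width) if p > 0 else float("-inf")


scores = scan("EACDA", profile)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, "best window starts at", best)
print(log_odds(scores[best], len(profile)))
```

A positive log-odds value means the window fits the profile better than the assumed random background; a proper significance estimate would need a realistic background distribution.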

23
Using motifs

Great for annotating a sequence with no strong
homologs.
InterPro is a uniform interface to many
different motif methods and databases
ProSite
PRINTS (fingerprints: multiple motifs)
ProDom (like Pfam, but for domains)
SMART (mobile domains)

24
InterPro example
25
InterPro example (cont'd)

Then, match the pattern to a protein database

26
How do we create motifs?

General problem of inducing patterns from
sequences is difficult
Classic formal-language result (Gold): context-free
grammars cannot be induced from positive
examples alone.
Many patterns are compatible with any MSA. How
to decide which constituents are required?
In the general case, we need positive examples (in
the class) but also near misses: sequences that
are similar but not members of the class.
Not absolutely true for protein sequences.

27
Finding Consensus Sequences

Based on local MSAs.
ProSite consensus patterns are built from MSAs
using (Amos Bairoch's) biological intuition, then
tweaked by calculating the sensitivity and
specificity of the patterns over SwissProt.
True (false) positives are defined by Bairoch's
understanding.
Not an automatable procedure!
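
For reference, the sensitivity and specificity used to tune a pattern follow the standard definitions sketched below. The counts are hypothetical and serve only to show the arithmetic; they are not SwissProt results.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real family members that the pattern matches."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-members that the pattern correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical scan of a database with 100 members and 10,000 non-members.
sens = sensitivity(true_pos=95, false_neg=5)
spec = specificity(true_neg=9990, false_pos=10)
print(sens, spec)
```

Loosening a pattern tends to raise sensitivity at the expense of specificity, which is the trade-off being tweaked by hand.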

28
Creating profiles

Given a local MSA, creating a profile is
straightforward.
Calculate frequency of each amino acid at each
position to create profile.
What to do about zero frequencies?
Could be sampling errors, not real zero
probabilities.
Zero probabilities always make zero scores!
Regularization
pseudocounts
Dirichlet mixtures (blend in background
frequencies)

29
Profile example

MSA:
BBB
ABC
ABD
CBA

Counts:
     1     2     3
A    2     0     1
B    1     4     1
C    1     0     1
D    0     0     1

Counts after adding 1 pseudocount:
     1     2     3
A    3     1     2
B    2     5     2
C    2     1     2
D    1     1     2

Profile from raw counts:
     1     2     3
A   .5     0    .25
B   .25    1    .25
C   .25    0    .25
D    0     0    .25

Profile from counts with pseudocounts:
     1      2      3
A   .375   .125   .25
B   .25    .625   .25
C   .25    .125   .25
D   .125   .125   .25
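
The profile example can be reproduced with a short sketch. The function name and signature are made up for illustration, the same pseudocount is added to every residue, and gap characters are not handled.

```python
from collections import Counter


def build_profile(msa, alphabet, pseudocount=0.0):
    """Per-column residue frequencies, optionally with a flat pseudocount."""
    profile = []
    for column in zip(*msa):            # iterate over alignment columns
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({res: (counts[res] + pseudocount) / total
                        for res in alphabet})
    return profile


msa = ["BBB", "ABC", "ABD", "CBA"]
raw = build_profile(msa, "ABCD")
smoothed = build_profile(msa, "ABCD", pseudocount=1)

for name, prof in (("raw", raw), ("add-1", smoothed)):
    print(name)
    for res in "ABCD":
        print(" ", res, [round(col[res], 3) for col in prof])
```

With the pseudocount, no entry is zero any more, so a single unseen residue no longer forces the whole window score to zero.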
30
Feature alphabets

Amino acids can be grouped by their
characteristics
Size, hydrophobicity, ionizability, etc.
An amino acid is generally in more than one group
Can set different regularizers (pseudocounts) for
each different feature
Most useful when there are multiple features
(otherwise many amino acids get same pseudocount)
