Title: Multiple Sequence Alignment
1Multiple Sequence Alignment
2An alignment of heads
3Alignment can be easy or difficult
Easy
Difficult due to insertions or deletions (indels)
4Homology Definition
- Homology: similarity that is the result of inheritance from a common ancestor. The identification and analysis of homologies is central to phylogenetic systematics.
- An alignment is a hypothesis of positional homology between bases/amino acids.
5Multiple Sequence Alignment- Goals
- To generate a concise, information-rich summary of sequence data.
- Sometimes used to illustrate the dissimilarity between a group of sequences.
- Alignments can be treated as models that can be used to test hypotheses.
- Does this model of events accurately reflect known biological evidence?
8Alignment of 16S rRNA can be guided by secondary
structure
Alignment of 16S rRNA sequences from different
bacteria
9Protein Alignment may be guided by Tertiary
Structure Interactions
Homo sapiens DjlA protein
Escherichia coli DjlA protein
10Multiple Sequence Alignment- Methods
- 3 main methods of alignment
- Manual
- Automatic
- Combined
11Manual Alignment - reasons
- Might be carried out because
- Alignment is easy.
- There is some extraneous information (structural).
- Automated alignment methods have encountered the local minimum problem.
- An automated alignment method can be improved.
12Dynamic programming
- 2 methods
- Dynamic programming
- Consider 2 protein sequences of 100 amino acids in length.
- If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 seconds to align 4 sequences, etc.
- More time than the universe has existed to align 20 sequences exhaustively.
- Progressive alignment
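The growth in running time can be checked with a few lines of arithmetic. This is a rough sketch that takes the slide's premise (100^n seconds for n sequences of length 100) at face value; the age of the universe used here, about 13.8 billion years, is an assumed approximate figure and not from the slides.

```python
# Rough arithmetic only: exhaustive alignment time under the slide's premise.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # assumed approximate value


def exhaustive_seconds(n_sequences, length=100):
    """Seconds needed if aligning n sequences costs length**n seconds."""
    return length ** n_sequences


for n in (2, 3, 4, 20):
    years = exhaustive_seconds(n) / SECONDS_PER_YEAR
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"{n:2d} sequences: {years:.3g} years ({ratio:.3g} x age of universe)")
```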
13Progressive Alignment
- Devised by Feng and Doolittle in 1987.
- Essentially a heuristic method and as such is not guaranteed to find the optimal alignment.
- Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) = n(n-1)/2 pairwise alignments as a starting point.
- Most successful implementation is Clustal (Des Higgins).
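As a quick check on the number of starting pairwise alignments, the sketch below compares the term-by-term sum with the closed form n(n-1)/2 and enumerates the pairs. The five sequence names are borrowed from the ClustalW overview slide purely for illustration.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def pairwise_count_by_sum(n):
    """Same quantity as the sum (n-1) + (n-2) + ... + 1."""
    return sum(range(1, n))


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
pairs = list(combinations(names, 2))
print(len(pairs), pairwise_count(len(names)))
```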
14Overview of ClustalW Procedure
CLUSTAL W
Quick pairwise alignment: calculate distance matrix
             1    2    3    4    5
Hbb_Human 1  -
Hbb_Horse 2  .17  -
Hba_Human 3  .59  .60  -
Hba_Horse 4  .59  .59  .13  -
Myg_Whale 5  .77  .77  .75  .75  -
Neighbor-joining tree (guide tree)
[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale]
Progressive alignment following guide tree
[Figure annotation: alpha-helices]
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
15ClustalW- Pairwise Alignments
- First perform all possible pairwise alignments between each pair of sequences. There are (n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2 possibilities.
- Calculate the distance between each pair of sequences based on these isolated pairwise alignments.
- Generate a distance matrix.
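The three steps above can be sketched as follows. This is a simplified illustration, not ClustalW's actual code: the distance is assumed to be the fraction of differing residues over the columns where neither sequence has a gap, and the input rows are assumed to be already aligned to each other (here, short excerpts from the overview slide, so the values will not reproduce the distances shown there).

```python
from itertools import combinations


def fractional_distance(row_a, row_b):
    """1 - (identical columns / compared columns), skipping gapped columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue  # columns with a gap are not compared
        compared += 1
        if x == y:
            identical += 1
    return 1.0 if compared == 0 else 1.0 - identical / compared


def distance_matrix(aligned):
    """Distance for every unordered pair of names in the dict `aligned`."""
    return {
        (a, b): fractional_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }


rows = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
}
for pair, d in distance_matrix(rows).items():
    print(pair, round(d, 2))
```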
16Path Graph for aligning two sequences.
17Possible alignment
- Scoring Scheme
- Match 1
- Mismatch 0
- Indel -1
Step scores along this path: 1, 1, 0, 1, 0, -1
Score for this path: 2
18Alignment using this path
GATTC-
GAATTC
Step scores: 1, 1, 0, 1, 0, -1
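Scoring a given alignment under this scheme (match 1, mismatch 0, indel -1) is a column-by-column sum. The function name below is an illustrative choice; the sketch reproduces the score of 2 for the path shown here.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=-1):
    """Sum the per-column scores of two equal-length aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # a residue opposite a gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GATTC-", "GAATTC"))  # 1 + 1 + 0 + 1 + 0 - 1 = 2
```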
19Optimal Alignment 1
Alignment using this path:
GA-TTC
GAATTC
Step scores: 1, 1, -1, 1, 1, 1
Alignment score: 4
20Optimal Alignment 2
Alignment using this path:
G-ATTC
GAATTC
Step scores: 1, -1, 1, 1, 1, 1
Alignment score: 4
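The best path through the graph can be found by dynamic programming rather than by trying every path. The following is a minimal sketch of a standard global alignment (Needleman-Wunsch style) using the scoring scheme from the slides; the slides do not name the algorithm, so treat that label and the function name as assumptions. Because two alignments tie at score 4, which one the traceback returns depends on its tie-breaking order.

```python
def global_align(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel          # a aligned against leading gaps
    for j in range(1, m + 1):
        score[0][j] = j * indel          # b aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)
    # Trace back one optimal path, preferring the diagonal move.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = global_align("GATTC", "GAATTC")
print(best)
print(top)
print(bottom)
```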
21ClustalW- Guide Tree
- Generate a Neighbor-Joining guide tree from these pairwise distances.
- This guide tree gives the order in which the progressive alignment will be carried out.
22Neighbor joining method
- The neighbor joining method is a greedy heuristic which joins, at each step, the two closest sub-trees that are not already joined.
- It is based on the minimum evolution principle.
- One of the important concepts in the NJ method is neighbors, which are defined as two taxa that are connected by a single node in an unrooted tree.
[Figure: taxa A and B connected through Node 1]
23Distance Matrix
What is required for the Neighbour joining method?
Distance matrix
24First Step
PAM distance 3.3 (Human - Monkey) is the minimum.
So we'll join Human and Monkey to MonHum and
we'll calculate the new distances.
[Figure: new node Mon-Hum joining Monkey and Human; remaining taxa Spinach, Mosquito, Rice]
25Calculation of New Distances
After we have joined two species in a subtree we have to compute the distances from every other node to the new subtree. We do this with a simple average of distances:
Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human]) / 2 = (90.8 + 86.3) / 2 = 88.55
[Figure: tree with nodes Mon-Hum, Monkey, Human, Spinach]
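The join-then-average step can be sketched as below. This follows the simplified averaging rule described on the slides, not a complete neighbor-joining implementation (which also corrects each distance for how far a taxon is from all the others). Only the three distances quoted in the slides are used; the function name and the dictionary layout are illustrative assumptions.

```python
def join_closest(dist, new_name):
    """Join the closest pair and average its distances to every other taxon.

    `dist` maps frozenset({a, b}) -> distance. Returns (joined pair, new dist).
    """
    pair = min(dist, key=dist.get)                 # smallest distance wins
    a, b = sorted(pair)
    others = {x for key in dist for x in key} - pair
    merged = {key: d for key, d in dist.items() if not (key & pair)}
    for o in others:
        d_a = dist[frozenset((o, a))]
        d_b = dist[frozenset((o, b))]
        merged[frozenset((o, new_name))] = (d_a + d_b) / 2  # simple average
    return pair, merged


dist = {
    frozenset(("Human", "Monkey")): 3.3,
    frozenset(("Spinach", "Monkey")): 90.8,
    frozenset(("Spinach", "Human")): 86.3,
}
pair, merged = join_closest(dist, "Mon-Hum")
print(sorted(pair))
print(round(merged[frozenset(("Spinach", "Mon-Hum"))], 2))
```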
26Next Cycle
[Figure: Mosquito joined to Mon-Hum, forming Mos-(Mon-Hum); remaining taxa Spinach, Rice]
27Penultimate Cycle
[Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]
28Last Joining
[Figure: final join (Spin-Rice)-(Mos-(Mon-Hum))]
29Unrooted Neighbor-Joining Tree
[Figure: unrooted tree with leaves Human, Monkey, Mosquito, Spinach, Rice]
30Multiple Alignment- First pair
- Align the two most closely related sequences first.
- This alignment is then fixed and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.
31ClustalW- Decision time
- Consult the guide tree to see what alignment is performed next.
- Option 1: align a third sequence to the first two.
- Option 2: align two entirely different sequences to each other.
32ClustalW- Alternative 1
If the situation arises where a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, each of the two entities (the existing pair and the new sequence) is treated as a single sequence.
33ClustalW- Alternative 2
- If, on the other hand, two separate sequences
have to be aligned together, then the first
pairwise alignment is placed to one side and the
pairwise alignment of the other two is carried
out.
34ClustalW- Progression
- The alignment is progressively built up in this
way, with each step being treated as a pairwise
alignment, sometimes with each member of a pair
having more than one sequence.
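One consequence of treating an aligned group as a single unit is that any new gap goes into every member at the same column, so their relative alignment never changes. A minimal sketch of that bookkeeping follows; the function name is hypothetical, and the two rows are the first two sequences from the overview slide with their gaps removed.

```python
def insert_gap_column(group, column, width=1):
    """Insert `width` gap characters at `column` in every row of a group."""
    if any(len(row) != len(group[0]) for row in group):
        raise ValueError("rows in an aligned group must have equal length")
    return [row[:column] + "-" * width + row[column:] for row in group]


fixed_pair = ["PEEKSAVTALWGKVNVDEVGG",
              "GEEKAAVLALWDKVNEEEVGG"]
# Aligning a longer third sequence requires a two-column gap after column 15.
for row in insert_gap_column(fixed_pair, 15, width=2):
    print(row)
```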
35ClustalW-Good points/Bad points
- Advantages
- Speed.
- Disadvantages
- No objective function.
- No way of quantifying whether or not the alignment is good.
- No way of knowing if the alignment is correct.
36ClustalW-Local Minimum
- Potential problems
- Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure.
- Arbitrary alignment.
37Increasing the sophistication of the alignment process.
- Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?
- Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?
39ClustalW- Caveats
- Sequence weighting
- Varying substitution matrices
- Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.
- Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.
40ClustalW- User-supplied values
- Two penalties are set by the user (there are default values, but you should know that it is possible to change these).
- GOP: Gap Opening Penalty is the cost of opening a gap in an alignment.
- GEP: Gap Extension Penalty is the cost of extending this gap.
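How the two penalties combine can be illustrated with a small sketch. The formula below (one opening charge plus an extension charge per gapped position) is one common convention and is an assumption here, as are the numeric values, which are illustrative and not ClustalW defaults.

```python
def gap_cost(length, gop, gep):
    """Cost of a single gap of `length` positions (assumed convention)."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.5   # hypothetical values for illustration only
one_long_gap = gap_cost(3, GOP, GEP)
three_short_gaps = 3 * gap_cost(1, GOP, GEP)
print(one_long_gap, three_short_gaps)
```

With a GOP much larger than the GEP, one long gap is cheaper than several short ones covering the same number of positions.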
41Position-Specific gap penalties
- Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences.
- The GOP is manipulated in a position-specific manner, so that it can vary over the sequences.
- If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply.
- This makes gaps more likely at positions where gaps already exist.
42Discouraging too many gaps
- If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap.
- This discourages gaps that are too close together.
- At any position within a run of hydrophilic residues, the GOP is decreased.
- These runs usually indicate loop regions in protein structures.
- A run of 5 hydrophilic residues is considered to be a hydrophilic stretch.
- The default hydrophilic residues are
- D, E, G, K, N, Q, P, R, S
- But this can be changed by the user.
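The two position tests described above can be sketched as simple flagging functions. This illustrates the rules as stated on the slides and is not ClustalW's code: the function names are hypothetical, a "run of 5" is read as five or more consecutive residues, and "within 8 residues" is read as a column distance of at most 8. How much the GOP is raised or lowered is left out because the slides do not say.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues from the slide


def hydrophilic_positions(seq, min_run=5):
    """Flag positions lying inside a run of >= min_run hydrophilic residues."""
    flagged = [False] * len(seq)
    start = None
    for i, res in enumerate(seq + "#"):      # '#' is a sentinel ending any run
        if res in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for k in range(start, i):
                    flagged[k] = True
            start = None
    return flagged


def near_existing_gap(column_has_gap, window=8):
    """Flag ungapped columns within `window` columns of an existing gap."""
    gap_cols = [i for i, g in enumerate(column_has_gap) if g]
    return [
        (not has_gap) and any(abs(i - g) <= window for g in gap_cols)
        for i, has_gap in enumerate(column_has_gap)
    ]


print(hydrophilic_positions("AAADEGKNAAA"))
```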
43Divergent Sequences
- The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align.
- It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned).
- The user has the choice of setting a cutoff (default is 40% identity).
- This will delay the alignment until the others have been aligned.
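A sketch of the delay rule, using the distance matrix from the overview slide: each sequence's average identity to the others is compared with the cutoff, and sequences below it are set aside for later. It assumes that percent identity can be taken as 100 x (1 - distance), which the slides do not state, and the function names are hypothetical.

```python
DIST = {
    ("Hbb_Human", "Hbb_Horse"): 0.17,
    ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Horse", "Hba_Human"): 0.60,
    ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Hba_Horse"): 0.59,
    ("Hba_Human", "Hba_Horse"): 0.13,
    ("Hbb_Human", "Myg_Whale"): 0.77,
    ("Hbb_Horse", "Myg_Whale"): 0.77,
    ("Hba_Human", "Myg_Whale"): 0.75,
    ("Hba_Horse", "Myg_Whale"): 0.75,
}


def mean_percent_identity(dist):
    """Average identity of each sequence to all others (assumed 1 - distance)."""
    names = sorted({n for pair in dist for n in pair})
    result = {}
    for name in names:
        ds = [d for pair, d in dist.items() if name in pair]
        result[name] = 100.0 * (1.0 - sum(ds) / len(ds))
    return result


def split_by_cutoff(identity, cutoff=40.0):
    """Return (align now, delay until later) lists of sequence names."""
    now = [n for n, v in identity.items() if v >= cutoff]
    later = [n for n, v in identity.items() if v < cutoff]
    return now, later


now, later = split_by_cutoff(mean_percent_identity(DIST))
print("align first:", now)
print("delayed:", later)
```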
44Advice on progressive alignment
- Progressive alignment is a mathematical process that is completely independent of biological reality.
- Can be a very good estimate.
- Can be an impossibly poor estimate.
- Requires user input and skill.
- Treat cautiously.
- Can be improved by eye (usually).
- Often helps to have colour-coding.
- Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not.
- For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable.
45Alignment of protein-coding DNA sequences
- It is not very sensible to align the DNA sequences of protein-coding genes directly.
Unaligned:
ATGCTGTTAGGG
ATGCTCGTAGGG
Aligned as DNA:
ATGCT-GTTAGGG
ATGCTCGT-AGGG
The result might be highly implausible and might not reflect what is known about biological processes. It is much more sensible to translate the sequences to their corresponding amino acid sequences, align these protein sequences and then put the gaps in the DNA sequences according to where they are found in the amino acid alignment.
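The translate-align-then-thread procedure can be sketched as follows. The protein alignment step itself is left out (any aligner could supply it); the sketch covers translation with the standard genetic code and copying gaps from an aligned protein back into the DNA as whole codons. The function names and the gapped example `ML-G` are illustrative assumptions.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def thread_gaps(aligned_protein, dna):
    """Place a three-base gap in the DNA wherever the protein has a gap."""
    out, pos = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)


print(translate("ATGCTGTTAGGG"), translate("ATGCTCGTAGGG"))
print(thread_gaps("ML-G", "ATGCTGGGG"))
```

For the two sequences on the slide, the protein alignment needs no gaps at all, so the codons stay intact instead of being split by single-base indels.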
46Manual Alignment- software
- GDE: The Genetic Data Environment (UNIX)
- CINEMA: Java applet, available from http://www.biochem.ucl.ac.uk
- Seqapp/Seqpup: Mac/PC/UNIX, available from http://iubio.bio.indiana.edu
- SeAl for Macintosh, available from http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
- BioEdit for PC, available from http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html