Multiple Sequence Alignment - PowerPoint PPT Presentation

Title:

Multiple Sequence Alignment

Description:

To generate a concise, information-rich summary of sequence data. ... More time than the universe has existed to align 20 sequences exhaustively. ... – PowerPoint PPT presentation

Slides: 47
Provided by: marti289

Transcript and Presenter's Notes

Title: Multiple Sequence Alignment


1
Multiple Sequence Alignment
2
An alignment of heads
3
Alignment can be easy or difficult
Easy
Difficult due to insertions or deletions
(indels)
4
Homology Definition
  • Homology: similarity that is the result of
    inheritance from a common ancestor. The
    identification and analysis of homologies is
    central to phylogenetic systematics.
  • An alignment is a hypothesis of positional
    homology between bases or amino acids.

5
Multiple Sequence Alignment- Goals
  • To generate a concise, information-rich summary
    of sequence data.
  • Sometimes used to illustrate the dissimilarity
    between a group of sequences.
  • Alignments can be treated as models that can be
    used to test hypotheses.
  • Does this model of events accurately reflect
    known biological evidence?

6
(No Transcript)
7
(No Transcript)
8
Alignment of 16S rRNA can be guided by secondary
structure
Alignment of 16S rRNA sequences from different
bacteria
9
Protein Alignment may be guided by Tertiary
Structure Interactions
Homo sapiens DjlA protein
Escherichia coli DjlA protein
10
Multiple Sequence Alignment- Methods
  • 3 main methods of alignment
  • Manual
  • Automatic
  • Combined

11
Manual Alignment - reasons
  • Might be carried out because
  • Alignment is easy.
  • There is some extraneous information
    (structural).
  • Automated alignment methods have encountered the
    local minimum problem.
  • An automated alignment method can be improved.

12
Dynamic programming
  • 2 methods
  • Dynamic programming
  • Consider 2 protein sequences of 100 amino acids
    in length.
  • If it takes 100^2 seconds to exhaustively align
    these sequences, then it will take 100^3 seconds
    to align 3 sequences, 100^4 to align 4
    sequences, etc.
  • More time than the universe has existed to align
    20 sequences exhaustively.
  • Progressive alignment
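
To put the growth of the exhaustive approach in numbers, here is a rough sketch. It assumes the slide's time model (100^n seconds for n sequences of length 100). The age of the universe, about 13.8 billion years, is an outside figure and not taken from the slides; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate outside figure, not from the slides


def exhaustive_seconds(n_sequences, length=100):
    """Time model from the slide: length**n seconds for n sequences."""
    return length ** n_sequences


def exhaustive_years(n_sequences, length=100):
    """The same estimate converted to years."""
    return exhaustive_seconds(n_sequences, length) / SECONDS_PER_YEAR


for n in (2, 3, 4, 20):
    years = exhaustive_years(n)
    longer = years > AGE_OF_UNIVERSE_YEARS
    print(f"{n} sequences: {years:.3g} years, longer than the universe: {longer}")
```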

13
Progressive Alignment
  • Devised by Feng and Doolittle in 1987.
  • Essentially a heuristic method and as such is not
    guaranteed to find the optimal alignment.
  • Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) = n(n-1)/2
    pairwise alignments as a starting point.
  • Most successful implementation is Clustal (Des
    Higgins)

14
Overview of ClustalW Procedure
CLUSTAL W

Quick pairwise alignment: calculate distance matrix

               1    2    3    4    5
Hbb_Human  1   -
Hbb_Horse  2  .17   -
Hba_Human  3  .59  .60   -
Hba_Horse  4  .59  .59  .13   -
Myg_Whale  5  .77  .77  .75  .75   -

Neighbor-joining tree (guide tree)
(Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale.)

Progressive alignment following guide tree
(Figure annotation: alpha-helices marked above the alignment.)

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
15
ClustalW- Pairwise Alignments
  • First perform all possible pairwise alignments
    between each pair of sequences. There are
    (n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2 possibilities.
  • Calculate the distance between each pair of
    sequences based on these isolated pairwise
    alignments.
  • Generate a distance matrix.
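
The three steps above can be sketched roughly as follows. This is a simplification under stated assumptions: the distance is the fraction of differing positions between fragments that are already aligned (ClustalW runs a real pairwise alignment for each pair first), the sequences are the short fragments from the overview slide, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of compared columns that differ; columns with a gap are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Return {(name_i, name_j): distance} for every unordered pair of sequences."""
    return {(m, n): p_distance(seqs[m], seqs[n]) for m, n in combinations(seqs, 2)}


fragments = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

matrix = distance_matrix(fragments)
n = len(fragments)
print(len(matrix), n * (n - 1) // 2)  # number of pairs, counted two ways
for pair, d in sorted(matrix.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))
```

Because only a 23-column fragment is compared, the values need not match the distance matrix on the overview slide, which was presumably computed from the full sequences.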

16
Path Graph for aligning two sequences.
17
Possible alignment
  • Scoring Scheme
  • Match 1
  • Mismatch 0
  • Indel -1

Scores along this path: 1, 1, 0, 1, 0, -1
Score for this path: 2
18
Alignment using this path
GATTC-
GAATTC
Column scores: 1, 1, 0, 1, 0, -1
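
As a small sketch of how the scoring scheme is applied to this alignment, each column can be scored and the scores summed. The scheme (match 1, mismatch 0, indel -1) is the one from the slide; the function name is illustrative.

```python
MATCH, MISMATCH, INDEL = 1, 0, -1


def column_scores(top, bottom):
    """Score each column of a pairwise alignment under the slide's scheme."""
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(INDEL)
        elif a == b:
            scores.append(MATCH)
        else:
            scores.append(MISMATCH)
    return scores


scores = column_scores("GATTC-", "GAATTC")
print(scores, sum(scores))
```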
19
Optimal Alignment 1
Alignment using this path
GA-TTC
GAATTC
Column scores: 1, 1, -1, 1, 1, 1
Alignment score: 4
20
Optimal Alignment 2
Alignment using this path
G-ATTC
GAATTC
Column scores: 1, -1, 1, 1, 1, 1
Alignment score: 4
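
The best path through the graph can be found with dynamic programming instead of trying every path. The following is a minimal sketch of a standard global alignment (Needleman-Wunsch style) using the slide's scoring scheme. The function name and the tie-breaking order in the traceback are choices made here, so only one of the equally good alignments is returned.

```python
def global_alignment(a, b, match=1, mismatch=0, indel=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel          # a prefix of `a` aligned only to gaps
    for j in range(1, cols):
        score[0][j] = j * indel          # a prefix of `b` aligned only to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + indel,
                              score[i][j - 1] + indel)

    # Trace one optimal path back from the bottom-right corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top_row, bottom_row = global_alignment("GATTC", "GAATTC")
print(best)
print(top_row)
print(bottom_row)
```

For GATTC and GAATTC the score is 4, and the returned rows correspond to one of the two optimal alignments shown on the slides.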
21
ClustalW- Guide Tree
  • Generate a Neighbor-Joining guide tree from
    these pairwise distances.
  • This guide tree gives the order in which the
    progressive alignment will be carried out.

22
Neighbor joining method
  • The neighbor joining method is a greedy heuristic
    which, at each step, joins the two closest
    sub-trees that are not already joined.
  • It is based on the minimum evolution principle.
  • One of the important concepts in the NJ method is
    neighbors, which are defined as two taxa that are
    connected by a single node in an unrooted tree.

(Figure: taxa A and B connected through a single node, Node 1.)
23
Distance Matrix
What is required for the Neighbour joining method?
Distance matrix
24
First Step
PAM distance 3.3 (Human - Monkey) is the minimum.
So we'll join Human and Monkey to MonHum and
we'll calculate the new distances.
(Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice not yet joined.)
25
Calculation of New Distances
After we have joined two species in a subtree we
have to compute the distances from every other
node to the new subtree. We do this with a simple
average of distances:

Dist(Spinach, MonHum) = (Dist(Spinach, Monkey) + Dist(Spinach, Human)) / 2
                      = (90.8 + 86.3) / 2 = 88.55

(Figure: Spinach connected to the Mon-Hum subtree of Monkey and Human.)
26
Next Cycle
(Figure: Mosquito joined to Mon-Hum as Mos-(Mon-Hum); Spinach and Rice not yet joined.)
27
Penultimate Cycle
(Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum).)
28
Last Joining
(Figure: Spin-Rice joined to Mos-(Mon-Hum), giving (Spin-Rice)-(Mos-(Mon-Hum)).)
29
Unrooted Neighbor-Joining Tree
(Figure: unrooted tree with leaves Human, Monkey, Mosquito, Spinach and Rice.)
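
The join-and-average cycle walked through on these slides can be sketched as below. Two caveats: the transcript does not include the full species distance matrix, so the sketch uses the globin matrix from the ClustalW overview slide instead, and it follows the simplified procedure shown here (join the closest pair, then average distances). Full neighbor joining selects pairs with a rate-corrected criterion rather than the raw minimum distance. All names are illustrative.

```python
from itertools import combinations

NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
# Lower triangle of the distance matrix from the ClustalW overview slide.
LOWER = [[], [0.17], [0.59, 0.60], [0.59, 0.59, 0.13], [0.77, 0.77, 0.75, 0.75]]
DIST = {frozenset((NAMES[i], NAMES[j])): LOWER[i][j]
        for i in range(len(NAMES)) for j in range(i)}


def build_guide_tree(names, dist):
    """Join the two closest clusters repeatedly, averaging distances after each join.

    Returns (tree, joins): a nested tuple and the list of joins in the order made.
    """
    d = dict(dist)
    clusters = list(names)
    joins = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        joins.append(merged)
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Simple average, as in Dist(Spinach, MonHum) on the slide.
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters.append(merged)
    return clusters[0], joins


tree, joins = build_guide_tree(NAMES, DIST)
for step, pair in enumerate(joins, start=1):
    print(step, pair)
```

With this matrix the two alpha chains are joined first, then the two beta chains, then the two globin pairs, and myoglobin last.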
30
Multiple Alignment- First pair
  • Align the two most closely-related sequences
    first.
  • This alignment is then fixed and will never
    change. If a gap is to be introduced
    subsequently, then it will be introduced in the
    same place in both sequences, but their relative
    alignment remains unchanged.
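
A minimal sketch of what "introduced in the same place in both sequences" means in practice: a later gap is added as a whole column, so the columns already aligned stay paired. The function name and the choice of column are hypothetical; the pair is one of the alignments from the path-graph slides.

```python
def insert_gap_column(group, position):
    """Insert a gap at the same column in every sequence of an aligned group."""
    return [seq[:position] + "-" + seq[position:] for seq in group]


fixed_pair = ["GA-TTC", "GAATTC"]
widened = insert_gap_column(fixed_pair, 4)
for row in widened:
    print(row)

# Dropping the new column again recovers the original pair unchanged.
restored = [row[:4] + row[5:] for row in widened]
```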

31
ClustalW- Decision time
  • Consult the guide tree to see what alignment is
    performed next.
  • Align a third sequence to the first two
  • Or
  • Align two entirely different sequences to each
    other.

Option 1
Option 2
32
ClustalW- Alternative 1
If the situation arises where a third sequence is
aligned to the first two, then when a gap has to
be introduced to improve the alignment, the
existing pairwise alignment and the new sequence
are each treated as a single sequence.

33
ClustalW- Alternative 2
  • If, on the other hand, two separate sequences
    have to be aligned together, then the first
    pairwise alignment is placed to one side and the
    pairwise alignment of the other two is carried
    out.


34
ClustalW- Progression
  • The alignment is progressively built up in this
    way, with each step being treated as a pairwise
    alignment, sometimes with each member of a pair
    having more than one sequence.

35
ClustalW-Good points/Bad points
  • Advantages
  • Speed.
  • Disadvantages
  • No objective function.
  • No way of quantifying whether or not the
    alignment is good
  • No way of knowing if the alignment is correct.

36
ClustalW-Local Minimum
  • Potential problems
  • Local minimum problem. If an error is introduced
    early in the alignment process, it is impossible
    to correct this later in the procedure.
  • Arbitrary alignment.

37
Increasing the sophistication of the alignment
process.
  • Should we treat all the sequences in the same
    way? - even though some sequences are
    closely-related and some sequences are distant
    relatives.
  • Should we treat all positions in the sequences as
    though they were the same? - even though they
    might have different functions and different
    locations in the 3-dimensional structure.

38
(No Transcript)
39
ClustalW- Caveats
  • Sequence weighting
  • Varying substitution matrices
  • Residue-specific gap penalties and reduced
    penalties in hydrophilic regions (external
    regions of protein sequences), encourage gaps in
    loops rather than in core regions.
  • Positions in early alignments where gaps have
    been opened receive locally reduced gap penalties
    to encourage openings in subsequent alignments

40
ClustalW- User-supplied values
  • Two penalties are set by the user (there are
    default values, but you should know that it is
    possible to change these).
  • GOP- Gap Opening Penalty is the cost of opening a
    gap in an alignment.
  • GEP- Gap Extension Penalty is the cost of
    extending this gap.
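
As a rough illustration of how the two penalties combine, the sketch below charges the GOP once per gap and the GEP for every gapped position. That convention is an assumption (some programs charge the GEP only from the second position onward), and the numeric values are illustrative rather than defaults quoted from the slides.

```python
def gap_cost(length, gop, gep):
    """Cost of one gap of `length` positions: GOP once plus GEP per position (assumed convention)."""
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.2  # illustrative values only

one_long_gap = gap_cost(4, GOP, GEP)
two_short_gaps = 2 * gap_cost(2, GOP, GEP)
print(one_long_gap, two_short_gaps)
```

With a GOP much larger than the GEP, one gap of four positions costs far less than two gaps of two positions, which is why such schemes favour a few long gaps over many short ones.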

41
Position-Specific gap penalties
  • Before any pair of (groups of) sequences is
    aligned, a table of GOPs is generated for each
    position in the two (sets of) sequences.
  • The GOP is manipulated in a position-specific
    manner, so that it can vary over the sequences.
  • If there is a gap at a position, the GOP and GEP
    penalties are lowered and the other rules do not
    apply.
  • This makes gaps more likely at positions where
    gaps already exist.

42
Discouraging too many gaps
  • If there is no gap opened, then the GOP is
    increased if the position is within 8 residues of
    an existing gap.
  • This discourages gaps that are too close
    together.
  • At any position within a run of hydrophilic
    residues, the GOP is decreased.
  • These runs usually indicate loop regions in
    protein structures.
  • A run of 5 hydrophilic residues is considered to
    be a hydrophilic stretch.
  • The default hydrophilic residues are
  • D, E, G, K, N, Q, P, R, S
  • But this can be changed by the user.
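
The two position tests described above can be sketched as follows. This only identifies the positions affected; the size of the GOP increase or decrease is not given on the slides and is not modelled. The residue set and the run length of 5 come from the slide, while the inclusive 8-residue window, the function names and the example sequence are assumptions.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default set listed on the slide


def hydrophilic_positions(seq, min_run=5):
    """0-based positions lying inside runs of at least `min_run` hydrophilic residues."""
    positions = set()
    start = None
    for i, residue in enumerate(seq + "#"):  # sentinel closes a run at the end
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                positions.update(range(start, i))
            start = None
    return positions


def near_gap_positions(gap_positions, length, window=8):
    """Ungapped positions within `window` residues of an existing gap (window assumed inclusive)."""
    return {i for i in range(length)
            if i not in gap_positions
            and any(abs(i - g) <= window for g in gap_positions)}


example = "AVLDEGKNQWFI"  # hypothetical sequence with a six-residue hydrophilic run
print(sorted(hydrophilic_positions(example)))
print(sorted(near_gap_positions({10}, 30)))
```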

43
Divergent Sequences
  • The most divergent sequences (most different, on
    average from all of the other sequences) are
    usually the most difficult to align.
  • It is sometimes better to delay their alignment
    until later (when the easier sequences have
    already been aligned).
  • The user has the choice of setting a cutoff
    (default is 40% identity).
  • This will delay the alignment until the others
    have been aligned.
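
One way to picture the cutoff is sketched below. It follows the slide's wording ("most different, on average") by comparing each sequence's mean percent identity to all others against the cutoff; the exact criterion ClustalW applies may differ. Identity is computed on the pre-aligned fragments from the overview slide, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of compared columns that are identical; columns with a gap are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def split_by_divergence(seqs, cutoff=40.0):
    """Return (align_first, delayed) based on mean identity to all other sequences."""
    align_first, delayed = [], []
    for name, seq in seqs.items():
        others = [percent_identity(seq, other)
                  for other_name, other in seqs.items() if other_name != name]
        mean_identity = sum(others) / len(others)
        (delayed if mean_identity < cutoff else align_first).append(name)
    return align_first, delayed


fragments = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

align_first, delayed = split_by_divergence(fragments)
print("align first:", align_first)
print("delayed:", delayed)
```

On these fragments only the myoglobin sequence falls below the 40% cutoff, so it would be held back until the four haemoglobin chains are aligned.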

44
Advice on progressive alignment
  • Progressive alignment is a mathematical process
    that is completely independent of biological
    reality.
  • Can be a very good estimate
  • Can be an impossibly poor estimate.
  • Requires user input and skill.
  • Treat cautiously
  • Can be improved by eye (usually)
  • Often helps to have colour-coding.
  • Depending on the use, the user should be able to
    make a judgement on those regions that are
    reliable or not.
  • For phylogeny reconstruction, only use those
    positions whose hypothesis of positional homology
    is unimpeachable

45
Alignment of protein-coding DNA sequences
  • It is not very sensible to align the DNA
    sequences of protein-coding genes.

ATGCTGTTAGGG
ATGCTCGTAGGG

ATGCT-GTTAGGG
ATGCTCGT-AGGG
The result might be highly-implausible and might
not reflect what is known about biological
processes. It is much more sensible to translate
the sequences to their corresponding amino acid
sequences, align these protein sequences and then
put the gaps in the DNA sequences according to
where they are found in the amino acid alignment.
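
The last step, putting gaps into the DNA according to the protein alignment, can be sketched as below. It assumes the DNA is exactly the coding sequence of the protein (three bases per residue, no stop codon), and the function name and the gapped example are hypothetical. The translation and protein alignment steps are left to other tools.

```python
def thread_gaps(aligned_protein, dna):
    """Copy the gaps of an aligned protein onto its coding DNA, one codon per residue."""
    residues = aligned_protein.replace("-", "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be three times the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


# Hypothetical protein alignment row with one gap after the second residue.
print(thread_gaps("ML-LG", "ATGCTGTTAGGG"))
# A row without gaps leaves the DNA untouched.
print(thread_gaps("MLLG", "ATGCTGTTAGGG"))
```

Under the standard genetic code the two DNA sequences on the slide translate to MLLG and MLVG, which align without any gaps, so a codon-aware alignment would not introduce the single-base gaps shown in the DNA-level alignment.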
46
Manual Alignment- software
  • GDE- The Genetic Data Environment (UNIX)
  • CINEMA- Java applet available from
  • http://www.biochem.ucl.ac.uk
  • Seqapp/Seqpup- Mac/PC/UNIX available from
  • http://iubio.bio.indiana.edu
  • SeAl for Macintosh, available from
  • http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
  • BioEdit for PC, available from
  • http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html