Title: Multiple Sequence Alignment
1 Chapter 5
Multiple Sequence Alignment
2
- Multiple alignment is an extension of pairwise alignment in which three or more sequences are aligned together
- This alignment provides insights not possible in pairwise alignments, such as
  - Conserved sequence patterns
  - Conserved and functionally critical amino acid residues
- Prerequisite for phylogenetic analyses
- Prediction of protein secondary and tertiary structures
- Design of degenerate PCR primers
3 Scoring Function
- The purpose of multiple alignment is to line up sequences so that a maximum number of residues from each sequence are matched according to a scoring function
- The scoring function is generally based on the sum of pairs (SP)
- The SP score is the sum of all pairwise scores for all residues in the alignment

Sequence 1   G  K  N
Sequence 2   T  R  N
Sequence 3   S  H  E

Pair scores by column (Blosum62 substitution matrix):
Column 1:  G-T = 1   T-S = 1   G-S = 0    Sum = 2
Column 2:  K-R = 2   R-H = 0   K-H = -1   Sum = 1
Column 3:  N-N = 6   N-E = 0   N-E = 0    Sum = 6
Total = 2 + 1 + 6 = 9

Thus the alignment is 2^9 = 512 times more likely than by random chance
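As a minimal sketch of the sum-of-pairs idea, the Python below sums a pairwise score over every pair of residues in every column. The `toy_score` function (identical residues score 2, different residues score -1) is a hypothetical stand-in for a real substitution matrix lookup such as Blosum62, and scoring gap pairs as 0 is likewise an assumption, so the numbers it produces are not Blosum62 scores.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    if a == "-" or b == "-":
        return 0  # assumption: pairs involving a gap contribute nothing
    return 2 if a == b else -1


def sum_of_pairs(alignment, score=toy_score):
    """Return the per-column SP scores and their total for equal-length rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    column_scores = []
    for column in zip(*alignment):
        # Every unordered pair of residues in the column is scored once.
        column_scores.append(sum(score(a, b) for a, b in combinations(column, 2)))
    return column_scores, sum(column_scores)


columns, total = sum_of_pairs(["GKN", "TRN", "SHE"])
print(columns, total)
```

Passing a function that looks scores up in a full substitution matrix as `score` would turn the same loop into a matrix-based SP score.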
4 Exhaustive Algorithms
Brute Force Algorithm
- Similar to dynamic programming algorithms that search for the best solution by examining every possible solution
- Pairwise alignment uses a 2D matrix
- For N sequences, use an N-dimensional matrix
- Number of calculations increases exponentially (N x N x N x N)
- Generally only useful for <10 short sequences
Divide and Conquer Alignment (DCA)
- Identify regional similarities in multiple sequences
- Do a brute force alignment of the similar regions
- Join the independently aligned regions
- http://bibiserv.techfak.uni-bielefeld.de/dca/
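To make the exponential growth concrete, here is a small sketch that counts the cells of the N-dimensional dynamic programming matrix. It assumes all sequences share the same length, and the function names are illustrative only.

```python
def dp_cells(num_sequences, length):
    """Cells in the full dynamic programming lattice: (length + 1) ** N."""
    return (length + 1) ** num_sequences


def predecessors_per_cell(num_sequences):
    """Each cell can be reached from up to 2 ** N - 1 neighbouring cells."""
    return 2 ** num_sequences - 1


# Work grows roughly as cells * predecessors, for sequences of length 100.
for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100), predecessors_per_cell(n))
```

Going from 2 to 3 sequences of length 100 already raises the cell count from about ten thousand to about a million, which is why exhaustive methods are limited to a handful of short sequences.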
6 Heuristic Algorithm
Progressive Alignment Method
- Pairwise alignment by Needleman-Wunsch of all pairs
- Records similarity scores of aligned pairs
- Scores entered into a matrix
- Guide tree constructed that reflects similarity between aligned pairs
- Most closely related sequences re-aligned with Needleman-Wunsch
- Different substitution matrices are selected depending on evolutionary distance between sequences to be aligned
- Aligned pair converted to a consensus sequence with fixed gaps
- Consensus sequence treated as an ordinary sequence for the next step, which is pairwise alignment with the most closely related sequence in the guide tree
- Next consensus sequence is calculated and the process is repeated until all sequences are aligned
- Most famous: ClustalW (command line), ClustalX (GUI)
- http://www.ebi.ac.uk/Tools/clustalw2/index.html
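The first steps of the progressive method (score all pairs, store the scores, pick the closest pair to align first) can be sketched as follows. This is an illustration only: it computes Needleman-Wunsch scores without a traceback, the match/mismatch/gap values are assumed rather than taken from ClustalW, and the three sequences are made up.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by Needleman-Wunsch (score only, no traceback)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def pairwise_scores(sequences):
    """Score every pair of named sequences; keys are (name_a, name_b)."""
    return {
        (x, y): nw_score(sequences[x], sequences[y])
        for x, y in combinations(sorted(sequences), 2)
    }


sequences = {"s1": "GATTACA", "s2": "GATTTACA", "s3": "CCGTAC"}  # made-up input
scores = pairwise_scores(sequences)
closest_pair = max(scores, key=scores.get)  # the pair a guide tree would join first
print(scores)
print("align first:", closest_pair)
```

A full implementation would convert these scores to distances, build the guide tree from them, and then follow the tree when merging alignments.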
7
Download and install ClustalW from ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.9/
Spend a few minutes entering sequences and doing alignments
8
- ClustalW uses gap penalties that are context sensitive
- Gaps cost more close to runs of hydrophobic amino acids (more likely to be in internal conserved regions of a protein) than next to hydrophilic regions or glycine (G), which are likely to be on the outside in loops
- Weighting scheme: closely related sequences are given a lower weighting score
- The weighting score depends on the branch length divided by the number of shared branches
- This has the effect of minimizing a possible dominating effect of common sequences
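One way to read the weighting rule is that each branch length on the path from the root to a sequence is divided by the number of sequences sharing that branch, and the shares are summed. The sketch below follows that reading. The tuple-based tree format, the branch lengths, and the function names are assumptions for illustration, not ClustalW's own data structures.

```python
def leaf_count(node):
    """Number of sequences below a node; a node is (branch_length, name_or_children)."""
    _, content = node
    if isinstance(content, str):
        return 1
    return sum(leaf_count(child) for child in content)


def sequence_weights(node, inherited=0.0, weights=None):
    """Sum branch_length / (sequences sharing the branch) along each root-to-leaf path."""
    if weights is None:
        weights = {}
    length, content = node
    share = inherited + length / leaf_count(node)
    if isinstance(content, str):
        weights[content] = share
    else:
        for child in content:
            sequence_weights(child, share, weights)
    return weights


# Hypothetical guide tree: A and B are close relatives, C is more distant.
tree = (0.0, [
    (0.1, [(0.2, "A"), (0.2, "B")]),
    (0.5, "C"),
])
weights = sequence_weights(tree)
print(weights)
```

In this toy tree the closely related A and B each receive a lower weight than the divergent C, which is the down-weighting effect described above.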
9 Drawbacks and Solutions
- Based on global alignment, thus only sequences of similar length can be aligned
- Long gaps required for alignment of sequences of dissimilar length are penalized
- Greedy algorithm: once gaps are introduced, they stay in subsequent consensus sequences
10 T-Coffee
- Tree-based Consistency Objective Function for alignment Evaluation
- http://www.ebi.ac.uk/Tools/t-coffee/
- http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
- Performs global pairwise alignment with Clustal
- Local pairwise alignment with Lalign
- Global and ten best local alignments are pooled to form a library
- Each pairwise alignment is then checked against every possible third sequence
- Distance matrix calculated to build a guide tree
- Guide tree used for final multiple alignment
- Does not get stuck in sub-optimal initial alignments
- Slower than Clustal
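The step that checks each aligned pair against a third sequence can be sketched as a weight update: a residue pair keeps its own library weight and gains the smaller of the two weights that link it through a residue of a third sequence. The code below is a simplified illustration of that idea. The library layout (residues as `(sequence_name, position)` tuples), the weights, and the function name are hypothetical, and it assumes the library only pairs residues from different sequences.

```python
from collections import defaultdict


def extend_library(library):
    """Add triplet support to each residue pair weight."""
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (x, y), weight in library.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight
        extended[tuple(sorted((x, y)))] += weight  # direct weight of the pair
    for links in neighbours.values():
        partners = sorted(links)
        for i, x in enumerate(partners):
            for y in partners[i + 1:]:
                if x[0] == y[0]:
                    continue  # both residues come from the same sequence
                # x and y are linked through a shared third residue.
                extended[(x, y)] += min(links[x], links[y])
    return dict(extended)


# Hypothetical library: first residues of sequences A, B and C, with made-up weights.
library = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("B", 0), ("C", 0)): 100,
}
extended = extend_library(library)
print(extended)
```

Pairs that are supported by many third sequences end up with high weights, so the later progressive alignment is guided by information from the whole data set rather than from one pair alone.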
11 dbClustal
- First performs BLASTP search for a query sequence
- Aligned pairs are analyzed to obtain anchor points (local conserved regions) using a program called Ballast
- Global alignment generated by Clustal, weighted toward anchor points
- Initial local alignment minimizes errors in divergent sequences
- Multiple alignment subsequently evaluated by NorMD, which removes poorly aligned sequences
- http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustalnoid
12 Partial Order Alignment (POA)
- http://bioinformatics.ucla.edu/poa/
- Multiple alignments performed on more and more sequences from a list
- Identical residues condensed to nodes
- Each new sequence aligned with each sequence of the graph model
- Eliminates the problem of error fixation
- Faster and more accurate than Clustal
13 PRALINE
- http://zeus.cs.vu.nl/programs/pralinewww/
- Builds profiles of sequences to be aligned
- Profiles generated by PSI-BLAST
- Because profiles contain information on close relatives, divergent sequences are more accurately aligned
- Program can incorporate protein secondary structure
- Very sophisticated but very slow
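To show what a profile holds, the sketch below builds the simplest kind: the frequency of each residue in each column of a set of aligned sequences. This is an assumption-laden simplification. PSI-BLAST profiles are position-specific scoring matrices with log-odds scores and pseudocounts, which this example does not attempt to reproduce, and the input sequences are made up.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)  # gap characters are counted like any other symbol
        total = len(column)
        profile.append({residue: n / total for residue, n in counts.items()})
    return profile


profile = build_profile(["GKN", "GRN", "SKE", "GKN"])
for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```

Aligning against such a table instead of a single sequence lets a rare residue still score well at a position where close relatives are known to carry it.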
14 Iterative Alignment
- PRRN
- Find optimal solution by iteratively modifying sub-optimal solutions
- http://prrn.ims.u-tokyo.ac.jp/
- Multiple alignment is performed on the whole group of sequences
- Sequences randomly distributed into two groups
- Dynamic programming applied to consensus sequences derived from each group
- The random split is repeated and another round of dynamic programming alignment performed
- This is repeated until the alignment score no longer increases
- A multiple alignment of the sequences is then performed again
- Process repeated until multiple alignment score no longer improves
15 Iterative Alignment
- DIALIGN2
- http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?formdialign
- Breaks all sequences down into segments, and performs alignment between segments
- High-scoring segments are progressively assembled into larger and larger blocks
- The score of an alignment is calculated from the blocks and not from individual residues
- Sequence regions between blocks are left unaligned
- Well suited to alignment of divergent sequences
16 Practical Issues
- DNA alignments are based on only 4 nucleotides, and are less reliable than protein sequence alignments
- Alignment of DNA sequences does not consider functional issues, such as gene boundaries
- Insertion of gaps may break codons or cause a frameshift that would not be tolerated in the protein, and is functional nonsense
- Thus, it is always better to align protein sequences
- Possible to convert DNA to amino acid sequence, then align, and then decode back to DNA
- RevTrans (http://www.cbs.dtu.dk/services/RevTrans/)
- PROTA2DNA (missing link)
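The "decode back to DNA" step can be sketched as copying each gap in the aligned protein onto the coding DNA as a three-base gap, so codons are never split. This is an illustrative sketch, not the RevTrans implementation. It assumes the DNA is in frame, has no trailing stop codon, and encodes exactly the residues in the protein; the function name is hypothetical.

```python
def thread_gaps_onto_dna(aligned_protein, dna, gap="-"):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon."""
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    # A protein gap becomes three gap characters; a residue consumes one codon.
    return "".join(gap * 3 if ch == gap else next(codons) for ch in aligned_protein)


print(thread_gaps_onto_dna("MK-L", "ATGAAACTG"))
```

Applying the function to every sequence in a protein alignment gives a DNA alignment whose gaps always fall on codon boundaries.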
17 Editing and Format
- Most alignment programs require final editing by a human to ensure that there are no problems in functionality
  - Finding badly aligned regions
  - Removing nonsensical gaps etc.
- http://www.mbio.ncsu.edu/bioEdit/bioedit.html
- Need to convert one sequence format to another: http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi/
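Some clean-up is mechanical enough to script before manual editing. As one hedged example, the sketch below removes columns that contain only gaps, which typically appear after sequences have been deleted from an alignment. The function name is hypothetical, and deciding which other gaps are nonsensical still needs a human.

```python
def drop_all_gap_columns(alignment, gap="-"):
    """Remove columns in which every sequence has a gap."""
    kept = [col for col in zip(*alignment) if any(ch != gap for ch in col)]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into rows.
    return ["".join(row) for row in zip(*kept)]


print(drop_all_gap_columns(["A--C", "A-GC"]))
```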