Title: Multiple Sequence Alignment
1Multiple Sequence Alignment
Julie Thompson
Laboratory of Integrative Bioinformatics and Genomics
IGBMC, Strasbourg, France
julie_at_igbmc.fr
2Multiple Sequence Alignment
- Introduction: what is a multiple alignment?
- Multiple alignment construction
- Traditional approaches: optimal, progressive
- Alignment parameters
- Iterative and co-operative approaches
- Multiple alignment analysis
- Quality analysis/error detection
- Conserved/homologous regions
- Multiple alignment applications
3What is a multiple alignment?
- a representation of a set of sequences, where equivalent residues (e.g. functional, structural) are aligned in rows or, more usually, columns
Example: part of an alignment of SH2 domains from 14 sequences
[Figure legend: conserved identical residues; conserved similar residues]
4What is a multiple alignment?
conserved residues
secondary structure
conservation profile
5Multiple Sequence Alignment
- Introduction: what is a multiple alignment?
- Multiple alignment construction
- Traditional approaches: optimal, progressive
- Alignment parameters
- Iterative and co-operative approaches
- Multiple alignment analysis
- Quality analysis/error detection
- Conserved/homologous regions
- Multiple alignment applications
6Multiple Alignment Construction
- Optimal multiple alignment
- example: MSA (Lipman et al. 1989, Gupta et al. 1995)
7Optimal multiple alignment
Extension of dynamic programming for 2 sequences => N dimensions
Example: alignment of 3 sequences
Problem: calculation time and memory requirements
Time proportional to N^k for k sequences of length N => limited to fewer than 10 sequences
Alignment of 5 sulfate binding proteins, length 224-263 residues: MSA OMA ClustalW >12hours 6 2.9min 0.6sec
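The N^k growth can be made concrete with a small sketch. It only counts the cells of the full dynamic-programming lattice and the predecessor cells examined per cell; the function names are illustrative and it does not reproduce the pruning used by programs such as MSA.

```python
def dp_cells(lengths):
    """Number of cells in the full DP lattice for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra layer per dimension for the empty prefix
    return cells


def predecessors_per_cell(k):
    """Each cell is reached from every non-empty subset of the k sequences advancing."""
    return 2 ** k - 1


if __name__ == "__main__":
    for k in (2, 3, 5):
        lengths = [250] * k  # roughly the length of the sulfate binding proteins
        print(k, dp_cells(lengths), predecessors_per_cell(k))
```

For 2 sequences of 250 residues the lattice has about 6 x 10^4 cells; for 5 such sequences it has about 10^12, which is why the exact method is limited to a handful of sequences.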
8Multiple Alignment Construction
- Optimal multiple alignment
- MSA, OMA
- Progressive multiple alignment
- ClustalW (Thompson et al. NAR. 1994)
- ClustalX (Thompson et al. NAR. 1997)
9Progressive multiple alignment
Idea: progressively align pairs of sequences (or groups of sequences)
10Progressive multiple alignment
1) Pairwise alignments of all sequences
The alignment can be obtained by:
- local or global method
- dynamic programming or heuristic method (e.g. k-tuple count)
Ex: local pairwise alignments of globin sequences
Hbb_human 3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hbb_horse 1 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
Hbb_horse 3 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
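As an illustration of the dynamic programming option for step 1, here is a minimal global (Needleman-Wunsch style) pairwise alignment. The match, mismatch and gap values are arbitrary assumptions chosen for readability; real programs use a similarity matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap cost (illustrative scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("LSPADKTNVKAAWGKV", "LTPEEKSAVTALWGKV"))
```

A local (Smith-Waterman style) variant would clamp cell scores at zero and trace back from the highest-scoring cell instead of the corner.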
11Progressive multiple alignment
2) Construction of a distance matrix
Example in ClustalW/X: distance between 2 sequences = 1 - (No. identical residues / No. aligned residues)
Ex: 7 globin sequences
     1    2    3    4    5    6    7
1    -
2   .17   -
3   .59  .60   -
4   .59  .59  .13   -
5   .77  .77  .75  .75   -
6   .81  .82  .73  .74  .80   -
7   .87  .86  .86  .88  .93  .90   -
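The distance used in step 2 can be sketched as follows. This is a minimal version that assumes positions where either sequence has a gap ("-" or ".") are not counted as aligned residues; the function names are illustrative and `align` stands for any pairwise alignment routine.

```python
def pairwise_distance(aligned_a, aligned_b, gap_chars="-."):
    """1 - (identical residues / aligned residues) for two aligned sequences."""
    aligned = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x in gap_chars or y in gap_chars:
            continue  # assumption: gapped positions are not aligned residues
        aligned += 1
        if x == y:
            identical += 1
    if aligned == 0:
        return 1.0  # nothing aligned: treat as maximally distant
    return 1 - identical / aligned


def distance_matrix(sequences, align):
    """Full symmetric matrix; `align(a, b)` must return the two aligned strings."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = align(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = pairwise_distance(a, b)
    return matrix


if __name__ == "__main__":
    print(pairwise_distance("ACDE", "ACDF"))
```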
12Progressive multiple alignment
3) Decide order of alignment
- Sequential branching
- Construction of a guide tree
- - Neighbor-Joining (NJ)
- - UPGMA
- - Maximum likelihood
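Of the guide-tree methods listed, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and average their distances to the rest, weighted by cluster size. The code below is an illustrative implementation, not the one used by any particular program; the example uses the distances between globin sequences 1 to 4 from the matrix on the previous slide.

```python
def upgma(names, matrix):
    """Build a guide tree (nested tuples) from a full symmetric distance matrix."""
    clusters = [(name, 1) for name in names]  # (subtree, number of leaves)
    dist = [row[:] for row in matrix]
    while len(clusters) > 1:
        size = len(clusters)
        # Closest pair of clusters (first one found wins ties).
        i, j = min(((p, q) for p in range(size) for q in range(p + 1, size)),
                   key=lambda pq: dist[pq[0]][pq[1]])
        (tree_i, n_i), (tree_j, n_j) = clusters[i], clusters[j]
        keep = [k for k in range(size) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to the others.
        new_row = [(dist[i][k] * n_i + dist[j][k] * n_j) / (n_i + n_j) for k in keep]
        dist = [[dist[r][c] for c in keep] for r in keep]
        for row, value in zip(dist, new_row):
            row.append(value)
        dist.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), n_i + n_j)]
    return clusters[0][0]


if __name__ == "__main__":
    names = ["1", "2", "3", "4"]
    matrix = [[0.00, 0.17, 0.59, 0.59],
              [0.17, 0.00, 0.60, 0.59],
              [0.59, 0.60, 0.00, 0.13],
              [0.59, 0.59, 0.13, 0.00]]
    print(upgma(names, matrix))  # (('3', '4'), ('1', '2'))
```

The nested tuples give the order of alignment: sequences 3 and 4 are aligned first, then 1 and 2, then the two resulting groups.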
13Progressive multiple alignment
4) Progressive multiple alignment
The sequences are aligned progressively (global or local algorithm):
- alignment of 2 sequences
- alignment of 1 sequence and a profile (group of sequences)
- alignment of 2 profiles (groups of sequences)
[Figure: schematic of single sequences and groups of sequences being aligned]
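One way to picture a profile is as a list of per-column residue frequencies computed from a group of already aligned sequences. The sketch below is an assumption about representation, not ClustalW's internal data structure, and the `identity` score is a stand-in for a real similarity matrix.

```python
from collections import Counter


def build_profile(aligned_group):
    """Return one {symbol: frequency} dict per alignment column (gaps included)."""
    length = len(aligned_group[0])
    if any(len(seq) != length for seq in aligned_group):
        raise ValueError("all sequences in a group must have the same aligned length")
    profile = []
    for column in zip(*aligned_group):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


def column_pair_score(col_a, col_b, score):
    """Average score over all residue pairs drawn from two profile columns."""
    return sum(fa * fb * score(x, y)
               for x, fa in col_a.items()
               for y, fb in col_b.items())


def identity(x, y):
    """Placeholder scoring function: 1 for identical symbols, 0 otherwise."""
    return 1 if x == y else 0


if __name__ == "__main__":
    p1 = build_profile(["AC", "AG"])
    p2 = build_profile(["AG"])
    print(column_pair_score(p1[1], p2[1], identity))  # 0.5
```

Aligning a sequence to a profile, or two profiles to each other, then uses the same dynamic programming as for two sequences, with `column_pair_score` replacing the single-residue score.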
14Progressive multiple alignment
[Figure: guide tree with leaves H1, H3, H2, H4, H6, H7, H5]
15Progressive multiple alignment
Guide tree method    Global               Local
SB                   multal               SBpima
NJ                   clustalx
UPGMA                multalign, pileup
ML                                        MLpima
SB - sequential branching; NJ - neighbor-joining; UPGMA - Unweighted Pair Group Method with Arithmetic mean; ML - maximum likelihood
16Alignment parameters: similarity matrices
Dynamic programming methods score an alignment using residue similarity matrices, containing a score for matching all pairs of residues
For nucleotide sequences:
- Transitions (A-G or C-T) are more frequent than transversions (e.g. A-T or C-G)
- More complex matrices exist where matches between ambiguous nucleotides are given values whenever there is any overlap in the sets of nucleotides represented
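A nucleotide scoring scheme along these lines might look like the sketch below. The numeric scores are arbitrary assumptions; only the purine/pyrimidine classes and the IUPAC codes R, Y and N are standard.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Subset of the IUPAC ambiguity codes, mapped to the nucleotides they represent.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def nucleotide_score(x, y, match=2, transition=-1, transversion=-2):
    """Score two unambiguous nucleotides; transitions are penalised less."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def ambiguous_overlap(x, y):
    """True when the nucleotide sets represented by two IUPAC codes overlap."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


if __name__ == "__main__":
    print(nucleotide_score("A", "G"), nucleotide_score("A", "T"))
    print(ambiguous_overlap("R", "A"), ambiguous_overlap("R", "Y"))
```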
17Alignment parameters: similarity matrices
For proteins, a wide variety of matrices exist: Identity, PAM, Blosum, Gonnet etc.
Matrices are generally constructed by observing the mutations in large sets of alignments, either sequence-based or structure-based
Matrices range from strict ones for comparing closely related sequences to soft ones for very divergent sequences.
e.g. PAM250 corresponds to an evolutionary distance of 250 PAM, or approximately 80% residue divergence; PAM1 corresponds to less than 1% divergence
18Alignment parameters: similarity matrices
A single best matrix does not exist!
- Altschul (1991) suggests PAM250 for related sequences, PAM120 when the sequences are not known to be related and PAM40 to search for short segments of highly similar sequences.
- Henikoff and Henikoff (1993) suggest Blosum62 as a good all-round matrix, Blosum45 for more divergent sequences and Blosum100 for strongly related sequences
- ClustalW automatically selects a suitable matrix depending on the observed pairwise identity. By default:
- - ID > 35%: Gonnet 80
- - 35% > ID > 25%: Gonnet 250
- - ID < 25%: Gonnet 350
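The default selection rule can be written as a small lookup. The thresholds are those given on the slide; how the exact boundary values (35% and 25%) are handled is an assumption here, and the function name is illustrative rather than part of ClustalW.

```python
def choose_gonnet_matrix(percent_identity):
    """Pick a Gonnet matrix from the observed pairwise identity (in percent)."""
    if percent_identity > 35:
        return "Gonnet 80"
    if percent_identity > 25:
        return "Gonnet 250"
    return "Gonnet 350"  # assumption: boundary values fall into the softer matrix


if __name__ == "__main__":
    for identity in (50, 30, 10):
        print(identity, choose_gonnet_matrix(identity))
```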
19Alignment parameters: gap penalties
- A gap penalty is a cost for introducing gaps into the alignment, corresponding to insertions or deletions in the sequences
SFGDLSNPGAVMG
HF-DLS-----HG
- proportional gap costs charge a fixed penalty for each residue aligned with a gap: the cost of a gap is proportional to its length
GAP_COST = uk, where k is the length of the gap
- linear or affine gap costs define a cost for introducing or opening a gap, plus a length-dependent extension cost
GAP_COST = v + uk, where v is the gap opening cost and u is the gap extension cost
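The two gap cost models can be compared on the example row HF-DLS-----HG, which contains one gap of length 1 and one of length 5. The values v = 10 and u = 1 below are illustrative assumptions, not program defaults.

```python
import re


def gap_lengths(row):
    """Lengths of the runs of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", row)]


def proportional_gap_cost(k, u):
    """GAP_COST = u * k"""
    return u * k


def affine_gap_cost(k, v, u):
    """GAP_COST = v + u * k for a gap of length k > 0"""
    return v + u * k if k > 0 else 0


if __name__ == "__main__":
    row = "HF-DLS-----HG"
    lengths = gap_lengths(row)                                     # [1, 5]
    print(sum(proportional_gap_cost(k, u=1) for k in lengths))     # 6
    print(sum(affine_gap_cost(k, v=10, u=1) for k in lengths))     # 26
```

With the proportional model, two separate gaps cost the same as one gap of length 6; the affine model charges the opening cost v twice, so it favours fewer, longer gaps.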
20Alignment parameters: gap penalties
- ClustalW uses position-specific gap penalties to make gaps more or less likely at different positions in the alignment
- Gap penalties are lowered at existing gaps and increased near to existing gaps
- Gap penalties are lowered in hydrophilic stretches
- Otherwise, gap opening penalties are modified according to the observed relative frequencies of residues adjacent to gaps (Pascarella and Argos, 1992)
Goal: to introduce gaps in sequence segments corresponding to flexible regions of the protein structure
21Multiple Alignment Construction
- Optimal multiple alignment
- MSA, OMA
- Progressive multiple alignment
- ClustalW, ClustalX
- Iterative multiple alignment
- PRRP (Gotoh, 1993)
- SAGA (Notredame et al. NAR. 1996)
- DIALIGN (Morgenstern et al. 1999)
- HMMER (Eddy 1998), SAM (Karplus et al. 2001)
22Iterative refinement
PRRP (Gotoh, 1993) refines an initial progressive
multiple alignment by iteratively dividing the
alignment into 2 profiles and realigning them.
[Flowchart: initial alignment (global progressive) -> divide sequences into 2 groups -> profile 1 and profile 2 -> pairwise profile alignment -> refined alignment, with a loop back labelled "no"]
23Genetic Algorithms
SAGA (Notredame et al. 1996) evolves a population of alignments in a quasi-evolutionary manner, iteratively improving the fitness of the population
24Segment-to-segment alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences instead of single residues
1. construct dot-plots of all possible pairs of sequences (Sequence i against Sequence j)
2. find a maximal set of consistent diagonals in all the sequences
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
3. Local alignment: residues between the diagonals are not aligned
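Step 1 can be illustrated with a deliberately simplified dot-plot scan that reports runs of identical residues along each diagonal. This is only a sketch: Dialign itself weights gap-free segment pairs by similarity rather than requiring identity, and the `min_len` threshold is an assumption.

```python
def diagonal_segments(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for identical runs on each diagonal."""
    segments = []
    for offset in range(-(len(b) - 1), len(a)):  # offset = i - j
        i = max(offset, 0)
        j = i - offset
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    segments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run that reaches the end of the diagonal
            segments.append((i - run, j - run, run))
    return segments


if __name__ == "__main__":
    print(diagonal_segments("xxKGDyy", "zKGDz"))  # [(2, 1, 3)]
```

Step 2 would then select, across all sequence pairs, a set of such diagonals that can coexist in one multiple alignment.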
25Multiple alignment methods
Progressive:
Guide tree method    Global               Local
SB                   multal               SBpima
NJ                   clustalx
UPGMA                multalign, pileup
ML                                        MLpima
Iterative: prrp; dialign; saga (genetic algorithm); hmmt (HMM)
26League Table based on BAliBASE benchmark database
Comparison of programs
BAliBASE reference sets:
- Reference 1: < 6 sequences (columns: < 100 residues, > 400 residues, all)
- Reference 2: a family with an orphan
- Reference 3: several sub-families
- Reference 4: long N/C terminal extensions
- Reference 5: long insertions
[Table residue: rows grouped as GLOBAL and LOCAL programs, each with an iterative subgroup; several cells are N/A]
- Iterative algorithms can improve alignment
quality, but can be slow
- Global algorithms work well when sequences are homologous over their full lengths; local algorithms are better for non-colinear sequences
Thompson et al. 1999
27Multiple Alignment Construction
- Optimal multiple alignment
- MSA, OMA
- Progressive multiple alignment
- ClustalW, ClustalX
- Iterative multiple alignment
- PRRP, SAGA, DIALIGN, HMMER, SAM
- Co-operative multiple alignment
- T-COFFEE (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/
- DbClustal (Thompson et al. 2000) http://www-igbmc.u-strasbg.fr/BioInfo/
- MAFFT (Katoh et al. 2002) http://www.biophys.kyoto-u.ac.jp/katoh/programs/align/mafft/
- MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
- Probcons (Do et al. 2005)
- Kalign (Lassmann et al. 2005)
28DbClustal
http://bips.u-strasbg.fr/PipeAlign/
Blast Database Search
Query Sequence
Database Hits
Domain A
Domain B
Domain C
29Comparison ClustalW / DbClustal
ClustalW
DbClustal
30MAFFT
- Local homologous segments detected using a Fast Fourier Transform
- Pairwise alignments are performed using restricted global dynamic programming
- Multiple alignment is built up using a progressive algorithm, similar to ClustalW
- Multiple alignment is then iteratively refined by dividing the alignment into 2 parts and realigning
31MAFFT
Pairwise alignments
1. Fast Fourier Transform to detect local conserved segments
[Plot: correlation c(k) against positional lag k]
2. Segment Level Dynamic Programming to select consistent segments
3. Fix residues at the centre of each segment pair and realign between fixed points (white regions only)
32State-of-the-art
- Co-operative algorithms have led to significant
improvements
[Figure: BAliBASE 3 results by reference set. Ref 11: < 20% ID; Ref 12: 20-40% ID; Ref 2: orphan; Ref 3: sub-families; Ref 4: extensions; Ref 5: insertions]
but none of the methods currently available
are capable of producing high-quality alignments
for all test cases
Thompson et al. 2005, 2006
33RNA alignment methods
- Comparison using BRAliBASE RNA structure alignments (Gardner et al, 2005)
- Above 60% identity, sequence and structure based approaches have similar scores
- Algorithms incorporating structural information outperform pure sequence methods. However, these algorithms are computationally demanding, which severely limits their use in practice.
- Some more recent methods
- - Sequence: R-Coffee (Wilm, 2008), MAFFT (Katoh, 2008)
- - Structure: LARA (Bauer, 2007), FoldalignM (Torarinsson, 2007), SCARNA (Tabei, 2008)
34DNA alignment methods
- Complete genomes
- Local alignments (BlastZ, MultiZ, MUMmer, ...)
- Global alignments (MGA, Multi-LAGAN, MAVID, MAUVE, MAP2, Mulan, ...)
Reviewed in Dewey and Pachter, Human Molecular Genetics, 2006
35Multiple Sequence Alignment
- Introduction: what is a multiple alignment?
- Multiple alignment construction
- Traditional approaches: optimal, progressive
- Alignment parameters
- Iterative and co-operative approaches
- Multiple alignment analysis
- Quality analysis/error detection
- Conserved/homologous regions
- Multiple alignment applications
36Multiple alignment analysis
- Are the sequences correctly aligned?
- Quality analysis: alignment objective functions (SP, NorMD)
- error detection and correction (RASCAL, Refiner)
- Are the sequences in the alignment homologous?
- Conserved/homologous regions (MCOFFEE, LEON)
- Conserved (functional) residues
37Objective functions
Sum-of-pairs (Carrillo and Lipman, 1988): sum of scores for all pairs of sequences
Blosum62 scores: N-N = 6, N-C = -3, C-C = 9
Sequence 1  N N N
Sequence 2  N N N
Sequence 3  N N C
Sequence 4  N C C
Seq1-2: 3 pairs N-N: 3x6 = 18
Seq1-3: 2 pairs N-N, 1 pair N-C: 2x6 + (-3) = 9
Seq1-4: 1 pair N-N, 2 pairs N-C: 6 + 2x(-3) = 0
Seq2-3: 2 pairs N-N, 1 pair N-C: 2x6 + (-3) = 9
Seq2-4: 1 pair N-N, 2 pairs N-C: 6 + 2x(-3) = 0
Seq3-4: 1 pair N-N, 1 pair N-C, 1 pair C-C: 6 + (-3) + 9 = 12
Total = 48
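The worked example can be checked with a few lines of code. The score table holds only the three Blosum62 values quoted on the slide; the function name is illustrative.

```python
from itertools import combinations

# Blosum62 entries used in the slide's example.
SCORES = {("N", "N"): 6, ("N", "C"): -3, ("C", "N"): -3, ("C", "C"): 9}


def sum_of_pairs(alignment, scores):
    """Sum the column-by-column scores over every pair of aligned sequences."""
    total = 0
    for a, b in combinations(alignment, 2):
        total += sum(scores[(x, y)] for x, y in zip(a, b))
    return total


if __name__ == "__main__":
    alignment = ["NNN", "NNN", "NNC", "NCC"]
    print(sum_of_pairs(alignment, SCORES))  # 48
```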
- Information content (Hertz et al, 1999)
- Entropy column scores (between 0 and 1), summed over all columns in the alignment
- norMD (Thompson et al, 2001)
- Column scores
- normalisation for sequence set to be aligned (number, length, similarity)
- < 0.3: bad alignment
- 0.3-0.7: some local errors
- > 0.7: good alignment
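A column score between 0 and 1 can be derived from Shannon entropy, for example as 1 - H / H_max. The normalisation below (20-letter alphabet, gaps ignored) is an assumption made for illustration and is not necessarily the formula of Hertz et al. or of norMD.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20, gap_chars="-."):
    """1.0 for a fully conserved column, lower as residues become more mixed."""
    residues = [r for r in column if r not in gap_chars]
    if not residues:
        return 0.0  # assumption: an all-gap column carries no information
    n = len(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(residues).values())
    return 1 - entropy / math.log(alphabet_size)


def alignment_score(alignment):
    """Sum of the column scores over all columns of the alignment."""
    return sum(column_conservation(col) for col in zip(*alignment))


if __name__ == "__main__":
    print(alignment_score(["NNN", "NNN", "NNC", "NCC"]))
```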
38Objective functions NorMD
Window length 8
Window length 40
39Error detection and correction
- RASCAL (Thompson et al, 2003), Refiner (Chakrabarti et al, 2006)
RASCAL
40Error detection and correction
- RASCAL, errors within core blocks
metalloprotease
41Error detection and correction
- RASCAL, errors between core blocks
methyltransferase
42Homology detection methods
- Sequence percent identity
- > 30% identity => sequences are homologous
- 15-30% identity => twilight zone
- local analysis of positional conservation
- AL2CO (Pei, Grishin, 2001), SEGID (Wang,Zu,2003), NorMD
- Conserved regions
- LEON (Thompson et al, 2004), MCOFFEE (Moretti et al, 2007)
43Homology analysis with LEON
- vertical analysis: sequence clustering, intermediate sequences
- horizontal analysis: residue conservation, motif context information
- composition analysis: prediction of compositionally biased segments
- Homologous regions are delineated
- Removal of sequences non-homologous to query
44Homology analysis with LEON
Query sequence DKK1_HUMAN
BlastP results
DKK1_HUMAN Dickkopf related protein-1 precursor 1e-151
DKK3_MOUSE Dickkopf related protein-3 precursor 8e-07
TXCA_CAEEX Neurotoxic peptide caeron precursor. 0.007
PRK1_RAT Prokineticin 1 precursor 0.021
VPRA_DENPO Intestinal toxin 1 _MIT 0.10
Q8BKK7 MEGF11 protein. 0.10
COL_RABIT Colipase precursor. 0.13
PRK2_HUMAN Prokineticin 2 precursor 0.17
Q7XZ34 Growth factor _Fragment_. 0.17
1imt_ VENOM. MAMBA INTESTINAL TOXIN 1, 0.23
Q863H5 Bv8/prokineticin 2-like protein. 0.30
VE6_RHPV1 E6 protein. 1.1
COL_CANFA Colipase precursor. 3.3
Q9Y7V5 Conidiospore surface protein. 3.3
COLA_HORSE Procolipase A precursor _Fragment_. 4.3
O00508 Latent TGF-beta binding protein-4. 5.6
1pco_ LIPASE PROTEIN COFACTOR. 7.3
Q8SRF4 GTP binding protein. 7.3
NTC1_MOUSE Neurogenic locus notch homolog 9.6
45Homology analysis with LEON
dkk1
dkk2
dkk3
Prokineticin / Intestinal toxin
Lipase protein cofactor
46Structural proteomics target characterisation
Detection of structural homologs for targets in
the SPINE (Structural Proteomics in Europe)
project
47Conserved residue analysis
- Active site residues are under evolutionary pressure to maintain their functional integrity and undergo fewer mutations than less functionally important amino acids
- Methods
- Evolutionary trace (Lichtarge et al, 1996): sequence conservation patterns in homologous proteins are mapped onto the protein surface to generate clusters identifying functional interfaces
48Conserved residue analysis
- Comparison of sequence-based methods
- FRcons combines information
- conservation at each site
- amino acid distribution
- predicted secondary structure (ss)
- predicted relative solvent accessibility (rsa)
FRcons Fischer et al. Bioinformatics 2008
49OrdAli: Ordered Alignment Analysis
color scheme
- residues conserved in all sequences in family
- - structural or functional importance, characteristic motifs
- residues conserved within a sub-group of sequences
- - discriminant residues
50Schematic alignment of aspartyl-tRNA synthetases
- universal proteins, play a key role in translation
[Figure: schematic alignment over residue positions 180-560 and 690-930. Labelled regions: anticodon binding domain, motif I, flipping loop, motif II, insertion domain, catalytic core I, motif III, catalytic core II. Conserved residues shown: P, L, Q, PQ, KQ, R, G, H. Colour key: family conserved, Archaea/Bacteria, Archaea/Eukaryote, Euc, Arc, Bac]
51PipeAlign automatic protein analysis
http://www-igbmc.u-strasbg.fr/PipeAlign/
53Multiple sequence alignment editors
No automatic method is 100% reliable: manual verification and refinement is essential!
- SeqLab: GCG Wisconsin Package
- SeaView (Galtier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html (UNIX/Linux, Windows 95, MAC OS 8, 9, X)
WEB servers:
- GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
- Jalview (Clamp, 1998) http://www.ebi.ac.uk/michele/jalview/
- CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
54Multiple Sequence Alignment
- Introduction: what is a multiple alignment?
- Multiple alignment construction
- Traditional approaches: optimal, progressive
- Alignment parameters
- Iterative and co-operative approaches
- Multiple alignment analysis
- Conserved/homologous regions
- Quality analysis/error detection
- Multiple alignment applications
55Central role of multiple alignments
domain structure
conserved, functional sites
56Central role of multiple alignments
Multiple alignment
57Example protein, RNA complexes
ASP tRNA
ASP tRNA synthetase
aspRS, tRNA interactions
Ruff et al, 1991
58Example: Bardet-Biedl Syndrome
Identification of new genes responsible for BBS
- a rare autosomal recessive genetic disease, probably caused by a defect at the basal body of ciliated cells
- Phenotypes: obesity, retinopathy, polydactyly, mental retardation, hypogonadism, renal failure
- 9 genes are known to be involved: BBS1 to BBS9
In a comparative genomics study, Li et al. (2004) identified 688 genes implicated in cilia and flagella
BBS10 gene shows a high frequency of mutation (20% of patients)
- Clinical studies have identified a candidate chromosomal region of 8 Mb with approx. 23 genes
- including 4 genes from the set of 688
J. Muller et al 2006