Multiple Sequence Alignment

Slides: 59

Provided by: LEC64

Transcript and Presenter's Notes

1
Multiple Sequence Alignment
Julie Thompson
Laboratory of Integrative Bioinformatics and Genomics
IGBMC, Strasbourg, France
julie_at_igbmc.fr
2
Multiple Sequence Alignment

Introduction: what is a multiple alignment?
Multiple alignment construction
Traditional approaches: optimal, progressive
Alignment parameters
Iterative and co-operative approaches
Multiple alignment analysis
Quality analysis/error detection
Conserved/homologous regions
Multiple alignment applications

3
What is a multiple alignment?

A multiple alignment is a representation of a set of sequences in which
equivalent residues (e.g. functional, structural) are aligned in rows or,
more usually, columns.

[Figure: part of an alignment of SH2 domains from 14 sequences,
highlighting conserved identical residues and conserved similar residues]
4
What is a multiple alignment?
conserved residues
secondary structure
conservation profile
6
Multiple Alignment Construction

Optimal multiple alignment
Example: MSA (Lipman et al. 1989, Gupta et al. 1995)

7
Optimal multiple alignment
Extension of pairwise dynamic programming (2 sequences) to N dimensions
Example: alignment of 3 sequences
Problem: calculation time and memory requirements
Time is proportional to N^k for k sequences of length N
=> limited to fewer than 10 sequences
Alignment of 5 sulfate binding proteins, length 224-263 residues
MSA, OMA, ClustalW: > 12 hours, 6, 2.9 min, 0.6 sec
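
To make the N^k growth concrete, here is a minimal sketch that counts the cells of a full dynamic-programming lattice. It assumes all k sequences have the same length N and that each axis has N + 1 positions (one extra for the leading gap row). The function name is illustrative and is not taken from MSA, OMA or ClustalW.

```python
def dp_lattice_cells(n_residues: int, k_sequences: int) -> int:
    """Number of cells in a full k-dimensional dynamic-programming lattice.

    Assumes k sequences of equal length n_residues; each axis has
    n_residues + 1 positions because of the initial gap row/column.
    """
    return (n_residues + 1) ** k_sequences


if __name__ == "__main__":
    # Growth of the lattice for sequences of about 250 residues.
    for k in (2, 3, 5, 10):
        print(f"k={k:2d}  cells={dp_lattice_cells(250, k):.3e}")
```

Each added sequence multiplies the lattice size by roughly N, which is why the exhaustive approach stops being practical after a handful of sequences.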
8
Multiple Alignment Construction

Optimal multiple alignment
MSA, OMA
Progressive multiple alignment
ClustalW (Thompson et al. NAR. 1994)
ClustalX (Thompson et al. NAR. 1997)

9
Progressive multiple alignment
Idea: progressively align pairs of sequences
(or groups of sequences)
10
Progressive multiple alignment
1) Pairwise alignments of all sequences
The alignment can be obtained by:
- a local or global method
- dynamic programming or a heuristic method (e.g. K-tuple count)

Ex: local pairwise alignments of globin sequences
(the match lines between sequences are not recoverable)

Hbb_human 3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hbb_horse 1 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
Hbb_horse 3 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
11
Progressive multiple alignment
2) Construction of a distance matrix
Example in ClustalW/X: distance between 2 sequences =
    1 - (No. identical residues / No. aligned residues)

Ex: 7 globin sequences
1    -
2   .17   -
3   .59  .60   -
4   .59  .59  .13   -
5   .77  .77  .75  .75   -
6   .81  .82  .73  .74  .80   -
7   .87  .86  .86  .88  .93  .90   -
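
The distance used in this step can be sketched in a few lines. The code below assumes the two sequences are already aligned (same length, gaps written as "-" or ".") and that columns containing a gap are not counted as aligned residues. That convention, and the function name, are assumptions made for illustration rather than a transcription of the ClustalW source.

```python
GAP_CHARS = {"-", "."}


def pairwise_distance(aligned_a: str, aligned_b: str) -> float:
    """Return 1 - (identical residues / aligned residues) for two gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # assumption: gapped columns are not 'aligned residues'
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    if aligned == 0:
        return 1.0  # assumption: nothing aligned means maximal distance
    return 1.0 - identical / aligned


if __name__ == "__main__":
    # Toy fragments, not the globin data from the slide.
    print(pairwise_distance("ACDE-FG", "ACDQWFG"))
```

Applying this function to every pair of sequences fills the lower triangle of a matrix like the one shown for the 7 globins.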
12
Progressive multiple alignment

3) Decide order of alignment
- Sequential branching
- Construction of a guide tree:
  - Neighbor-Joining (NJ)
  - UPGMA
  - Maximum likelihood
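
As an illustration of how a guide tree fixes the order of alignment, here is a minimal UPGMA sketch. UPGMA is chosen only because it is the shortest of the listed methods to write down; it is not a statement about which method any particular program uses. The input format (a dict keyed by frozenset pairs) and the nested-tuple output are assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree (nested tuples) by UPGMA.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    toy = {
        frozenset(("A", "B")): 1.0,
        frozenset(("A", "C")): 4.0,
        frozenset(("B", "C")): 4.0,
    }
    print(upgma(["A", "B", "C"], toy))
```

Reading the nested tuples from the inside out gives the alignment order: the innermost pair is aligned first, and each enclosing tuple is a later sequence-to-profile or profile-to-profile step.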
13
Progressive multiple alignment
4) Progressive multiple alignment
The sequences are aligned progressively (global or local algorithm):
- alignment of 2 sequences
- alignment of 1 sequence and a profile (group of sequences)
- alignment of 2 profiles (groups of sequences)
[Figure: schematic of sequences and profiles being aligned]
14
Progressive multiple alignment
[Figure labels: H1, H3, H2, H4, H6, H7, H5]
15
Progressive multiple alignment
Guide tree   Global               Local
SB           multal               SBpima
NJ           clustalx
UPGMA        multalign, pileup
ML                                MLpima

SB - sequential branching
UPGMA - Unweighted Pair Grouping Method
ML - maximum likelihood
NJ - neighbor-joining
16
Alignment parameters: similarity matrices
Dynamic programming methods score an alignment using residue
similarity matrices, which contain a score for matching every
pair of residues.
For nucleotide sequences:
Transitions (A-G or C-T) are more frequent than
transversions (e.g. A-T or C-G).
More complex matrices exist in which matches between
ambiguous nucleotides are given values whenever
there is any overlap in the sets of nucleotides
represented.
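
A nucleotide scoring scheme that penalises transversions more than transitions can be sketched as follows. The numeric scores (2, -1, -2) are hypothetical values chosen for the example, not taken from any published matrix, and ambiguity codes are deliberately left out.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=2, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair.

    Transitions stay within a chemical class (A<->G or C<->T);
    transversions swap a purine for a pyrimidine or vice versa.
    The default scores are illustrative assumptions.
    """
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


if __name__ == "__main__":
    for pair in ("AA", "AG", "CT", "AT", "CG"):
        print(pair, nucleotide_score(pair[0], pair[1]))
```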
17
Alignment parameters: similarity matrices
For proteins, a wide variety of matrices
exist: Identity, PAM, Blosum, Gonnet etc.
Matrices are generally constructed by observing
the mutations in large sets of alignments, either
sequence-based or structure-based.
Matrices range from strict ones for comparing
closely related sequences to soft ones for very
divergent sequences.
e.g. PAM250 corresponds to an evolutionary
distance of 250 PAM, or approximately 80% residue
divergence; PAM1 corresponds to less than
1% divergence
18
Alignment parameters: similarity matrices
A single best matrix does not exist!

Altschul (1991) suggests PAM250 for related
sequences, PAM120 when the sequences are not
known to be related and PAM40 to search for short
segments of highly similar sequences.
Henikoff & Henikoff (1993) suggest Blosum62 as a
good all-round matrix, Blosum45 for more
divergent sequences and Blosum100 for strongly
related sequences.
ClustalW automatically selects a suitable
matrix depending on the observed pairwise
identity (ID). By default:
ID > 35%: Gonnet 80
35% > ID > 25%: Gonnet 250
ID < 25%: Gonnet 350
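
The default thresholds listed above can be written as a small lookup. This sketch only mirrors the three ranges given on the slide. The function name is illustrative, and the handling of values exactly equal to 35 or 25 is an assumption, since the slide does not say which side they fall on.

```python
def select_gonnet_matrix(percent_identity: float) -> str:
    """Pick a Gonnet matrix from the observed pairwise percent identity.

    Thresholds follow the slide; boundary values (exactly 35 or 25)
    are assigned to the softer matrix here as an assumption.
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity > 35:
        return "Gonnet 80"
    if percent_identity > 25:
        return "Gonnet 250"
    return "Gonnet 350"


if __name__ == "__main__":
    for pid in (60, 30, 18):
        print(pid, select_gonnet_matrix(pid))
```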

19
Alignment parameters: gap penalties

A gap penalty is a cost for introducing gaps into
the alignment, corresponding to insertions or
deletions in the sequences.

SFGDLSNPGAVMG
HF-DLS-----HG

Proportional gap costs charge a fixed penalty for
each residue aligned with a gap, so the cost of a
gap is proportional to its length:

GAP_COST = uk, where k is the length of the gap

Linear or affine gap costs define a cost for
introducing or opening a gap, plus a
length-dependent extension cost:

GAP_COST = v + uk, where v is the gap opening cost
and u is the gap extension cost
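
The two gap-cost schemes can be compared on the example sequence from this slide. In the sketch below, v = 10 and u = 1 are hypothetical values picked for the arithmetic, not program defaults, and the helper names are illustrative.

```python
import re


def proportional_gap_cost(k: int, u: float) -> float:
    """GAP_COST = u * k for a gap of length k."""
    return u * k


def affine_gap_cost(k: int, v: float, u: float) -> float:
    """GAP_COST = v + u * k: opening cost v plus extension cost u per position."""
    return v + u * k


def gap_lengths(aligned_seq: str, gap_char: str = "-") -> list:
    """Lengths of the consecutive gap runs in one aligned sequence."""
    return [len(run) for run in re.findall(re.escape(gap_char) + "+", aligned_seq)]


if __name__ == "__main__":
    lengths = gap_lengths("HF-DLS-----HG")  # runs of length 1 and 5
    v, u = 10, 1  # hypothetical opening and extension costs
    print(sum(proportional_gap_cost(k, u) for k in lengths))
    print(sum(affine_gap_cost(k, v, u) for k in lengths))
```

With a proportional cost, one gap of length 6 costs the same as the two gaps above. The affine form adds v once per gap, so it favours fewer, longer gaps.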
20
Alignment parameters: gap penalties

ClustalW uses position-specific gap penalties to
make gaps more or less likely at different
positions in the alignment:

Gap penalties are lowered at existing gaps and
increased near to existing gaps
Gap penalties are lowered in hydrophilic
stretches
Otherwise, gap opening penalties are modified
according to the observed relative frequencies
of residues adjacent to gaps (Pascarella & Argos, 1992)

The goal is to introduce gaps in sequence segments
corresponding to flexible regions of the protein
structure
21
Multiple Alignment Construction

Optimal multiple alignment
MSA, OMA
Progressive multiple alignment
ClustalW, ClustalX
Iterative multiple alignment
PRRP (Gotoh, 1993)
SAGA (Notredame et al. NAR. 1996)
DIALIGN (Morgenstern et al. 1999)
HMMER (Eddy 1998), SAM (Karplus et al. 2001)

22
Iterative refinement
PRRP (Gotoh, 1993) refines an initial progressive
multiple alignment by iteratively dividing the
alignment into 2 profiles and realigning them.
[Flowchart labels: initial alignment (global progressive); divide
sequences into 2 groups (profile 1, profile 2); pairwise profile
alignment; refined alignment; "no" branch]
23
Genetic Algorithms
SAGA (Notredame et al. 1996) evolves a population
of alignments in a quasi-evolutionary manner,
iteratively improving the fitness of the
population
24
Segment-to-segment alignment
Dialign (Morgenstern et al. 1996) compares
segments of sequences instead of single residues
1. construct dot-plots of all possible pairs of
sequences (sequence i against sequence j)
2. find a maximal set of consistent diagonals in
all the sequences

.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..

3. Local alignment - residues between the
diagonals are not aligned
25
Multiple alignment methods
Progressive:
Guide tree   Global               Local
SB           multal               SBpima
NJ           clustalx
UPGMA        multalign, pileup
ML                                MLpima

Iterative:
prrp
saga (Genetic Algo.)
hmmt (HMM)
dialign
26
League Table based on BAliBASE benchmark database
Comparison of programs
Reference 1: < 6 sequences
Reference 2: a family with an orphan
Reference 3: several sub-families
Reference 4: long N/C terminal extensions
Reference 5: long insertions

[Table not recoverable: labels include < 100 residues, > 400 residues
and All; row groups include GLOBAL, LOCAL and iterative; several cells
read N/A]

Iterative algorithms can improve alignment
quality, but can be slow

Global algorithms work well when sequences are
homologous over their full lengths, local
algorithms are better for non-colinear sequences

Thompson et al. 1999
27
Multiple Alignment Construction

Optimal multiple alignment
MSA, OMA
Progressive multiple alignment
ClustalW, ClustalX
Iterative multiple alignment
PRRP, SAGA, DIALIGN, HMMER, SAM
Co-operative multiple alignment
T-COFFEE (Notredame et al. 2000)
http://igs-server.cnrs-mrs.fr/Tcoffee/
DbClustal (Thompson et al. 2000)
http://www-igbmc.u-strasbg.fr/BioInfo/
MAFFT (Katoh et al. 2002)
http://www.biophys.kyoto-u.ac.jp/katoh/programs/align/mafft/
MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
Probcons (Do et al. 2005)
Kalign (Lassmann et al. 2005)

28
DbClustal
http://bips.u-strasbg.fr/PipeAlign/
[Figure: Blast database search with a query sequence; database hits
covering Domain A, Domain B and Domain C]
29
Comparison of ClustalW and DbClustal
[Figure panels: ClustalW, DbClustal]
30
MAFFT

Local homologous segments detected using a Fast
Fourier Transform
Pairwise alignments are performed using
restricted global dynamic programming
Multiple alignment is built up using a
progressive algorithm, similar to ClustalW
Multiple alignment is then iteratively refined
by dividing alignment into 2 parts and realigning

31
MAFFT
Pairwise alignments
1. Fast Fourier Transform to detect local
conserved segments
[Plot: correlation c(k) against the lag k]
2. Segment Level Dynamic Programming to select
consistent segments
3. Fix residues at the centre of each segment
pair and realign between fixed points (white
regions only)
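
The quantity plotted as c(k) can be illustrated with a direct computation. The sketch below evaluates the correlation as a plain sum for each lag k, which is the calculation that a Fast Fourier Transform speeds up; no FFT is actually performed here. As a stand-in numeric encoding it uses Kyte-Doolittle hydropathy values, which is an assumption made for the example and not the property scales used by MAFFT itself.

```python
# Kyte-Doolittle hydropathy values: a stand-in numeric encoding
# (an assumption for this sketch, not MAFFT's own property scales).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def correlation(seq1: str, seq2: str, k: int) -> float:
    """c(k) = sum over n of v1(n) * v2(n + k), where both positions exist."""
    total = 0.0
    for n, residue in enumerate(seq1):
        m = n + k
        if 0 <= m < len(seq2):
            total += HYDROPATHY[residue] * HYDROPATHY[seq2[m]]
    return total


def best_lag(seq1: str, seq2: str) -> int:
    """Lag k with the highest correlation: a candidate conserved segment offset."""
    lags = range(-(len(seq1) - 1), len(seq2))
    return max(lags, key=lambda k: correlation(seq1, seq2, k))


if __name__ == "__main__":
    # The second toy sequence is the first one shifted right by two residues.
    print(best_lag("LKVAI", "GGLKVAI"))
```

A peak in c(k) only says that some segments match at offset k; locating those segments and checking that they are mutually consistent is the job of steps 2 and 3.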
32
State-of-the-art

Co-operative algorithms have led to significant
improvements, but none of the methods currently
available is capable of producing high-quality
alignments for all test cases.

BAliBASE 3 reference sets:
Ref 11: < 20% ID
Ref 12: 20-40% ID
Ref 2: orphan
Ref 3: sub-families
Ref 4: extensions
Ref 5: insertions

Thompson et al. 2005, 2006
33
RNA alignment methods

Comparison using BRAliBASE RNA structure
alignments (Gardner et al, 2005):

Above 60% identity, sequence- and structure-based
approaches have similar scores.
At lower identity, algorithms incorporating
structural information outperform pure sequence
methods. However, these algorithms are
computationally demanding, which severely limits
their use in practice.

Some more recent methods:
Sequence: R-Coffee (Wilm, 2008), MAFFT (Katoh,
2008)
Structure: LARA (Bauer, 2007), FoldalignM
(Torarinsson, 2007), SCARNA (Tabei, 2008)

34
DNA alignment methods

Complete genomes:
Local alignments (BlastZ, MultiZ, MUMmer, ...)
Global alignments (MGA, Multi-LAGAN, MAVID,
MAUVE, MAP2, Mulan, ...)

Reviewed in Dewey and Pachter, Human Molecular
Genetics, 2006
36
Multiple alignment analysis

Are the sequences correctly aligned?
Quality analysis: alignment objective functions
(SP, NorMD)
Error detection and correction (RASCAL, Refiner)
Are the sequences in the alignment homologous?
Conserved/homologous regions (MCOFFEE, LEON)
Conserved (functional) residues

37
Objective functions
Sum-of-pairs (Carrillo, Lipman, 1988): sum of
scores for all pairs of sequences

Blosum62 scores: N-N = 6, N-C = -3, C-C = 9

Sequence 1  N N N
Sequence 2  N N N
Sequence 3  N N C
Sequence 4  N C C

Seq1-2: 3 pairs N-N: 3x6 = 18
Seq1-3: 2 pairs N-N, 1 pair N-C: 2x6 + (-3) = 9
Seq1-4: 1 pair N-N, 2 pairs N-C: 6 + 2x(-3) = 0
Seq2-3: 2 pairs N-N, 1 pair N-C: 2x6 + (-3) = 9
Seq2-4: 1 pair N-N, 2 pairs N-C: 6 + 2x(-3) = 0
Seq3-4: 1 pair N-N, 1 pair N-C, 1 pair C-C: 6 + (-3) + 9 = 12
Total: 48
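
The worked example above can be reproduced with a short sum-of-pairs sketch. It uses only the three Blosum62 values quoted on the slide and assumes an alignment without gaps. How gaps should be scored is left open here, and the function name is illustrative.

```python
from itertools import combinations

# Only the Blosum62 entries quoted on the slide.
SCORES = {
    ("N", "N"): 6,
    ("N", "C"): -3,
    ("C", "N"): -3,
    ("C", "C"): 9,
}


def sum_of_pairs(alignment, scores):
    """Sum the column-wise scores over every pair of aligned sequences.

    Assumes equal-length, gap-free sequences.
    """
    total = 0
    for seq_a, seq_b in combinations(alignment, 2):
        total += sum(scores[(a, b)] for a, b in zip(seq_a, seq_b))
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["NNN", "NNN", "NNC", "NCC"], SCORES))  # 48
```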

Information content (Hertz et al, 1999)
Entropy column scores (between 0 and 1), summed over
all columns in the alignment
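
One common way to obtain an entropy-based column score between 0 and 1 is to normalise the Shannon entropy of each column by its maximum possible value. The sketch below does that and sums the column scores. It is a generic illustration under stated assumptions (gaps ignored, alphabet size of 20 by default) and is not claimed to be the exact formulation of Hertz et al.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20, gap_chars="-."):
    """Return 1 - H/H_max for one alignment column (1 = fully conserved).

    Gap characters are ignored; an all-gap column scores 0 (assumption).
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(alphabet_size)


def alignment_conservation(alignment, alphabet_size=20):
    """Sum of the column scores over all columns of the alignment."""
    return sum(
        column_conservation(col, alphabet_size) for col in zip(*alignment)
    )


if __name__ == "__main__":
    print(alignment_conservation(["NNN", "NNN", "NNC", "NCC"]))
```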

norMD (Thompson et al, 2001)
Column scores, normalised for the sequence set to
be aligned (number, length, similarity)
< 0.3: bad alignment
0.3-0.7: some local errors
> 0.7: good alignment

38
Objective functions: NorMD
[Figure panels: window length 8, window length 40]
39
Error detection and correction

RASCAL (Thompson et al, 2003), Refiner
(Chakrabarti et al, 2006)

RASCAL
40
Error detection and correction

RASCAL, errors within core blocks

metalloprotease
41
Error detection and correction

RASCAL, errors between core blocks

methyltransferase
42
Homology detection methods

Sequence percent identity:
> 30% identity => sequences are homologous
15-30% identity => twilight zone
Local analysis of positional conservation:
AL2CO (Pei, Grishin, 2001), SEGID
(Wang, Zu, 2003), NorMD
Conserved regions:
LEON (Thompson et al, 2004), MCOFFEE (Moretti et
al, 2007)

43
Homology analysis with LEON

Vertical analysis: sequence clustering,
intermediate sequences
Horizontal analysis: residue conservation,
motif context information
Composition analysis: prediction of
compositionally biased segments

Homologous regions are delineated
Removal of sequences non-homologous to query

44
Homology analysis with LEON
Query sequence: DKK1_HUMAN
BlastP results:

DKK1_HUMAN Dickkopf related protein-1 precursor 1e-151
DKK3_MOUSE Dickkopf related protein-3 precursor 8e-07
TXCA_CAEEX Neurotoxic peptide caeron precursor. 0.007
PRK1_RAT Prokineticin 1 precursor 0.021
VPRA_DENPO Intestinal toxin 1 _MIT 0.10
Q8BKK7 MEGF11 protein. 0.10
COL_RABIT Colipase precursor. 0.13
PRK2_HUMAN Prokineticin 2 precursor 0.17
Q7XZ34 Growth factor _Fragment_. 0.17
1imt_ VENOM. MAMBA INTESTINAL TOXIN 1, 0.23
Q863H5 Bv8/prokineticin 2-like protein. 0.30
VE6_RHPV1 E6 protein. 1.1
COL_CANFA Colipase precursor. 3.3
Q9Y7V5 Conidiospore surface protein. 3.3
COLA_HORSE Procolipase A precursor _Fragment_. 4.3
O00508 Latent TGF-beta binding protein-4. 5.6
1pco_ LIPASE PROTEIN COFACTOR. 7.3
Q8SRF4 GTP binding protein. 7.3
NTC1_MOUSE Neurogenic locus notch homolog 9.6
45
Homology analysis with LEON
[Figure labels: dkk1, dkk2, dkk3, Prokineticin/Intestinal toxin,
Lipase protein cofactor]
46
Structural proteomics: target characterisation
Detection of structural homologs for targets in
the SPINE (Structural Proteomics in Europe)
project
47
Conserved residue analysis

Active site residues are under evolutionary
pressure to maintain their functional integrity
and undergo fewer mutations than less
functionally important amino acids
Methods:
Evolutionary trace (Lichtarge et al, 1996):
sequence conservation patterns in homologous
proteins are mapped onto the protein surface to
generate clusters identifying functional
interfaces

48
Conserved residue analysis

Comparison of sequence-based methods

FRcons combines information on:
conservation at each site
amino acid distribution
predicted secondary structure (ss)
predicted relative solvent accessibility (rsa)

FRcons: Fischer et al., Bioinformatics 2008
49
OrdAli: Ordered Alignment Analysis
Color scheme:

Residues conserved in all sequences in the family:
structural or functional importance,
characteristic motifs
Residues conserved within a sub-group of
sequences: discriminant residues

50
Schematic alignment of aspartyl-tRNA synthetases

Universal proteins that play a key role in translation

[Figure: schematic alignment along residue positions 180-560 and
690-930, showing the anticodon binding domain, motif I, flipping loop,
motif II, insertion domain, catalytic core I, motif III and catalytic
core II, with conserved residue labels (P, L, Q, PQ, KQ, R, G, H).
Legend: family conserved; Archaea and Bacteria; Archaea and Eukaryote;
Euc; Arc; Bac]
51
PipeAlign: automatic protein analysis
http://www-igbmc.u-strasbg.fr/PipeAlign/
53
Multiple sequence alignment editors
No automatic method is 100% reliable - manual
verification and refinement is essential!
SeqLab: GCG Wisconsin Package
SeaView (Galtier et al, 1996)
http://pbil.univ-lyon1.fr/software/seaview.html
UNIX/Linux, Windows 95, MAC OS 8,9,X
WEB servers:
GeneAlign (Kurukawa)
http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
Jalview (Clamp, 1998)
http://www.ebi.ac.uk/michele/jalview/
CINEMA (Lord et al, 2002)
http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
55
Central role of multiple alignments
domain structure
conserved, functional sites
56
Central role of multiple alignments
Multiple alignment
57
Example: protein-RNA complexes
[Figure labels: ASP tRNA; ASP tRNA synthetase; aspRS-tRNA interactions]
Ruff et al, 1991
58
Example: Bardet-Biedl Syndrome
Identification of new genes responsible for BBS,
a rare autosomal recessive genetic disease,
probably caused by a defect at the basal body of
ciliated cells.
Phenotypes: obesity, retinopathy, polydactyly,
mental retardation, hypogonadism, renal failure
9 genes are known to be involved: BBS1 to BBS9
In a comparative genomics study, Li et al. (2004)
identified 688 genes implicated in cilia and
flagella.
The BBS10 gene shows a high frequency of mutation
(20% of patients).