Scoring multiple sequence alignments

1
Multiple Sequence Alignment (MSA)
2
Plan
  • Introduction to sequence alignments
  • Multiple alignment construction
  • Traditional approaches
  • Alignment parameters
  • Alternative approaches
  • Multiple alignment main applications
  • MACSIMS: Multiple Alignment of Complete
    Sequences Information Management System

3
Local alignment / Global alignment
Sequence A
Sequence B
Optimal global pairwise alignment: Needleman and
Wunsch, 1970
Optimal local pairwise alignment: Smith and
Waterman, 1981
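As a rough illustration of the difference between the two methods, the sketch below computes the optimal global (Needleman-Wunsch) or local (Smith-Waterman) score by dynamic programming. The function name `alignment_score` and the simple match/mismatch/gap values are assumptions made for this example only; real programs use residue similarity matrices and more elaborate gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) score.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

With two sequences that share only a central segment, the local score stays high while the global score is pulled down by the unmatched ends.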
4
Pairwise alignment / Multiple alignment
5
What is a multiple alignment?
A representation of a set of sequences, in which
equivalent residues (e.g. functional or
structural) are aligned in columns
Conserved residues
Conservation profile
Secondary structure
6
MACS
  • Schematic overview of complete alignment
  • e.g. domain organisation (Interpro)

Key: CH, SH3, PI-PLC-X, SH2, PI-PLC-Y, rhoGEF, DAG_PE-bind, PH, C2
7
Why multiple alignments?
Integration of a sequence in the context of the
protein family
  • Applications
  • phylogeny
  • domain organisation
  • functional residue identification
  • 2D/3D structure prediction
  • transmembrane prediction

8
MSA Construction
9
Multiple alignment construction
  • Traditional approaches
      - Optimal multiple alignment
      - Progressive multiple alignment
  • Alignment parameters
      - Residue similarity matrices
      - Gap penalties
  • Alternative approaches
      - Iterative alignment methods
      - Combinatorial algorithms
  • PipeAlign: a protein family analysis tool

10
Traditional Approaches
11
Optimal multiple alignment
The direct extension of pairwise dynamic programming to N dimensions
(Sankoff, 1975): examine all possible alignments to find the optimal
alignment.
Example: alignment of 3 sequences
Problems:
  • The mathematically optimal alignment is not necessarily the
    biologically optimal alignment
  • CPU time and memory requirements are prohibitive for practical
    purposes (the required time is proportional to N^k for k sequences
    of length N), so the method is limited to <10 sequences
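To give a feel for how quickly the cost grows with the number of sequences, the sketch below counts the cells of the dynamic programming lattice for k sequences of length N, i.e. (N+1)^k. The helper name `dp_cells` and the length of 300 residues are assumptions chosen for illustration.

```python
def dp_cells(length, n_sequences):
    """Number of cells in the k-dimensional dynamic programming lattice."""
    return (length + 1) ** n_sequences


# Illustrative values only: sequences of 300 residues.
for k in (2, 3, 5, 10):
    print(f"{k} sequences: {dp_cells(300, k):.3e} cells")
```

Each cell also has up to 2^k - 1 predecessor cells to examine, so the real cost grows even faster than the cell count suggests.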
12
Progressive multiple alignment
Heuristic algorithm which avoids calculating all
possible alignments, but does not guarantee an
optimal alignment
Principle: progressively align the sequences
(or sequence groups) in pairs
13
Progressive multiple alignment
Example: alignment of 7 globins (Hbb_human,
Hbb_horse, Hba_human, Hba_horse, Myg_phyca,
Glb5_petma and Lgb2_lupla)
Step 1: Pairwise alignment of all sequences
Ex: pairwise alignments of 2 globin sequences

Hbb_human 1  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hbb_horse 2  VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hbb_human 1  LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hba_human 3  LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hba_human 3  LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
Hbb_horse 2  LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

The alignment can be obtained with:
  • a global or a local method
  • dynamic programming or heuristic methods
Example: in ClustalX, the alignments are global, with a choice between:
  • a heuristic method (as used in the Fasta program): faster
  • dynamic programming (Smith-Waterman): better
14
Progressive multiple alignment
Step 2: Distance matrix construction
In ClustalX:
distance between 2 sequences = 1 - (number of identical residues / number of compared residues)
Ex: Hbb_human vs Hbb_horse: 83% identity = 0.17 distance

      1    2    3    4    5    6    7
1     -
2    .17   -
3    .59  .60   -
4    .59  .59  .13   -
5    .77  .77  .75  .75   -
6    .81  .82  .73  .74  .80   -
7    .87  .86  .86  .88  .93  .90   -
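The distance defined on this slide can be sketched as a small function over two already aligned sequences. The name `identity_distance` is hypothetical, and skipping every column that contains a gap is an assumption about what "compared residues" means here.

```python
def identity_distance(aligned_a, aligned_b, gap_chars="-."):
    """Return 1 - (identical residues / compared residues) for an aligned pair."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x in gap_chars or y in gap_chars:
            continue  # assumption: columns containing a gap are not compared
        compared += 1
        identical += x == y
    if compared == 0:
        return 1.0  # nothing to compare: treat as maximally distant
    return 1.0 - identical / compared
```

For instance, two aligned fragments that agree at 3 of 4 compared positions are at distance 0.25.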
15
Progressive multiple alignment
Step 3: Sequential branching / Guide tree
construction
[Figures: sequential branching order and guide tree for the 7 globins
(Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma,
Lgb2_lupla)]
  • Join the 2 closest sequences
  • Recalculate distances and join the 2 closest sequences or nodes
  • Repeat until all sequences are joined
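The joining procedure can be sketched with the globin distance matrix from Step 2. This is only an illustration under stated assumptions: it uses average linkage (UPGMA-style) to recalculate distances, whereas ClustalX builds its guide tree by Neighbor-Joining, and it assumes that matrix rows 1 to 7 follow the order in which the globins are listed in the example. The function name `upgma_order` is hypothetical.

```python
from itertools import combinations

names = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]
# Lower triangle of the distance matrix (assumed row order: see names above).
lower = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]
dist = {frozenset((names[i], names[j])): d
        for i, row in enumerate(lower) for j, d in enumerate(row)}


def upgma_order(names, dist):
    """Return the list of joins (cluster, cluster, average distance)."""
    clusters = [(name,) for name in names]
    merges = []

    def average(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Join the two closest sequences or nodes.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, round(average(c1, c2), 4)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = upgma_order(names, dist)
for c1, c2, d in merges:
    print(d, "+".join(c1), "|", "+".join(c2))
```

Under these assumptions the two alpha chains are joined first, then the two beta chains, and Lgb2_lupla is added last.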
16
Progressive multiple alignment
Step 4: Progressive alignment
The progressive multiple alignment follows the
branching order in the tree
[Figure: sequences and sequence groups aligned step by step along the
guide tree; labels: Hba_human, Hba_horse]
17
Progressive multiple alignment
[Figure labels: H1, H2, H3, H4, H5, H6, H7]
18
Progressive multiple alignment methods
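[Diagram: progressive methods classified as global or local and by
tree-building method (SB, NJ, UPGMA, ML); programs shown: multal,
SBpima, clustalx, multalign, pileup, MLpima]
SB: Sequential Branching
UPGMA: Unweighted Pair Group Method with Arithmetic Mean
ML: Maximum Likelihood
NJ: Neighbor-Joining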
19
Alignment Parameters
20
Residue similarity matrices
  • Dynamic programming methods score an alignment
    using residue similarity matrices, containing a
    score for matching all pairs of residues
  • For proteins, a wide variety of matrices exist
    Identity, PAM, Blosum, Gonnet etc.

  • Matrices are generally constructed by observing
    the mutations in large sets of alignments, either
    sequence-based or structure-based
  • Matrices range from strict ones for comparing
    closely related sequences to soft ones for very
    divergent sequences.

A single best matrix does not exist!!
ClustalW automatically selects a suitable matrix
depending on the observed pairwise identity.
22
Gap penalties
  • A gap penalty is a cost for introducing gaps into
    the alignment, corresponding to insertions or
    deletions in the sequences

SFGDLSNPGAVMG
HF-DLS-----HG
The goal is to introduce gaps in sequence segments
corresponding to flexible regions of the protein
structure
23
Alternative Approaches
24
Iterative alignment methods
  • Iterative alignment, e.g. PRRP (Gotoh, 1993)
      - refines an initial progressive multiple alignment by iteratively
        dividing the alignment into 2 profiles and realigning them
  • Genetic algorithms, e.g. SAGA (Notredame et al, 1996)
      - iteratively refine an alignment using genetic algorithms (evolve
        a population of alignments in a quasi-evolutionary manner)
  • Segment-to-segment alignment: DIALIGN (Morgenstern et al. 1999)
      - searches for locally conserved motifs in all sequences and
        compares segments of sequences instead of single residues
  • Hidden Markov Models, e.g. HMMER (Eddy, 1998), SAM (Karplus et al, 2001)
      - iteratively refine an alignment using HMMs

25
Multiple alignment methods
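[Diagram: multiple alignment methods classified as progressive or
iterative, and as global or local. Progressive methods are grouped by
tree-building method (SB, NJ, UPGMA, ML), with programs multal, SBpima,
clustalx, multalign, pileup and MLpima. Iterative methods: prrp;
genetic algorithm: saga; HMM: hmmt; also shown: dialign]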
26
BAliBASE objective evaluation of MACS programs
  • High-quality alignments based on 3D structural
    superpositions and manually verified
  • Alignments compared only in reliable core
    blocks, excluding non-superposable regions
  • Separate reference sets specifically designed to
    address distinct alignment problems

BAliBASE1: Thompson et al. 1999, Bioinformatics
BAliBASE2: Bahr et al, 2001, Nucl Acids Res.

Reference set   Description
1               small number of sequences: divergence, length
2               a family with 1 to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations
27
Comparison of multiple alignment methods
Reference alignments are needed to evaluate the
alignment programs
  • BAliBASE (Thompson et al. Bioinformatics. 1999)
    benchmark database
  • Alignments based on 3D structure superposition
  • Alignments must be compared for the superposable
    regions
  • Alignments take into account
      - the effect of the number of sequences
      - the effect of the sequence length
      - the effect of the sequence similarity
      - alignment of an orphan sequence with a sequence family
      - sub-family alignments
      - alignments of sequences with different lengths
        (insertions, extensions)

28
Comparison of multiple alignment methods
> 35% identity: any method
Local / global methods
  • Colinear sequences: global methods
  • N/C-terminal extensions or insertions: local methods

Progressive / iterative methods
  • Iterative algorithms usually improve alignment quality
  • Problems
      - can give a bad alignment in the case of orphan sequences
      - the iterative process can be very long!

Example: alignment of 89 histone sequences (66-92 residues)
  • ClustalW: 2 mins 41 secs
  • PRRP: 3 hours 40 mins
  • Dialign: 3 hours 48 mins
To increase the alignment quality, as many
sequences as possible have to be integrated!
29
DbClustal: local and global algorithm coupling
[Figure labels: Blast Database Search, Query Sequence, Database Hits,
Domain A, Domain B, Domain C]
30
ClustalW / DbClustal comparison
ClustalW
DbClustal
31
Combinatorial algorithms
  • T-Coffee (Notredame et al. 2000)
    http://igs-server.cnrs-mrs.fr/Tcoffee/
      - performs local and global alignments for all pairs of sequences,
        then combines them in a progressive multiple alignment, similar
        to ClustalW
  • DbClustal (Thompson et al. 2000)
    http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustalnoid
      - designed to align the sequences detected by a database search.
        Locally conserved motifs are detected using the Ballast program
        (Plewniak et al. 1999) and are used as anchor points in the
        global multiple alignment
  • MAFFT (Katoh et al. 2002)
    http://timpani.genome.ad.jp/%7Emafft/server
      - detects locally conserved segments using a Fast Fourier
        Transform, then uses a restricted global DP and a progressive
        algorithm
  • MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
      - kmer distances and log-expectation scores, progressive alignment
        and iterative refinement
  • PROBCONS (Do et al, 2005) http://probcons.stanford.edu
      - pairwise consistency based on an objective function

32
Multiple Alignment Quality
Truncated alignments

Program       Ref1 V1  Ref1 V2   Ref2     Ref3       Ref4        Ref5        Time
              (<20%)   (20-40%)  orphans  subgroups  extensions  insertions  (sec)
ClustalW1.83  0.42     0.78      0.42     0.52       0.41        0.38        902
Dialign2.2.1  0.31     0.71      0.37     0.39       0.45        0.43        5993
Mafft5.32     0.44     0.78      0.49     0.53       0.47        0.48        96
Maffti5.32    0.54     0.83      0.56     0.60       0.49        0.57        327
Muscle3.51    0.52     0.82      0.50     0.58       0.46        0.54        523
Muscle_fast   0.40     0.77      0.43     0.44       0.35        0.49        34
Muscle_med    0.45     0.80      0.50     0.59       0.44        0.51        219
Tcoffee2.66   0.47     0.84      0.50     0.64       0.54        0.58        216133
Probcons1.1   0.63     0.87      0.60     0.65       0.54        0.63        19035
1. Significant improvement in accuracy/efficiency
since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH
slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in
all tests, and is more efficient
muscle_fast: muscle -maxiters 1 -diags1 -sv -distance1 kbit20_3
muscle_medium: muscle -maxiters 2
33
Multiple Alignment Quality
Comparison: truncated (T) versus full-length (FL) sequences

              Ref1 V1      Ref1 V2      Ref2         Ref3         Time (sec)
              (<20%)       (20-40%)     orphans      subgroups    for all refs
Program       T     FL     T     FL     T     FL     T     FL     T       FL
ClustalW1.83  0.42  0.24   0.78  0.72   0.42  0.20   0.52  0.27   902     2227
Dialign2.2.1  0.31  0.26   0.71  0.70   0.37  0.29   0.39  0.31   5993    12595
Mafft5.32     0.44  0.25   0.78  0.75   0.49  0.35   0.53  0.38   96      312
Maffti5.32    0.54  0.35   0.83  0.80   0.56  0.40   0.60  0.50   327     1409
Muscle3.51    0.52  0.34   0.82  0.79   0.50  0.36   0.58  0.39   523     3608
Muscle_fast   0.40  0.28   0.77  0.72   0.43  0.29   0.44  0.33   34      132
Muscle_med    0.45  0.29   0.80  0.74   0.50  0.34   0.59  0.38   219     1601
Tcoffee2.66   0.47  0.35   0.84  0.82   0.50  0.40   0.64  0.49   216133  341578
Probcons1.1   0.63  0.43   0.87  0.86   0.60  0.41   0.65  0.54   19035   58488
  1. Loss of accuracy is more important in twilight
    zone (Ref1 V1, orphans, and subgroups)
  2. Probcons still scores best in all tests
  3. MAFFT still scores better than MUSCLE in all tests

34
Multiple alignment quality
Development of objective functions to estimate
multiple alignment quality
  • Sum-of-pairs (Carrillo, Lipman, 1988)
      - sums the scores of all pairs of sequences (based on a
        similarity matrix and gap penalty)
  • Relative entropy
      - uses a normalized log-likelihood ratio to measure the degree of
        conservation for each column (identical residues only)
  • MD (column scores used in ClustalX)
      - uses a comparison matrix (Gonnet) to take into account similar
        residues
  • norMD (Thompson et al, 2001)
      - scores by column using a substitution matrix and gap penalties
      - normalisation according to the sequences to align (their number,
        length and the similarity between them)
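As an illustration of the first objective function, a minimal sum-of-pairs score could look like the sketch below. The function name `sum_of_pairs` and the flat match/mismatch/gap values are assumptions standing in for a real similarity matrix and gap penalty scheme; gap-gap pairs are assumed to score zero.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of every column of an alignment.

    alignment: list of equal-length strings, '-' marking a gap.
    The scoring values are illustrative assumptions.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # assumption: gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

Because every pair of sequences contributes to every column, the score of a fully conserved column grows quadratically with the number of sequences, which is one reason normalised scores such as norMD were introduced.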

35
Evaluation of Objective Functions using BAliBase
36
Multiple sequence alignment editors
No automatic method is 100% reliable. Manual
verification and refinement is essential!
  • SeqLab: GCG Wisconsin Package
  • SeaView (Galtier et al, 1996)
    http://pbil.univ-lyon1.fr/software/seaview.html
Web servers:
  • GeneAlign (Kurukawa)
    http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
  • Jalview (Clamp, 1998)
    http://www.ebi.ac.uk/michele/jalview/
  • CINEMA (Lord et al, 2002)
    http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
37
FASTA format
>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
-----VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSST
LSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKC
DDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
-----CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKE
SSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVV
DDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKT
PPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD-
-----CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AIT
TGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYI
HEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEIT
VYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEH
EDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDA
WG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRL
GPTF-YKVVYYEDETK
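Reading an aligned FASTA file like the one above takes only a few lines. The sketch below is a minimal parser written for illustration (the name `parse_fasta` is hypothetical); it assumes that each record starts with a ">" header line whose first word is the sequence identifier, and that the sequence may be wrapped over several lines. The example input is an excerpt of the first two records of the slide.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence lines may be wrapped
    return {key: "".join(chunks) for key, chunks in records.items()}


example = """>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
"""
records = parse_fasta(example)
print(list(records))
```

In an aligned FASTA file all sequences, gaps included, should end up with the same length, which is a cheap sanity check after parsing.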
38
MSF format
toto.msf  MSF: 256  Type: P  May 24, 2005 19:34  Check: 3415 ..

 Name: O88763  Len: 256  Check: 9443  Weight: 1.00
 Name: Q9W1M7  Len: 256  Check: 1161  Weight: 1.00
 Name: Q7PMF0  Len: 256  Check: 8095  Weight: 1.00
 Name: Q9TXI7  Len: 256  Check: 4716  Weight: 1.00

//

        1                                                   50
O88763  ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7  .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0  .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7  MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

        51                                                 100
O88763  KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7  RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0  KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7  RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

        101                                                150
O88763  PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7  PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0  PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7  PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

        151                                                200
O88763  RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7  RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0  RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7  KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

        201                                                250
O88763  MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7  VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0  MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7  VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

        251
O88763  YE....
Q9W1M7  FE....
Q7PMF0  LE....
Q9TXI7  YEDETK
Multiple Sequence File
39
With an editor
40
PipeAlign: protein family analysis tool
http://bips.u-strasbg.fr/PipeAlign/
Plewniak et al, 2003
41
PipeAlign
42
MSA Main Applications
43
MSA central role in biology
MACS
44
MACS new landscape
High volume and heterogeneity of sequence data
  • Length: from tens of amino acids or nucleotides
    to thousands or millions (genomes)
  • Number: from tens up to thousands of sequences
  • Variability: from small percent identity to
    almost identical
  • Complexity of the sequences to be aligned
      - Family with linear or highly irregular distribution of
        sequence variability
      - Heterogeneity of length, structure or composition (large
        insertions or extensions, repeats, circular permutations,
        transmembrane regions)
  • Fidelity: from 15-30% errors (sequence,
    eucaryotic gene prediction, annotation)

45
MACS new concepts
Distinct objectives imply distinct needs and
strategies
  • Overview of one sequence family, to quickly infer
    and integrate information from a limited number
    of closely related, well annotated sequences
    (reliable and efficient)
  • Exhaustive analysis of one sequence family
    (very high quality) for
      - homology modeling
      - phylogenetic studies
      - subfamily-specific features (differentially conserved domains,
        regions or residues)
  • Massive analysis of sets of sequences
    (reliable/high quality and efficient)
      - phylogenetic distribution, co-presence and co-absence, and
        structural complex
      - genome annotation
      - target characterisation for functional genomics studies
        (transcriptomics)

46
Residue conservation identification
  • residues conserved in all sequences in family
  • structural or functional importance
    characteristic motifs
  • residues conserved within a sub-group of
    sequences
  • discriminant residues

47
Ordered Alignment analysis of TyrRS
[Figure: ordered alignment of TyrRS sequences grouped as Euc, Arc Euc
and Bac, showing Motif I, Motif II, the N-terminal and C-terminal
extensions, the S4 domain and the EMAP domain; scale bar: 10 aa]
50
Phylogenetic studies
Multiple alignments are the basis for:
  • calculating the levels of similarity between sequences
  • calculating the evolutionary distances between sequences
  • computing phylogenetic trees
Creating a high-quality phylogenetic tree
requires high-quality multiple sequence
alignments
51
Phylogenetic studies
Whole alignment
[Figure: phylogenetic tree computed from the whole alignment, with
groups labelled Eucarya, Archaea and Bacteria / Mitochondria; leaves
are species codes such as HOMO SAPIE, ESCHE COLI and METHA JANN]
52
Phylogenetic studies
N terminus global gap removal
[Figure: phylogenetic tree of the same species after gap removal, with
groups labelled Eukarya and Bacteria / Archaea / Mito.; scale bar: 0.1]
53
Schematic alignment of Aspartyl-tRNA synthetases
55
Protein sequence validation
Sequencing / frameshift error detection
Estimation: 44% of predicted proteins from genome
sequencing projects and 31% of high-throughput
cDNA (HTC) contain errors in their intron/exon
structure (Bianchetti et al, 2005).
Example: transcription TFIIH complex protein
56
Clustered MACS Starter
Multiple alignment of complete sequences
Determination of sequence groups
  • Hierarchical clustering of positions
    based on insertion/deletion
  • Definition of blocks
  • N-terminal region analysis
  • Reference position
  • Proposed N-terminus: potential start codon
    closest to the reference position

--------MXXXXXX-XXXXXX-------XXX
-------MXXXX-XXXXXXXXXX------XXX
MXXXXXXMXXXMXXXXX-XXXXX-XXXXXXXX
------MXXXXXXXXXXXXX-XX--XXXXXXX
---------MXXXXX-XXXXXXXXXXXXXXXX
[Schematic labels: extension, Reference position]

3000 proteins from B. subtilis with wrong,
randomly generated N-termini: 82% predicted.
For the 3828 proteins of the Vibrio cholerae
proteome: 817 specific / 1722 valid start
codons / 236 wrong (from 1 up to 56 aa)
57
Clustered MACS vAlid
Bianchetti et al. (2005) JBCB
58
Clustered MACS DbW
  • Databases
      - Proteins
      - Structures

Automatic update of more than 300 different
protein families: 24 AaRS (aminoacyl-tRNA
synthetases), nuclear receptors, ribosomal
proteins, transcription factors
Prigent et al. (2005) Bioinformatics
59
Clustered MACS GOAnno
GOAnno automatically finds a pertinent level and
propagates Gene Ontology terms to an unannotated target
protein according to the clustered MACS
Chalmel et al. (2005) Bioinformatics
60
Protein 3D structure prediction
Proteins with similar sequences tend to fold into
similar structures
  • Above 50% identity, pairwise alignment is
    enough for an accurate model
  • Below 50% identity, multiple alignment is better

Basic steps for comparative (homology) modelling:
  • Identify a template structure
  • Align the target sequence to the template
    sequence
  • Copy the backbone coordinates from the template to
    the matching residues in the target sequence
  • Build the side-chains (copied for identical
    residues, predicted for non-identical)
  • Model the loop regions
  • Optimise (energy refinement)

Applicable to 60% of proteins from fully
sequenced genomes
61
Protein functional characterisation
By homology: similar sequences generally share
similar structures and often have similar
functions
Propagation of information from a known sequence
to an unknown one, e.g. domains, active sites,
cellular localisation, post-transcriptional
modifications
1. Database search for homologues, e.g. BlastP, PSI-Blast
2. Domain databases, e.g. Interpro (EBI), CDD (NCBI)
3. Multiple alignment construction and analysis, e.g. PipeAlign
62
MSA applications Summary
[Figure: schematic alignment of two families (1st family: Bacteria;
2nd family: Bacteria, Archaea, Eucarya) annotated with: error in ORF
definition, additional domain, transmembrane region, phosphorylation
site, NLS, intra-group conservation, universal conservation,
differential conservation between the two families]
domain organization, structural motifs, key
functional residues, ORF definition, localization
signals, conservation pattern ...
  • Functional genomics
  • Mutagenesis experiments
  • Evolutionary studies
  • Structure modeling
  • Drug design
Lecompte et al, Gene, 2001
63
MACSIMS
64
MAO Multiple Alignment Ontology
http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html
MAO consortium
  • RNA analysis (Steve HOLBROOK, Berkeley)
  • MACS algorithm (Kazutake KATOH, Kyoto)
  • Protein 3D analysis (Patrice KOEHL, Davis)
  • Protein 3D structure (Dino MORAS, Strasbourg)
  • 3D RNA structure (Eric WESTHOF, Strasbourg)
Also available from the OBO web site
http://obo.sourceforge.net
Thompson et al. (2005) Nucleic Acids Res.
65
MACSIMS
  • Multiple Alignment of Complete Sequences
    Information Management System

Thompson et al, BMC Bioinformatics, 2006
  • Structural and functional information is mined
    automatically from the public databases
  • Homologous regions are identified in the MACS
  • Mined data is evaluated and cross-validated
  • Mined data is propagated from known to unknown
    sequences within the homologous regions
MACSIMS provides a unique environment that
facilitates knowledge extraction and the
presentation of the most pertinent information to
the biologist
66
MACSIMS
http://bips.u-strasbg.fr/MACSIMS/
67
MACSIMS
  • Schematic overview of complete alignment
  • e.g. domain organisation (Interpro)

Key: CH, SH3, PI-PLC-X, SH2, PI-PLC-Y, rhoGEF, C2, DAG_PE-bind, PH
68
MACSIMS visualisation
JalView II, Coll. G. Barton
69
MACSIMS
BAliBASE reference 3 aldehyde dehydrogenase-like








72
Summary
  • Choice of multiple alignment method
      - traditional progressive method (e.g. clustalw / clustalx)
      - combined local and global method (e.g. mafft, muscle, dbclustal)
      - knowledge-based method (e.g. PipeAlign)
  • Web server versus local installation?

WARNING: Automatic alignment methods can make
mistakes. Verify alignment quality by automatic
methods (e.g. norMD) and visual inspection!
  • Multiple alignment applications
      - Traditional applications: phylogeny, conserved residue / motif
        identification
      - Information in multiple alignments also improves accuracy in
        sequence error detection, structure prediction and functional
        annotation

73
Laboratory of Integrative Genomics and
Bioinformatics, IGBMC, Strasbourg
74
alternative algorithms
Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive
multiple alignment by iteratively dividing the
alignment into 2 profiles and realigning them.
75
alternative algorithms
Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a
population of alignments in a quasi evolutionary
manner, iteratively improving the fitness of the
population
population n
select a number of individuals to be parents
modify the parents by shuffling gaps, merging 2
alignments etc.
population n+1
evaluation of the fitness using an objective function (OF):
sum-of-pairs or COFFEE
END
76
alternative algorithms
HMM
  • Probabilistic model for sequence profiles,
    visualized as a finite state machine
  • For each column of the alignment a match state
    models the distribution of residues allowed
  • Insert and delete states at each column allow
    for insertion or deletion of one or more residues

Original profile HMM (Krogh et al, 1994)
[Figure: profile HMM whose match and insert states emit the residues
of the example alignment below]
AKY-L-D
--WVLED
77
Multiple Alignment using HMM
generate initial alignment (Baum-Welch
expectation maximization)
HMMER (Eddy, unpublished) SAM-T98 (Hughey, 1996)
produce a model
generate new alignment (Viterbi algorithm or
posterior decoding)
evaluate alignment (expectation maximization)
END
78
alternative algorithms
Segment-to-segment Alignment
Dialign (Morgenstern et al. 1996) compares
segments of sequences instead of single
residues
1. construct dot-plots of all possible
pairs of sequences
2. find a maximal set of consistent diagonals in
all the sequences

.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..

Local alignment: residues between the diagonals
are not aligned