Title: Integrating Biological Information In Multiple Sequence Alignments
1 Integrating Biological Information In
Multiple Sequence Alignments
- Confronting Bits and Pieces of Information
Cédric Notredame CNRS-Marseille,
France www.tcoffee.org
2 Mangel M., Samaniego F.J., Abraham Wald's Work
on Aircraft Survivability, J. American
Statistical Association, 79, 259-270 (1984)
3 What's in a Multiple Sequence Alignment (I)
Evolution Inertia: Common Ancestry Shows up In
the sequences
Selection: Important Features Are Preserved
Functional Constraint: Same Function, Same
Sequence (Convergence)
Phylogenetic Footprint, Evolutionary Trace
4 Why So Much Interest in Multiple Alignments?
Extrapolation
Structure Prediction
Motifs/Patterns
SNP Analysis
Profiles
Regulatory Elements
Phylogeny
Reactivity Analysis
5 What's in a Multiple Alignment (II)?
- The MSA contains what you put inside
- Structural Similarity
- Evolutionary Similarity
- Sequence Similarity
- You can view your MSA as
- A record of evolution
- A summary of a protein family
- A collection of experiments made for you by
Nature
6 Building and Using Models
35.67 Angstrom
7 Computing the Correct Alignment is a Complicated
Problem
8 Stochastic Optimization
9 Stochastic Optimization
- Exploration of Complex Optimization Problems With
Multiple Constraints
- Genomic Alignments
- RNA Alignments
- Generation of Population of Suboptimal Solutions
- Quality = f(optimality)
- Specification of Consistency Objective Function
of T-Coffee
10 Three Types of Algorithms
- Progressive: ClustalW
- Iterative: MUSCLE
- Consistency Based: T-Coffee and ProbCons
11 ClustalW: The Progressive Algorithm
12 T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
13 T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT    Prim. Weight 88
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT   Prim. Weight 77
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT    Prim. Weight 100
SeqD -------- THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT   Prim. Weight 100
SeqC GARFIELD THE VERY FAST CAT

SeqC GARFIELD THE VERY FAST CAT   Prim. Weight 100
SeqD -------- THE ---- FA-T CAT
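The primary weights attached to each pairwise alignment can be read as a percent identity. The sketch below is a minimal illustration, assuming the weight is the truncated percent identity over aligned residue pairs, with gap columns skipped and blanks treated as visual separators. The function name `primary_weight` is hypothetical. This convention reproduces the 88 and 100 values for the two pairs used in the usage lines; other values on the slide may rest on a different convention.

```python
def primary_weight(row_x: str, row_y: str) -> int:
    """Percent identity over aligned residue pairs of one pairwise alignment.

    Assumption: blanks are only visual separators, and columns with a gap
    ('-') in either row are ignored.
    """
    x = row_x.replace(" ", "")
    y = row_y.replace(" ", "")
    if len(x) != len(y):
        raise ValueError("rows of an alignment must have the same length")
    # Keep only columns where both sequences contribute a residue.
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    if not pairs:
        return 0
    matches = sum(1 for a, b in pairs if a == b)
    return 100 * matches // len(pairs)  # truncated, not rounded


w_ab = primary_weight("GARFIELD THE LAST FAT CAT", "GARFIELD THE FAST CAT ---")
w_ad = primary_weight("GARFIELD THE LAST FAT CAT", "-------- THE ---- FAT CAT")
print(w_ab, w_ad)  # 88 100
```

SeqA against SeqB has 16 identical residues out of 18 aligned pairs, hence 88; SeqA against SeqD has 9 out of 9, hence 100.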
14 T-Coffee and Consistency
15 T-Coffee and Consistency
16 T-Coffee and Consistency
17 T-Coffee and Consistency
18 T-Coffee and Consistency
19 T-Coffee and Consistency
- Each Library Line is a Soft Constraint (a wish)
- You can't satisfy them all
- You must satisfy as many as possible (The easy
ones)
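One way to picture how library lines reinforce or contradict each other is the triplet extension step of consistency-based scoring. The sketch below assumes each library line is a weighted residue pair, and that the extended weight of a pair is its direct weight plus, for every residue of a third sequence aligned to both, the minimum of the two connecting weights. The data layout and the name `extend_library` are hypothetical; the three weights are borrowed from the primary library example purely for illustration.

```python
from collections import defaultdict


def extend_library(primary):
    """Extend a primary library {(residue_a, residue_b): weight}.

    A residue is a (sequence name, index) tuple. Each pair receives its
    direct weight plus min(w1, w2) for every intermediate residue that is
    paired with both of its members.
    """
    neighbours = defaultdict(dict)
    for (a, b), w in primary.items():
        neighbours[a][b] = w
        neighbours[b][a] = w

    extended = defaultdict(int)
    for (a, b), w in primary.items():
        extended[tuple(sorted((a, b)))] += w  # direct support

    for c, partners in neighbours.items():
        items = sorted(partners.items())
        for i, (a, w_ac) in enumerate(items):
            for b, w_cb in items[i + 1:]:
                if a[0] == b[0]:
                    continue  # residues of one sequence are never paired
                # c links a and b: add the weaker of the two connections
                extended[tuple(sorted((a, b)))] += min(w_ac, w_cb)
    return dict(extended)


primary = {
    (("SeqA", 0), ("SeqB", 0)): 88,
    (("SeqA", 0), ("SeqC", 0)): 77,
    (("SeqB", 0), ("SeqC", 0)): 100,
}
extended = extend_library(primary)
print(extended[(("SeqA", 0), ("SeqB", 0))])  # 165 = 88 direct + min(77, 100) via SeqC
```

Pairs supported by many intermediates accumulate weight and become the "easy" constraints that the final alignment satisfies first.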
20 T-Coffee and Consistency
21 Consistency Based Algorithms: T-Coffee
- Gotoh (1990)
- Iterative strategy using consistency
- Martin Vingron (1991)
- Dot Matrix Multiplication
- Accurate but too stringent
- Dialign (1996, Morgenstern)
- Consistency
- Agglomerative Assembly
- T-Coffee (2000, Notredame)
- Consistency
- Progressive algorithm
22 How Good Is My Method?
23 Structures vs Sequences
24 Validation Using BAliBASE
25 Too Many Methods for ONE Alignment: M-Coffee
27 Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
28 Comparing Methods
MAFFT
31 Estimating the Accuracy of your MSA
32 What To Do Without Structures
33 Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
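Agreement between methods can be quantified per residue pair. The sketch below is a minimal illustration, not M-Coffee's actual implementation: it assumes each method's output is a dictionary of equal-length gapped strings over the same sequences, and reports for every aligned residue pair the fraction of methods that contain it. The names `aligned_pairs` and `pair_support` and the toy alignments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Set of residue pairs ((name, i), (name, j)) sharing a column."""
    names = sorted(msa)
    position = {name: 0 for name in names}  # next residue index per sequence
    pairs = set()
    for col in range(len(msa[names[0]])):
        residues = []
        for name in names:
            if msa[name][col] != "-":
                residues.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def pair_support(msas):
    """Fraction of the input alignments that contain each residue pair."""
    counts = Counter()
    for msa in msas:
        counts.update(aligned_pairs(msa))
    return {pair: n / len(msas) for pair, n in counts.items()}


method_1 = {"s1": "ACGT", "s2": "A-GT"}
method_2 = {"s1": "ACGT", "s2": "AG-T"}
support = pair_support([method_1, method_2])
for pair, fraction in sorted(support.items()):
    print(pair, fraction)
```

Pairs with support close to 1 sit in regions where most methods agree; pairs with low support mark regions where the alignment deserves less trust.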
34 What To Do Without Structures
35 When Sequences Are Not Enough: 3D-Coffee and
Expresso
36 3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
37 3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
38 Expresso: Finding the Right Structure
Why Not Use Structure Based Alignments?
39 Expresso: Finding the Right Structure
[Diagram labels: Sources; BLAST; SAP; Templates; Template Alignment; Source Template Alignment; Library; Remove Templates]
40 3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
41 Template Based Multiple Sequence Alignments
42 Template Based Multiple Sequence Alignments
[Diagram labels: Sources; Template Aligner (-Structure -Profile -); Templates; Template Alignment; Source Template Alignment; Library; Remove Templates]
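The template-based scheme named on these slides (sources, templates, template alignment, source-template alignment, library, template removal) can be sketched as a composition of residue mappings. This is a simplified illustration under stated assumptions, not the Expresso implementation: every alignment is given as two gapped strings, and the function names `residue_map` and `project` are hypothetical.

```python
def residue_map(gapped_x, gapped_y):
    """Map residue indices of x to residue indices of y for aligned columns."""
    mapping = {}
    i = j = 0
    for cx, cy in zip(gapped_x, gapped_y):
        if cx != "-" and cy != "-":
            mapping[i] = j
        if cx != "-":
            i += 1
        if cy != "-":
            j += 1
    return mapping


def project(src1_to_tpl1, tpl1_to_tpl2, src2_to_tpl2):
    """Compose source1 -> template1 -> template2 -> source2.

    The result holds source-to-source residue pairs only: the templates
    have been removed and the pairs can be added to a library.
    """
    tpl2_to_src2 = {t: s for s, t in src2_to_tpl2.items()}
    pairs = []
    for s1, t1 in sorted(src1_to_tpl1.items()):
        t2 = tpl1_to_tpl2.get(t1)
        s2 = tpl2_to_src2.get(t2)
        if s2 is not None:
            pairs.append((s1, s2))
    return pairs


# Toy data: each source matches its own template residue by residue,
# and the two templates are aligned with one gap.
src1_tpl1 = residue_map("ACDE", "ACDE")
tpl1_tpl2 = residue_map("ACDE", "A-DE")
src2_tpl2 = residue_map("ADE", "ADE")
library_pairs = project(src1_tpl1, tpl1_tpl2, src2_tpl2)
print(library_pairs)  # [(0, 0), (2, 1), (3, 2)]
```

The design point is that the template alignment can come from any aligner suited to the template type (a structural aligner for structures, a profile aligner for profiles) while the final library still refers only to the original sequences.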
43 Method      Score        Templates   Prefab  Homstrad
   ------------------------------------------------------
   ClustalW    Matrix       ----        61.80   ----
   Kalign      Matrix       ----        63.00   ----
   MUSCLE      Matrix       ----        68.00   45.0
   ------------------------------------------------------
   T-Coffee    Consistency  ----        69.97   44.0
   ProbCons    Consistency  ----        70.54   ----
   Mafft       Consistency  ----        72.20   ----
   M-Coffee    Consistency  ----        72.91   ----
   MUMMALS     Consistency  ----        73.10   ----
   ------------------------------------------------------
   Clustal-db  Matrix       Profiles    ----    ----
   PRALINE     Matrix       Profiles    ----    50.2
   PROMALS     Consistency  Profiles    79.00   ----
   SPEM        Matrix       Profiles    77.00   ----
   ------------------------------------------------------
   EXPRESSO    Consistency  Structures  ----    71.9
   T-Lara      Consistency  Structures  ----    ----
   ------------------------------------------------------
Table 1. Summary of all the methods
described in the review. Validation figures were
compiled from several sources and selected for
their compatibility. Prefab refers to
validation made on Prefab Version 3. The HOMSTRAD
validation was made on datasets having less than
30% identity. The source of each figure is
indicated by a reference. The EXPRESSO figure
comes from a slightly more demanding subset of
HOMSTRAD (HOM39) made of sequences less than 25%
identical.
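Benchmark figures of this kind are usually computed by comparing a method's alignment with a reference alignment. The sketch below assumes a pair-based measure, the percentage of residue pairs of the reference that the tested alignment recovers; the table does not state which measure each benchmark used, so this is an illustration only, and the names `residue_pairs` and `q_score` are hypothetical.

```python
from itertools import combinations


def residue_pairs(msa):
    """Residue pairs ((name, i), (name, j)) that share a column."""
    names = sorted(msa)
    seen = {name: 0 for name in names}  # residues consumed per sequence
    pairs = set()
    for column in zip(*(msa[name] for name in names)):
        present = []
        for name, char in zip(names, column):
            if char != "-":
                present.append((name, seen[name]))
                seen[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def q_score(test, reference):
    """Percentage of reference residue pairs also found in the test MSA."""
    ref = residue_pairs(reference)
    if not ref:
        return 0.0
    return 100.0 * len(ref & residue_pairs(test)) / len(ref)


reference = {"s1": "ACGT", "s2": "A-GT"}
test_msa = {"s1": "ACGT", "s2": "AG-T"}
score = q_score(test_msa, reference)
print(round(score, 2))  # 66.67
```

In the toy case the reference holds three residue pairs and the tested alignment recovers two of them.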
44 Improving The Evaluation
45 How Do We Perform In The Twilight Zone?
- Consistency Based Methods Have an Edge
- Hard to Tell Methods Apart
- Sequence Alignment is NOT solved
46 More Than Structure Based Alignments
- Structural Correctness Is Only the Easy Side of
the Coin.
- In practice MSAs are intermediate models used to
generate other models
47 Conclusion
- Template based Multiple Sequence Alignments
- Projecting any relevant information onto the
sequences
- Using this Information
- Need for new evaluation procedures
- Functional Analysis
- Phylogenetic Analysis
- Homology Search (Profiles)
- Homology Modelling
- Integrating data: making sure your bits of data
can fight with one another
48
- Fabrice Armougom (CNRS, FR)
- Sebastien Moretti (CNRS, FR)
- Olivier Poirot (CNRS, FR)
- Frederic Reinier (CRS4, IT)
- Karsten Suhre (CNRS, FR)
- Vladimir Saudek (Sanofi-Aventis, FR)
- Des Higgins (UCD, IE)
- Orla O'Sullivan (UCD, IE)
- Iain Wallace (UCD, IE)
- Victor Jongeneel (SIB/VitalIT, CH)
- Bruno Nyfler (VitalIT, CH)
- Roger Hersch (EPFL, CH)
- Pierre Dumas (EPFL, CH)
- Basile Schaeli (EPFL, CH)
www.tcoffee.org cedric.notredame_at_europe.com
51 Turning Data into Models
- Data
- Columbus considered that the landmass occupied
225°, leaving only 135° of water (Marinus of
Tyre, 70 AD).
- Columbus believed that 1° represented only 56
miles (Alfraganus, IXth century)
- He knew there was an island named Japan off the
coast of China
- Model
- Circumference of the Earth as 25,255 km at most,
- Canary Islands to Japan 3,700 km (Reality
12,000 km.)
52 The More Structures The Merrier
Average Improvement over T-Coffee
Struc/Seq Ratio
53 The Right Mix of Methods
54 3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
55 Applications
56 Looking Up The DNA Behind The Sequences: PROTOGENE
57 SAR Analysis
- Correlate Alignment Variations with Reactivity
- Application to the Human Kinome
- Collaboration with Sanofi-Aventis
- Main Issue
- Training problem: Proper Benchmarking