FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES - PowerPoint PPT Presentation

About This Presentation
Title:

FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES

Description:

the more you know about the compared proteins and the method used, ... Methods in Enzymology 224, 456-487. 1. 2. 3. 4. 5. Align Sequences. Phylogenetic signal? ... – PowerPoint PPT presentation

Number of Views:240
Avg rating:3.0/5.0
Slides: 48
Provided by: rober210
Category:

less

Transcript and Presenter's Notes

Title: FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES


1
FROM PROTEIN SEQUENCES TO PHYLOGENETIC TREES
Simon Harris Wellcome Trust Sanger Institute, UK
2
Agenda
  • Remind you that molecular phylogenetics is
    complex
  • the more you know about the compared proteins and
    the method used, the better
  • Try to avoid the black box approach as much as
    possible!
  • Give an overview of some phylogenetic methods and
    software used with protein alignments - some
    practical issues

3
From DNA/protein sequences to trees
1
Sequence data
2
Align Sequences
Phylogenetic signal? Patternsgtevolutionary
processes?
3
Distance methods
Character based methods

Distance calculation (which model?)
4
Choose a method
MB
ML
MP
Weighting? (sites, changes)?
Model?
Model?
Single tree
Optimality criterion
LS
ME
NJ

Calculate or estimate best fit tree

5
Test phylogenetic reliability
Modified from Hillis et al., (1993). Methods in
Enzymology 224, 456-487
4
Phylogenies from proteins
  • Parsimony
  • Distance matrices
  • Maximum likelihood
  • Bayesian methods




5
1 2 3 4 5 A G G G C G B A G G C T C A T T C
T D A T T T G
Characters based methods
Distance methods
Distance matrix
A B C D A - d1 d2 d3 B - d4 d5 C
- d6 D -
MP ML BM
Explicit model of sequence evolution
B
C
D
A
6
Phylogenetic trees from protein alignments
  • Distance methods - model for distance estimation
  • Simple formula (e.g. Kimura, use of Dij, LogDet)
  • Complex models
  • Probability of amino acid changes - Mutational
    Data Matrices
  • Site rate heterogeneity
  • Maximum likelihood and Bayesian methods- MDM
    based models are used for lnL calculations of
    sites -gt lnL of trees
  • Site rate heterogeneity
  • Homogenous versus heterogeneous models
  • Estimations of data specific rate matrices (amino
    acid groupings - GTR like)

7
Software an overview
  • CLUSTALXv2 - kimura distances
  • PHYLIPv3.62 - distance, MP, and ML methods (and
    more)
  • Some complex protein models
  • PAM, JTT site rate heterogeneity
  • Bootstrapping - bootstrap support values
  • TREE-PUZZLEv5.2 - distance and a ML method
  • ML - quartet method
  • Complex protein models
  • JTT, WAGmatrices site rate heterogeneity
  • From quartets to n-taxa tree - PUZZLE support
    values
  • Some sequence statistics - aa frequency and
    heterogeneity between sequences
  • Tree comparisons - KH test
  • PHYML - ML methods (protein and DNA)
  • MRBAYES - Bayesian
  • Complex protein models
  • JTT, WAGmatrices site rate heterogeneity
  • Data partitioning
  • Posteriors as support values
  • PHYLOBAYESv2.3

8
PHYLIP3.62
  • Protpars parsimony
  • Protdist models for distance calculations
  • PAM1, JTT, Kimura formula (PAM like), others...
  • Correction for rate heterogeneity between sites!
    Removal of invariant sites? (not estimated, see
    TREE-PUZZLE5.2!)
  • NJ and LS distance trees ( molecular clock)
  • Proml protein ML analysis (no estimation of site
    rate heterogeneity - see TREE-PUZZLE5.2)
  • Coefficient of variation (CV) versus alpha shape
    parameter CV1/alpha1/2
  • Bootstrapping

9
Distance methods
  • A two step approach - two choices!
  • 1) Estimate all pairwise distances
  • Choose a method (100s) - has an explicit model
    for sequence evolution
  • Simple formula
  • Complex models - PAM, JTT, site rate variation
  • 2) Estimate a tree from the distance matrix
  • Choose a method with (ME, LS) or without an
    optimality criterion (NJ)?

10
Simple and complex models
  • dij -Ln (1 - Dij - (Dij2/5)) (Kimura)
  • Simple and fast but can be unreliable -
    underestimates changes, hence distances, which
    can lead to misleading trees - PHYLIP, CLUSTALX
  • Dij is the fraction of residues that differs
    between sequence i and j (Dij 1 - Sij)
  • dij ML P(n), (G, pinv), Xij (bad
    annotation!)
  • ML is used to estimate the dij based on the
    sequence alignment and a given model - MDM, gamma
    shape parameter and pinv - PHYLIP, PUZZLE. Each
    site is used for the calculation of dij, not just
    the Dij value.
  • More realistic complexity in relation to protein
    evolution and the subtle patterns of amino acid
    exchange rates
  • Note the values of the different parameters
    (alphapinv) have to be either estimated, or
    simply chosen (MDM), prior the dij calculations

11
1) Choosing/estimating the parameter of a model
  • 1) Mutation Data Matrices PAM, JTT, WAG
  • What are the properties of the protein alignment
    ( identity, amino acid frequencies, globular,
    membrane)?
  • Can be corrected for the specific dataset amino
    acid frequencies (-F) - in some software only
  • Compare ML of different models for a given data
    and tree
  • ModelGenerator and ProtTest are designed for this
  • 2) Alpha and pInv values have to be estimated on
    a tree
  • TREE-PUZZLE can do that. Reasonable trees give
    similar values

12
2) Inferring phylogenetic trees from the
estimated dij
  • a) Without an optimality criterion
  • Neighbor-joining (NJ) (NEIGHBOR)
  • Different algorithms exist - improvement of the
    computing
  • If the dij are additive, or close to it, NJ
    will find the ME tree
  • BIONJ, WEIGBOR, FastME
  • b) With an optimality criterion
  • Least squares (FITCH)
  • Minimum evolution (in PAUP - now also PHYLIP)

13
Fitch Margoliash Method 1968
  • Seeks to minimise the weighted squared deviation
    of the tree path length distances from the
    distance estimates -uses an objective function

E the error of fitting dij to pij T number of
taxa if a 2 weighted least squares wij the
weighting scheme
dij F(Xij) pairwise distances estimate - from
the data using a specific model (or simply
Dij) pij length of path between i and j implied
on a given tree dij pij for additive datasets
(all methods will find the right tree)

14
  • Minimum Evolution Method
  • For each possible alternative tree one can
    estimate the length of each branch from the
    estimated pairwise distances between taxa (using
    the LS method) and then compute the sum (S) of
    all branch length estimates. The minimum
    evolution criterion is to choose the tree with
    the smallest S value

With Vk being the length of the branch k on a tree
15
Distance methods
  • Advantages
  • Can be fast (NJ)
  • Some distance methods (LogDet) can be superior to
    more complex approached (ML) in some conditions
    (shown for DNA alignments)
  • Distance trees can be used to estimate parameter
    values for more complex models and then used in a
    ML method
  • Provides trees with branch lengths
  • Disadvantages
  • Can lose information by reducing the sequence
    alignment into pairwise distances
  • Can produce misleading (like any method) trees in
    particular if distance estimates are not
    realistic (bad models), deviates from additivity

16
Character based methods
  • Maximum likelihood based methods
  • Quartet puzzling method - TREE-PUZZLE
  • Standard ML - PHYML, PROML (PHYLIP)
  • Bayesian based methods
  • MrBayes v3.1
  • Phylobayes v2.3
  • P4 (Peter Foster)

17
1 2 3 4 5 A G G G C G B A G G C T C A T T C
T D A T T T G
Characters based methods
Distances methods
Distance matrix
A B C D A - d1 d2 d3 B - d4 d5 C
- d6 D -
MP ML BM
Explicit model of sequence evolution
B
C
D
A
18
TREE-PUZZLE5.2
  • Protein maximum likelihood method using quartet
    puzzling
  • With various protein rate matrices (JTT, WAG)
  • Can include correction for rate heterogeneity
    between sites - pinv gamma shape (can estimates
    the values)
  • Can estimate amino acid frequencies from the data
  • List site rates categories for each site (2-16)
  • Composition statistics
  • Molecular clock test
  • Can deal with large datasets
  • Can be used for ML pairwise distance estimates
    with complex models - used with puzzleboot to
    perform bootstrapping with PHYLIP

19
A gamma distribution can be used to model site
rate heterogeneity
Yang 1996 TREE, 11, 367-372
20
TREE-PUZZLE5.2
The quartet ML tree search method has four steps
  • 1) Parameters (pInv-gamma) are estimated on a NJ
    n-taxa tree
  • 2) Calculate the ML tree for all possible
    quartets (4-taxa)
  • 3) Combine quartets in a n-taxa tree (puzzling
    step)
  • 4) Repeat the puzzling step numerous times (with
    randomised order of quartet input)
  • 5) Compute a majority rule consensus tree
  • from all n-trees - has the puzzle support
    value

Puzzle support values are not bootstrap values!
21
(No Transcript)
22
TREE-PUZZLE5.2
  • Models for amino acid changes
  • PAM, JTT, BLOSUM64, mtREV24, WAG (with correction
    for amino acid frequencies)
  • Correction for specific dataset amino acid
    frequencies
  • Discrete gamma model for rate heterogeneity
    between sites 4-16 categories.
  • -gt output gives the rate category for each site.
    Can be used to partition your data and analyse
    them separately
  • Taxa composition heterogeneity test
  • Molecular clock test

23
Combination of categories that contributes the
most to the likelihood (computation done without
clock assumption assuming quartet-puzzling
tree) 663480606551131631680164026401551353517555
877778857576788666 3026013700563377575867841887403
66735872783378410002242458471 83622611151658686136
640100010026331405414845653410774876588
24
TREE-PUZZLE5.2
  • Can be used to calculate pairwise distances with
    a broad diversity of models - puzzleboot (Holder
    Roger)
  • Can be used in combination with PHYLIP programs
    for bootstrapping
  • SEQBOOT
  • NJ or LS
  • CONSENSE
  • But PHYML can do ML bootstrapping in a fair
    amount of time

25
TREE-PUZZLE5.2
  • Advantages
  • Can handle larger numbers of taxa for maximum
    likelihood analyses
  • Implements various models (BLOSUM, JTT, WAG) and
    can incorporate a correction for rate
    heterogeneity (pinvgamma)
  • Can estimate for a given tree the gamma shape
    parameter and the fraction of constant sites and
    attribute to each site a rate category
  • Disadvantages
  • Quartet based tree search - amplification of the
    long branch attraction artefact within each
    quartet analysis?

26
MrBayes 3.1
  • Bayesian approach
  • Iterative process leading to improvement of trees
    and model parameters and that will provide the
    most probable trees (and parameter values)
  • Complex models for amino acid changes
  • PAM and JTT, WAG (with correction for amino acid
    frequencies, but you have to type it!?!?!)
  • Correction for rate heterogeneity between sites
    (pinv, discrete gamma, site specific rates)
  • Powerful parameter space search
  • Tree space (tree topologies)
  • Shape parameter (alpha shape parameter, pinv)
  • Can work with large dataset
  • Provides probabilities of support for clades
    (posterior probabilities)

27
MrBayes 3.1
  • MrBayes will produce a population of trees and
    parameter values - obtained by a Markov chain
    (mcmcmc). If the chain is working well these will
    have converged to probable values
  • In practice we plot the results of an mcmcmc to
    determine the region of the chain that converged
    to probable values. The burn in is the region
    of the mcmcmc that is ignored for calculation of
    the consensus tree
  • Trees and parameter values from the region of
    equilibrium are used to estimate a consensus tree
  • The number of trees recovering a given clade
    corresponds to the posterior for that clade, the
    probability that this clade exists
  • The mcmcmc uses the lnL function to compare trees
    between generations

28
MrBayes 3.1
  • Most methods provide a single tree and parameters
    value
  • Bootstrapping provide a distribution of tree
    topologies
  • Puzzling steps also provides a distribution tree
    topologies
  • Bootstrap values - Puzzle support values -
    Posteriors values ???
  • But not to sure how to interpret these
    different support values - in each case the
    support values are for a given dataset and method
    used
  • Posteriors are typically higher then bootstrap
    and puzzle support values?!

29
MrBayes 3.1 some options
30
MrBayes 3.1 an example
NEXUS begin data dimensions ntax8
nChar500 format datatypeprotein gap-
missing? matrix Etc Begin mrbayes log
start filenamed.res.nex.log replace prset
aamodelprfixed(wag) lset ratesinvgamma
Ngammacat4 set autocloseyes mcmc ngen5000
printfreq500 samplefreq10 nchains4
savebrlensyes startingtreerandom
filenamed.res.nex.out quit end Begin
mrbayes log start filenamed.res.nex.con.log
replace sumt filenamed.res.nex.out.t
burnin150 contypeallcompat end
Block I
Block II
31
(No Transcript)
32
  • A Bayesian analysis
  • Propose a starting tree topology and parameters
    values (branch length, alpha, pinv), calculate
    lnL
  • Change one of these and compare the lnL with
    previous proposal
  • If the lnL is improved accept it
  • If not, accept it only sometimes
  • Do many of these
  • Plot the change of lnL in relationship to the
    number of generations run
  • Determine the region where the chain converged
    and calculate the consensus tree for that region
  • -gt consensus tree with posteriors for clade
    support

Tree lnL
Zooming in
Tree lnL
generations (mcmcmc)
33
alpha
  • Burn in determines the trees to be ignored for
    consensus tree calculation
  • Was the chain run long enough?
  • Do we get the same result from an independent
    chain?

pinv
generations (mcmcmc)
34
Consensus tree with a burn in of 1500
(150) Showing posterior values for the different
clades - probability for a given clade to be
correct (for the given data and method used!!!)
-----------------------A ---------------------
-----B ---------------------
--C ---------(0.98)
-------------------------D
---------(0.99) --------(0.49)
----------------------E
-----------------------F
---------(0.96)
------------------------G
--------(0.81)
--------------------------H
35
Model choice in protein analyses
  • Rate matrix choice (20x20 matrices)
  • WAG, BLOSSUM62, etc
  • Recoding protein datasets
  • 20x20 --gt 6x6 rate matrix (or else)
  • Implemented in P4 and PhyloBayes v2.3

36
Effect of using different rate matrices on
phylogenetics
PHYML MtRev matrix
PHYML WAG matrix
Keane et al. 2006
37
  • Numerous eukaryotes do not possess mitochondria
  • They possess instead hydrogenosomes or mitosomes
  • What is the evolutionary origin of these
    organelles and what is their function?

38
Trichomonas NuoF localises in the hydrogenosomes
Complex I
News and views by Gray (2005). Nature 434, 29-30.
39
Amino acids categories - recoding in p4
  • Sulfhydryl C (1)
  • Smallhydrophilic S, T, A, P, G (2)
  • Acid,amide D, E, N, Q (3)
  • Basic H, R, K (4)
  • Smallhydrophobic M, I, L, V (5)
  • Aromatic F, Y, W (6)

1 2 3 4 5 6 1- x1 x2 x3 x4 x5 2-
x6 x7 x8 x9 3- x10 x11 x12 4-
x13 x14 5- x15 6-
Recoding into 6 states (1-6) allows the
estimation of a GTR like matrix with 14 free
parameters
40
Why recode amino acids?
  • Potential advantages
  • Allows generation of a rate matrix specific for
    the investigated alignment
  • Contributes to mitigating amino acid composition
    heterogeneity and homoplasy due to frequent
    changes within categories - equivalent to DNA
    transversion analyses
  • Potential disadvantage
  • Loss of potential useful signal by reducing the
    alphabet from 20 to 6 letters (or else)

41
Recoding effect on NuoF phylogeny
GTRrecodedDayhoff classes (6x6) pInvG
WAG (20x20) pInvG
1.0
0.92
1.0
1.0
a-proteobacteria
1.0
0.96
0.8
1.0
1.0
42
Summary
  • No single program allows thorough phylogenetic
    analyses of protein alignments
  • Combination of PHYLIPv3.6, TREE-PUZZLEv5.2,
    PHYML, MrBAYESv3.1, PHYLOBAYESv2.3 and P4 allow
    detailed protein phylogenetics
  • Experimenting with your data and available
    methods/models can lead to interesting and
    biologically relevant results (data lt-gt method)
  • Incorporate site rate heterogeneity correction in
    the model or reduce heterogeneity by data editing
    (with and without invariant sites?)
  • Partitioning of the alignment (variant - various
    rates, invariant sites, secondary structure,
    protein domains)
  • Amino acid groupings (6 categories - GTR like)
  • LogDet for proteins - rare/absent changes? For
    long alignments?
  • DNA based LogDet or the protein alignment?
  • Do not take support values as absolute. Any
    support value is for a given method and data,
    only!

43
outside
PM
inside
TM domains
  • TM domain have very specific structural
    requirements AA composition from TM domain is
    very distinct from non TM domains!
  • Extracellular and intracellular domains may also
    have important functional differences --gt
    different functional constrains can lead to
    different AA composition.
  • More generally in any protein, surface exposed
    AA composition is typically distinct from
    internal AA.

44
outside
PM
inside
Global alignment (AA)
Sub-alignment 1
B
C
Model 1
D
A
Sub-alignment 1
B
C
Model 2
D
A
45
outside
PM
inside
Partition 1
Partition 2
Model 1
Model 2
B
C
D
A
46
outside
PM
inside
Global alignment (AA)
AA recoding, can mitigate compositional
difference between domains
  • Sulfhydryl C (1)
  • Smallhydrophilic S, T, A, P, G (2)
  • Acid,amide D, E, N, Q (3)
  • Basic H, R, K (4)
  • Smallhydrophobic M, I, L, V (5)
  • Aromatic F, Y, W (6)

B
C
D
A
47
outside
PM
inside
Global alignment (AA)
Global alignment (DNA)
DNA based models LogDet? Codon 1,2 or 3?
B
C
D
A
Write a Comment
User Comments (0)
About PowerShow.com