Title: Jacques.van.Helden@ulb.ac.be
1Phylogeny
2Species trees versus molecule trees
- A species tree aims at representing the evolutionary relationships between species.
- A molecule tree represents the evolutionary history of a family of related molecules (genes, proteins).
- A species tree can be inferred from various criteria, including the history of carefully chosen molecules.
- Species trees and gene trees are generally related, but not identical.
- A molecular family can contain several copies in the same species (in-paralogs), due to gene duplications.
- Some molecules can be transferred horizontally between species.
- Due to combinations of duplications and divergences, the tree of a given gene may be inconsistent with the species tree.
- Illustration: Figure 7.3 from Zvelebil and Baum.
Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.
3Tree reconciliation
Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.
4Concept definitions from Fitch (2000)
- Discussion of the definitions given in the paper
- Fitch, W. M. (2000). Homology: a personal view on some of the problems. Trends Genet 16, 227-31.
- Homology
- Owen (1843): "the same organ under every variety of form and function".
- Fitch (2000): homology is the relationship of any two characters that have descended, usually with divergence, from a common ancestral character.
- Note: a character can be a phenotypic trait, a site at a given position of a protein, a whole gene, ...
- Molecular application: two genes are homologous if they diverged from a common ancestral gene.
- Analogy: relationship of two characters that have developed convergently from unrelated ancestors.
- Cenancestor: the most recent common ancestor of the taxa under consideration.
- Orthology: relationship of any two homologous characters whose common ancestor lies in the cenancestor of the taxa from which the two sequences were obtained.
- Paralogy: relationship of two characters arising from a duplication of the gene for that character.
- Xenology: relationship of any two characters whose history, since their common ancestor, involves interspecies (horizontal) transfer of the genetic material for at least one of those characters.
[Figure labels: Analogy; Homology; Paralogy (xenology or not; xenologs from paralogs); Orthology (xenology or not)]
5Exercise
- On the basis of Fitch's definitions (previous slide), qualify the relationships between each pair of genes in the illustrative schema.
- P: paralog
- O: ortholog
- X: xenolog
- A: analog
- Orthologs can formally be defined as a pair of genes whose last common ancestor occurred immediately before a speciation event (e.g. a1 and a2).
- Paralogs can formally be defined as a pair of genes whose last common ancestor occurred immediately before a gene duplication event (e.g. b2 and b2').
Source: Zvelebil and Baum (2008)
6Exercise
- Example: B1 versus C1
- The two sequences (B1 and C1) were obtained from taxa B and C, respectively.
- The cenancestor (blue arrow) is the taxon that preceded the second speciation event (Sp2).
- The common ancestor gene (green dot) coincides with the cenancestor.
- -> B1 and C1 are orthologs.
- Orthologs can formally be defined as a pair of genes whose last common ancestor occurred immediately before a speciation event.
- Paralogs can formally be defined as a pair of genes whose last common ancestor occurred immediately before a gene duplication event.
- Source: Zvelebil and Baum (2008)
7Exercise
- Example: B1 versus C2
- The two sequences (B1 and C2) were obtained from taxa B and C, respectively.
- The common ancestor gene (green dot) is the gene that just preceded the duplication Dp1.
- This common ancestor is much older than the cenancestor (blue arrow).
- -> B1 and C2 are paralogs.
- Orthologs can formally be defined as a pair of genes whose last common ancestor occurred immediately before a speciation event.
- Paralogs can formally be defined as a pair of genes whose last common ancestor occurred immediately before a gene duplication event.
- Source: Zvelebil and Baum (2008)
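The reasoning used in the two examples (find the last common ancestor of the two genes, then look at the event that follows it) can be sketched in Python. This is only an illustration under stated assumptions: the toy gene tree, the tuple encoding and the function names are hypothetical and do not reproduce the figure from the slides, and xenology (horizontal transfer) is not handled.

```python
# Toy gene tree (hypothetical, NOT the schema from the slides).
# Internal node: (event, left_subtree, right_subtree); leaf: gene name.
TOY_TREE = (
    "duplication",                   # an ancient duplication (like Dp1)
    ("speciation", "B1", "C1"),      # copy 1 in taxa B and C
    ("speciation", "B2", "C2"),      # copy 2 in taxa B and C
)


def leaves(tree):
    """Return the set of gene names found below a node."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return leaves(left) | leaves(right)


def lca_event(tree, gene1, gene2):
    """Return the event attached to the last common ancestor of two genes."""
    if gene1 == gene2 or isinstance(tree, str) or not {gene1, gene2} <= leaves(tree):
        raise ValueError("expected two distinct genes present in the tree")
    event, left, right = tree
    for child in (left, right):
        # If both genes sit in the same child, the ancestor is further down.
        if not isinstance(child, str) and {gene1, gene2} <= leaves(child):
            return lca_event(child, gene1, gene2)
    return event


def relationship(tree, gene1, gene2):
    """Classify a pair of genes as 'ortholog' or 'paralog'."""
    return {"speciation": "ortholog", "duplication": "paralog"}[
        lca_event(tree, gene1, gene2)
    ]


print(relationship(TOY_TREE, "B1", "C1"))  # last common ancestor precedes a speciation
print(relationship(TOY_TREE, "B1", "C2"))  # last common ancestor precedes a duplication
```

In this toy tree, B1 and C1 come out as orthologs and B1 and C2 as paralogs, which mirrors the two worked examples.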
8Solution to the exercise
- On the basis of Fitch's definitions (previous slide), qualify the relationships between each pair of genes in the illustrative schema.
- P: paralog
- O: ortholog
- X: xenolog
- A: analog
9Cladistics, cladograms and clades
- Cladistics
- (Greek klados: branch) A branch of biology that determines the evolutionary relationships between organisms based on derived similarities (source: Wikipedia).
- Cladogram
- Tree-like drawing, usually with binary bifurcations, representing one evolutionary scenario about divergences between species or sequences.
- Clade
- Any sub-tree of a cladogram.
- Note: branch lengths do not reflect evolutionary time.
10Phylogram
- Phylogram: tree-like structure representing an evolutionary scenario, and including
- the events of divergence between species or sequences
- the evolutionary time between each species and the divergence events.
11Molecular clock
- The "molecular clock" hypothesis (left tree) assumes that rates of evolution do not vary between branches. All leaf nodes are thus aligned vertically.
- This hypothesis is not always valid.
- In some cases, two genes can diverge from a common ancestor, but one of them may have diverged faster than the other one. This is a rather classical mechanism of evolution: a duplication creates some redundancy, and one copy of the gene evolves whereas the other one retains the initial function.
Ultrametric tree (with clock) (e.g. UPGMA)
Without clock (e.g. neighbour-joining)
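The statement that all leaves are aligned under the clock hypothesis amounts to saying that every leaf lies at the same distance from the root. A minimal Python check, assuming a hypothetical encoding where a tree is either a leaf name or a list of (subtree, branch length) pairs, with branch lengths invented for illustration:

```python
def leaf_depths(tree, depth=0.0):
    """Return {leaf name: distance from the root} for a rooted tree.

    A tree is either a leaf name (str) or a list of (subtree, branch_length).
    """
    if isinstance(tree, str):
        return {tree: depth}
    depths = {}
    for subtree, length in tree:
        depths.update(leaf_depths(subtree, depth + length))
    return depths


def is_ultrametric(tree, tolerance=1e-9):
    """True if all leaves are equally distant from the root."""
    depths = list(leaf_depths(tree).values())
    return max(depths) - min(depths) <= tolerance


# Illustrative trees with made-up branch lengths.
clock_tree = [([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)]
no_clock_tree = [([("A", 1.0), ("B", 4.0)], 2.0), ("C", 3.0)]  # B evolved faster

print(is_ultrametric(clock_tree), is_ultrametric(no_clock_tree))
```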
12Phylogenetic inference from sequence comparison
- Alternative approaches
- Maximum parsimony
- Distance
- Maximum likelihood
[Flowchart: Unaligned sequences -> Sequence alignment -> Aligned sequences; decision nodes "strong similarity?" and "many (> 20) sequences?" (branches labelled yes/no); outcome shown so far: Maximum parsimony]
Source: Mount (2000)
13Maximum parsimony
- For each column of the alignment, all possible trees are evaluated and the tree with the smallest number of mutations is retained.
- The trees which fit with the highest number of columns are retained.
- The program can return several trees.
Adapted from Mount (2000)
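To make "the smallest number of mutations" concrete, the sketch below counts the minimal number of changes required by each column on a given tree, using Fitch's small-parsimony algorithm. This is a swapped-in, standard way of scoring one topology and not necessarily what protpars does internally. The nested-tuple tree encoding and the four toy sequences are assumptions made for the illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimal number of changes) for one column.

    tree: leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its residue in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimal number of changes over all alignment columns."""
    n_columns = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_columns)
    )


# Made-up alignment; with 4 sequences there are 3 unrooted topologies.
alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA", "s4": "TCTA"}
topologies = {
    "(s1,s2),(s3,s4)": (("s1", "s2"), ("s3", "s4")),
    "(s1,s3),(s2,s4)": (("s1", "s3"), ("s2", "s4")),
    "(s1,s4),(s2,s3)": (("s1", "s4"), ("s2", "s3")),
}
scores = {label: parsimony_score(t, alignment) for label, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores, "most parsimonious:", best)
```

Scoring every topology like this is only feasible for a handful of sequences, since the number of candidate trees grows very quickly.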
14Maximum parsimony example
- Parsimony tree calculated from a multiple alignment of the E. coli proteins containing a lacI-type HTH domain.
- Left: text representation (protpars output).
- Bottom right: visualized with njplot (in the ClustalX distribution).
[Text tree (protpars output) with 16 leaves: CYTR_ECOLI, EBGR_ECOLI, CSCR_ECOLI, IDNR_ECOLI, GNTR_ECOLI, MALI_ECOLI, TRER_ECOLI, YCJW_ECOLI, LACI_ECOLI, FRUR_ECOLI, RAFR_ECOLI, ASCG_ECOLI, GALS_ECOLI, GALR_ECOLI, RBSR_ECOLI, PURR_ECOLI]
remember: this is an unrooted tree!
requires a total of 4095.000
15Maximum parsimony - drawbacks
- Number of trees to evaluate increases exponentially with the number of sequences.
- Assumes that all sequences evolved at the same rate (molecular clock hypothesis).
- Only works for well conserved sequence families.
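To get a feeling for the first drawback: the number of distinct unrooted bifurcating trees for n labelled sequences is the standard combinatorial result (2n-5)!! = 1 x 3 x 5 x ... x (2n-5), which grows even faster than an exponential. A small Python sketch (the function name is chosen for this illustration):

```python
def n_unrooted_trees(n_sequences):
    """Number of unrooted bifurcating trees with n labelled leaves: (2n-5)!!"""
    if n_sequences < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    for factor in range(3, 2 * n_sequences - 4, 2):  # 3, 5, ..., 2n-5
        count *= factor
    return count


for n in (4, 5, 10, 16, 20):
    print(n, n_unrooted_trees(n))
```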
16Phylogenetic inference from sequence comparison
- Alternative approaches
- Maximum parsimony
- Distance
- Maximum likelihood
[Flowchart: Unaligned sequences -> Sequence alignment -> Aligned sequences; decision nodes "strong similarity?", "many (> 20) sequences?" and "clear similarity?" (branches labelled yes/no); outcomes shown so far: Maximum parsimony, Distance]
Source: Mount (2000)
17Distance method
- Starting from a multiple alignment, calculate the distance between each pair of sequences.
- Calculate a tree which fits the distance matrix as well as possible.
- Branch lengths should correspond to distances.
- The tree can be rooted or unrooted.
- Several methods can be used for calculating a tree from the distance matrix.
- Fitch-Margoliash
- Neighbour-Joining
- UPGMA
Aligned sequences -> Distance calculation -> Distance matrix -> Tree calculation -> Tree
18Distance matrix
- The distance matrix indicates the distance between each pair of sequences.
- The matrix is symmetrical, and the diagonal only contains 0s.
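As an illustration, the sketch below builds such a matrix from aligned sequences. The slides do not say which distance is used; the p-distance (proportion of differing positions, skipping columns with a gap) is assumed here because it is the simplest choice, whereas programs such as protdist or dnadist apply corrected distances. Sequences and function names are made up.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing positions, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return (names, square matrix) for a dict of aligned sequences."""
    names = list(alignment)
    matrix = [[p_distance(alignment[x], alignment[y]) for y in names] for x in names]
    return names, matrix


# Made-up aligned sequences.
toy_alignment = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACCTAGGA"}
names, matrix = distance_matrix(toy_alignment)
for name, row in zip(names, matrix):
    print(name, row)
```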
19Trees
Rooted tree
Unrooted tree
Unrooted tree
- The distance between two nodes is the sum of
lengths of the branches between them
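A short Python sketch of this definition, assuming an unrooted tree stored as a list of (node, node, branch length) triples. The node names and lengths are hypothetical.

```python
def build_tree(branches):
    """Turn (node_u, node_v, length) triples into an undirected adjacency dict."""
    edges = {}
    for u, v, length in branches:
        edges.setdefault(u, {})[v] = length
        edges.setdefault(v, {})[u] = length
    return edges


def tree_distance(edges, start, end):
    """Sum of branch lengths along the unique path between two nodes."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, distance = stack.pop()
        if node == end:
            return distance
        for neighbour, length in edges[node].items():
            if neighbour != parent:  # never walk back along the same branch
                stack.append((neighbour, node, distance + length))
    raise ValueError("nodes are not connected")


# Unrooted toy tree: leaves A, B, C, D and internal nodes X, Y.
toy_tree = build_tree([
    ("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 3.0),
    ("C", "Y", 1.0), ("D", "Y", 4.0),
])
print(tree_distance(toy_tree, "A", "C"))  # branches A-X, X-Y and Y-C
```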
20Methods for calculating trees from a distance matrix
- It is usually not possible to find a tree whose branch lengths fit all the values of the distance matrix.
- Several approaches exist to calculate a tree which approximates the distances.
- The Fitch-Margoliash method minimizes the sum of squares between distances in the matrix and distances in the tree.
- The Neighbour-Joining (NJ) method minimizes the sum of branch lengths for the resulting tree. This method does not assume a molecular clock; it is thus appropriate when some protein sequences have evolved faster than others. It returns an unrooted tree.
- The Unweighted Pair-Group Method with Arithmetic Averaging (UPGMA) clusters the sequences by order of distance in the distance matrix. This method relies on the assumption of an evolutionary clock, and it produces a rooted tree.
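A minimal UPGMA sketch in Python, under stated assumptions: the input is a list of names plus a full symmetric distance matrix, and the returned nested tuples (left, right, node height) are a format chosen for this illustration, not the output format of PHYLIP or ClustalX. The example matrix is invented and perfectly ultrametric.

```python
def upgma(names, matrix):
    """Cluster sequences by UPGMA; return a rooted tree as nested tuples.

    Leaves are names; internal nodes are (left, right, height), where
    height is half the distance between the two merged clusters.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the two closest clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to the new cluster: size-weighted average of old distances.
        new_dist = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


toy_names = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(toy_names, toy_matrix))
```

Because every merge places the new node at half the inter-cluster distance, all leaves end up at the same depth. This is where the clock assumption enters.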
21Example of phylogenetic tree
- This tree was obtained with the Neighbour-Joining method (implemented in ClustalX).
- The drawing was obtained with njplot (part of the ClustalX package).
- Each branch of the tree is labelled with the distance.
22Distance-based methods for calculating trees in the package PHYLIP
- Summary of the methods for calculating a tree from a distance matrix.
23Bootstrapping
- In some cases, the data do not allow the phylogeny to be inferred reliably.
- To assess the reliability of the inference, one can apply the bootstrap method.
- Given an alignment of n sequences and p columns, one performs a random selection of p columns, with replacement. Some columns can thus be selected multiple times, whilst some others are not selected at all.
- Calculate a tree with the sampled columns.
- Repeat many (e.g. 1000) times, and check whether the same branches occur frequently (e.g. > 70%).
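The resampling step can be sketched as follows. The tree-building step is deliberately left as a parameter: `infer_clades` is a hypothetical callable that takes an alignment and returns the clades (as frozensets of sequence names) of the inferred tree. The `closest_pair` function is only a stand-in so that the example runs, not a real phylogenetic method. In PHYLIP, the corresponding programs are seqboot and consense.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Sample p columns with replacement from an alignment of p columns."""
    n_columns = len(next(iter(alignment.values())))
    sampled = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[i] for i in sampled) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, n_replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each clade is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(infer_clades(bootstrap_alignment(alignment, rng)))
    return {clade: n / n_replicates for clade, n in counts.items()}


def closest_pair(alignment):
    """Stand-in for a tree-building method: group the two most similar sequences."""
    names = sorted(alignment)

    def n_differences(pair):
        return sum(a != b for a, b in zip(alignment[pair[0]], alignment[pair[1]]))

    pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
    return [frozenset(min(pairs, key=n_differences))]


# Made-up alignment.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TGCATGCA"}
for clade, support in bootstrap_support(toy, closest_pair, n_replicates=200).items():
    print(sorted(clade), support)
```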
24Phylogenetic inference from sequence comparison
- Alternative approaches
- Maximum parsimony
- Distance
- Maximum likelihood
[Flowchart: Unaligned sequences -> Sequence alignment -> Aligned sequences; decision nodes "strong similarity?", "many (> 20) sequences?" and "clear similarity?" (branches labelled yes/no); outcomes: Maximum parsimony, Distance, Maximum likelihood]
Source: Mount (2000)
25Practicals with phylogeny.fr
26Phylogeny.fr
- http://www.phylogeny.fr
- Offers a user-friendly interface to run all the steps for inferring phylogeny from a set of unaligned sequences.
- Completely automated workflow or user-specified parameters.
- Alternative methods for each step of the workflow.
- Results are exported in multiple formats (convenient for using them with other programs).
- Results can be displayed immediately (for fast programs) or sent by email (slow programs).
27Phylogeny.fr sequence input
- The "one click" option only requires you to enter a set of sequences and click on the submit button.
28Phylogeny.fr work flow
- At each step of the workflow, you can
- Check the parameters used for the analysis
- Choose alternative parameters (advanced use)
- Export the intermediate and final results in a
variety of formats, which can then be opened in
other programs.
29Phylogeny.fr - alignment result
30Phylogeny.fr - phylogenetic tree in text format
31Phylogeny.fr - Phylogram (various output formats are supported)
32Phylogeny.fr - display options
33Phylogram with an outgroup added (Bacillus) but not correctly rooted (midpoint grouping)
34Cladogram incorrectly rooted (midpoint)
35Phylogram rooted with an outgroup
36Further reading
37Further reading
- Textbooks
- Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.
- Mount, D.W. (2001) Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York.
- Pevsner, J. (2003) Bioinformatics and Functional Genomics. Wiley.
- All his teaching material is available at http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm
38Supplementary material
39PHYLIP flowchart
[Flowchart of PHYLIP programs, from aligned sequences to tree drawing]
- Input: aligned sequences
- Bootstrapping: seqboot
- Distance calculation: protdist, dnadist (output: distance matrix)
- Parsimony: protpars, dnapars
- Branch-and-bound: dnapenny
- Maximum likelihood: dnaml, protml
- UPGMA: neighbor (rooted)
- Fitch-Margoliash: fitch (unrooted), kitsch (rooted)
- Neighbor-joining: neighbor
- Output: tree
- consense
- retree
- Tree drawing: drawgram (drawing of rooted tree)
- Tree drawing: drawtree (drawing of unrooted tree)
40Taxonomy of bacteria having a gene metA (August 2004)
Bacteria
Bacillales
Bacillaceae
Bacillus
Firmicutes
Clostridia
Clostridiales
Clostridium
Lactococcus
Streptococcus
Brucella
Alpha subdivision
Rhizobiaceae group
Rhizobium
Sinorhizobium
Proteobacteria
Epsilon subdivision
Campylobacter group
Campylobacter
Escherichia
Enterobacteriaceae
Salmonella
Gamma subdivision
Yersinia
Vibrionaceae
Vibrio
Thermotogae
Thermotogae (class)
Thermotogales
Thermotoga
41Tree nomenclature
- Node
- Leaf
- Internal branch
- External branch
42Alignment methods
Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.
43Evolutionary model
Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.