Title: Introduction to phylogenetic/phylogenomic concepts and methods
1. Introduction to phylogenetic/phylogenomic concepts and methods
- With a brief discussion of phylogenomic methods for species phylogeny estimation and orthology prediction
2. "Nothing in Biology Makes Sense Except in the Light of Evolution"
- Theodosius Dobzhansky (1973)
3. Revised tree of life including horizontal transfer
[Figure labels: horizontal transfer, endosymbiosis]
4. Is Phylogenetic Analysis Truth?
- No -- it's just a prediction and, like any other prediction method, it is prone to errors of different types.
- It's important that you be aware of the sources of possible error in a phylogenetic tree reconstruction.
- However, phylogenetic reconstruction is one of the best tools available for understanding both how species evolve and how gene families evolve novel functions and structures.
- As Winston Churchill said
5. Winston Churchill
http://www.winstonchurchill.org/
6. Workflow for phylogenomic analysis
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics (2004) 20(2):170-179.
7. Delsuc et al., "Phylogenomics and the reconstruction of the tree of life," Nature Reviews Genetics 6, 361-375 (May 2005)
8. Trees are a special type of graph
- Graphs have nodes (vertices) and edges (branches)
- Edges can be directed or undirected
- Nodes can be internal or terminal
- Terminal nodes in a phylogenetic tree are called leaves (or taxa)
- The term taxon refers to (groups of) species, but is commonly used to describe genes in multi-gene families, even when the same species may be found in multiple copies in the tree
- Trees are a special subtype of graph (acyclic connected graphs)
- The valency (or degree) of a node equals the number of edges incident to it
- A tree in which every internal node (except for the root) has degree 3 (one ancestor and two children) is called a bifurcating or binary tree
- Trees in which internal nodes can have >2 children are called multifurcating trees
- The diameter of a tree is the length of the longest path between two leaves (summing edge lengths, not simply counting edges)
- Most phylogenetic trees are unrooted, and special methods must be used to infer the root
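The graph vocabulary above (degree, binary tree, diameter) can be made concrete with a small sketch. The tree below, its taxon names, and its edge lengths are invented for illustration, and the function names are hypothetical helpers, not part of any phylogenetics package. The diameter is found with the usual two-pass search, which assumes positive edge lengths.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map from (node_a, node_b, length) triples."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length
    return adj

def degree(adj, node):
    """Valency of a node: the number of edges incident to it."""
    return len(adj[node])

def leaves(adj):
    """Terminal nodes are the nodes of degree 1."""
    return sorted(n for n in adj if len(adj[n]) == 1)

def is_binary_unrooted(adj):
    """True if every internal node of an unrooted tree has degree 3."""
    return all(len(nbrs) == 3 for nbrs in adj.values() if len(nbrs) > 1)

def farthest_leaf(adj, start):
    """Return (leaf, distance) for the leaf farthest from start."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if len(adj[node]) == 1 and dist > best[1]:
            best = (node, dist)
        for nbr, length in adj[node].items():
            if nbr != parent:
                stack.append((nbr, node, dist + length))
    return best

def diameter(adj):
    """Longest leaf-to-leaf path length, summing edge lengths."""
    far_leaf, _ = farthest_leaf(adj, leaves(adj)[0])
    _, dist = farthest_leaf(adj, far_leaf)
    return dist

# Hypothetical unrooted tree: internal nodes u and v, four leaves.
example_edges = [
    ("human", "u", 1.0), ("mouse", "u", 1.5), ("u", "v", 2.0),
    ("fly", "v", 3.0), ("worm", "v", 4.0),
]
example_tree = build_adjacency(example_edges)
print(leaves(example_tree), degree(example_tree, "u"), diameter(example_tree))
```

Here the longest path runs from mouse to worm (1.5 + 2.0 + 4.0), even though several leaf pairs are separated by the same number of edges.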
9. Interpreting tree topologies
- Many phylogenetic trees are not meant to be interpreted as rooted (more about this later)
- Terminal nodes (leaves) represent contemporary taxa (organisms, genes, proteins, or other objects)
- Internal nodes represent inferred ancestors
- Edge lengths are supposed to be proportional to the evolutionary distance
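10. [Figure: example tree with leaves human, mouse and fruit fly; labels mark a node, a clade, the possible root position ("Root?"), and the terminal nodes (leaves), also called taxa (singular: taxon)]
From Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, edited by Baxevanis and Ouellette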
11. Finding an optimal tree topology is unlikely!
- Number of (unrooted) binary trees on n leaves is (2n-5)!!
- If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
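The double factorial (2n-5)!! is the product of every odd number from 2n-5 down to 1. A minimal sketch for exploring how quickly it grows follows; the function names are hypothetical and the code is not taken from the slides.

```python
def num_unrooted_binary_trees(n_leaves):
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 2n-5, 2n-7, ..., 3 (the final factor 1 is implicit).
    for k in range(2 * n_leaves - 5, 1, -2):
        count *= k
    return count

def num_digits(value):
    """Number of decimal digits, a convenient way to report very large counts."""
    return len(str(value))

for n in (4, 5, 6, 10, 20, 100, 1000):
    trees = num_unrooted_binary_trees(n)
    print(n, "leaves:", num_digits(trees), "digit(s) in the tree count")
```

Python integers have arbitrary precision, so the count for 1000 leaves is computed exactly rather than overflowing.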
12. Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features
  - Presence of vertebrae, number of legs, fur vs. scales, etc.
- Molecular evolution: using molecular features (DNA, RNA and proteins)
  - Analysis based on homologous sequences (e.g., globins) in different species
- Additional issues
  - Restriction to orthologs is critical when reconstructing species phylogenies.
  - If estimating a gene family phylogeny, you're better off gathering all homologs and figuring out orthologs from the tree topology.
  - Slow-evolving genes are best for reconstructing phylogenies of distant species (use protein sequences)
  - Fast-evolving genes (and use of nucleotide data) may be more effective at reconstructing phylogenies of closely related species
13. Rooted vs unrooted trees
- Most trees are unrooted -- you need to take additional actions to root the tree
- Rooting a species tree: not so hard
- Rooting a protein superfamily tree (with gene duplication): hard
14. Two main classes of tree inference methods
- Distance-based
  - Input is a matrix of distances between species
  - Can be fraction of residues they disagree on, or -alignment score between them, or ...
  - E.g., Neighbor-Joining, UPGMA, etc.
- Character-based
  - Examine each character (e.g., residue) separately
  - E.g., Maximum Likelihood, Maximum Parsimony, MrBayes
- Question to test understanding
  - Is SATCHMO a distance-based or character-based method? What about SCI-PHY?
15. Distance-based Phylogenetic Methods
T. Warnow
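As an illustration of the input a distance-based method expects, the sketch below builds a matrix using the simplest distance mentioned earlier, the fraction of aligned positions at which two sequences disagree. The sequences are invented, the function names are hypothetical, and skipping positions where either sequence has a gap is an assumption of this sketch, not a rule from the slides.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of compared positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # assumption: positions with a gap are not scored
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions in common")
    return mismatches / compared

def distance_matrix(aligned):
    """Return a symmetric {name: {name: distance}} matrix for aligned sequences."""
    names = list(aligned)
    return {
        a: {b: (0.0 if a == b else p_distance(aligned[a], aligned[b])) for b in names}
        for a in names
    }

# Invented toy alignment.
toy_alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACGA",
}
toy_matrix = distance_matrix(toy_alignment)
for name, row in toy_matrix.items():
    print(name, row)
```

A matrix like this would then be handed to a method such as Neighbor-Joining or UPGMA; in practice a corrected distance is often preferred over the raw fraction of differences.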
16. Maximum Parsimony
- Input: Set S of n aligned sequences of length k
- Output:
  - A phylogenetic tree T leaf-labeled by sequences in S
  - Additional sequences of length k labeling the internal nodes of T
  - such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized.
E(T) is the set of edges in the tree T; H(i,j) is the Hamming distance (minimum number of substitutions) between the sequences labeling the two endpoints of edge (i,j).
T. Warnow
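The quantity being minimized can be computed directly once every node of a candidate tree carries a sequence. The sketch below scores two fully labeled toy trees; it only evaluates given labelings and does not search over trees or internal labels, which is the hard part of the problem. The sequences, node names, and function names are illustrative assumptions.

```python
def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))

def parsimony_length(edges, labels):
    """Sum of Hamming distances over all edges (i, j) of a labeled tree."""
    return sum(hamming(labels[i], labels[j]) for i, j in edges)

# Four leaf sequences shared by both candidate trees.
leaf_labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA"}

# Candidate 1 groups (a, b) and (c, d); u and v are internal nodes.
tree1_edges = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
tree1_labels = dict(leaf_labels, u="ACA", v="GTA")

# Candidate 2 groups (a, c) and (b, d).
tree2_edges = [("a", "u"), ("c", "u"), ("u", "v"), ("b", "v"), ("d", "v")]
tree2_labels = dict(leaf_labels, u="GTT", v="GTA")

print("tree 1:", parsimony_length(tree1_edges, tree1_labels))
print("tree 2:", parsimony_length(tree2_edges, tree2_labels))
```

Under these labelings the first tree needs fewer substitutions, so parsimony would prefer it over the second.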
17. The fundamental principle behind maximum parsimony
- Occam's Razor
  - "Entia non sunt multiplicanda praeter necessitatem." (Entities should not be multiplied beyond necessity.)
  - William of Occam (1300-1349)
The best tree is the one that requires the fewest substitutions
18. Maximum Likelihood
- Input: Set S of n aligned sequences of length k, and a specified parametric model
- Output:
  - A phylogenetic tree T leaf-labeled by sequences in S
  - With additional model parameters (e.g. edge lengths)
  - such that $\Pr[S \mid T, \text{parameters}]$ is maximized.
- (Recall that ML methods seek to identify a model that maximizes the probability of the data.)
19. Maximum Likelihood (in English)
- Requires a model of evolution
- Each substitution has an associated likelihood given a branch of a certain length
- A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
- Find the tree that maximizes the likelihood
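To make the steps above concrete, here is a minimal sketch for the smallest possible case: two aligned DNA sequences joined by a single branch. It assumes the Jukes-Cantor (JC69) substitution model, which is a choice made for this illustration and is not named in the slides. The likelihood is maximized by a plain grid search over the branch length and compared with the known closed-form JC69 estimate. The sequences and function names are invented.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, branch_length):
    """Log-likelihood of two aligned sequences under JC69 for one branch length.

    branch_length is the expected number of substitutions per site (> 0).
    """
    decay = math.exp(-4.0 * branch_length / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    log_lik = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base observed in seq_a.
        log_lik += math.log(0.25 * (p_same if x == y else p_diff))
    return log_lik

def jc69_distance(p):
    """Closed-form JC69 estimate from the observed fraction of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Invented sequences that differ at 4 of 20 sites.
ml_seq_a = "ACGTACGTACGTACGTACGT"
ml_seq_b = "CCGTAGGTACTTACGAACGT"

# Grid search: evaluate the likelihood function and keep the best branch length.
candidates = [i / 1000 for i in range(1, 2001)]
best_length = max(candidates, key=lambda d: jc69_log_likelihood(ml_seq_a, ml_seq_b, d))

observed_p = sum(x != y for x, y in zip(ml_seq_a, ml_seq_b)) / len(ml_seq_a)
print("grid-search estimate:", best_length)
print("closed-form estimate:", jc69_distance(observed_p))
```

A full ML tree search repeats this kind of optimization over every branch of every candidate topology, typically with a numerical optimizer in place of a grid.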
20. Molecular Clock
- UPGMA implicitly assumes that all distances are clock-like
[Figure: tree diagrams with leaves labeled 1 to 4]
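One way to see what "clock-like" means for a distance matrix is the three-point (ultrametric) condition: for every triple of taxa, the two largest of the three pairwise distances are equal. The sketch below checks that condition on two invented matrices over taxa 1 to 4. The numbers are made up for illustration and are not read from the figure, and the function names are hypothetical.

```python
from itertools import combinations

def symmetric(pairs):
    """Expand {(a, b): distance} into a symmetric nested dictionary."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist

def is_ultrametric(dist, tol=1e-9):
    """Check the three-point condition on every triple of taxa."""
    for a, b, c in combinations(sorted(dist), 3):
        d = sorted([dist[a][b], dist[a][c], dist[b][c]])
        if abs(d[2] - d[1]) > tol:
            return False  # the two largest distances differ
    return True

# Clock-like: every leaf is equally far from the root of ((1,2),(3,4)).
clock_like = symmetric({
    ("1", "2"): 2, ("3", "4"): 4,
    ("1", "3"): 6, ("1", "4"): 6, ("2", "3"): 6, ("2", "4"): 6,
})

# Not clock-like: the lineage leading to taxon 4 has evolved faster.
not_clock_like = symmetric({
    ("1", "2"): 2, ("3", "4"): 8,
    ("1", "3"): 6, ("2", "3"): 6, ("1", "4"): 10, ("2", "4"): 10,
})

print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))
```

When the condition fails, a method that assumes a clock may still return a tree, but its branching order deserves extra scrutiny.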
21. More about molecular clocks
- The molecular clock (based on the molecular clock hypothesis (MCH)) is a technique in genetics to date when two species diverged. Elapsed time is deduced by applying a time scale to the number of molecular differences measured between the species' DNA sequences or proteins.
http://en.wikipedia.org/wiki/Molecular_clock
22. More about molecular clocks
- The notion of a so-called "molecular clock" was first attributed to Emile Zuckerkandl and Linus Pauling who, in 1962, noticed that the number of amino acid differences in hemoglobin between lineages scales roughly with divergence times, as estimated from fossil evidence. They generalized this observation to assert that the rate of evolutionary change of any specified protein was approximately constant over time and over different lineages.
- Later, Allan Wilson and Vincent Sarich built upon this work, and the work of Motoo Kimura observed and formalized that rare spontaneous errors in DNA replication cause the mutations that drive molecular evolution, and that the accumulation of evolutionarily "neutral" differences between two sequences could be used to measure time, if the error rate of DNA replication could be calibrated.
http://en.wikipedia.org/wiki/Molecular_clock
23. Quantifying Error
- FN: false negative (missing edge)
- FP: false positive (incorrect edge)
[Figure: tree comparison with an FN edge and an FP edge marked]
Courtesy of T. Warnow
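Counting FN and FP edges is commonly done by describing each internal edge as the bipartition of the taxa it induces and comparing the two sets. The sketch below takes that approach as an assumption; the slide does not specify an encoding. The taxa, trees, and function names are invented for illustration.

```python
def canonical_split(side, all_taxa):
    """Represent a bipartition by the side containing the smallest taxon name."""
    side = frozenset(side)
    other = frozenset(all_taxa) - side
    return side if min(all_taxa) in side else other

def fn_fp(true_splits, estimated_splits, all_taxa):
    """Return (false negatives, false positives) between two sets of splits."""
    true_set = {canonical_split(s, all_taxa) for s in true_splits}
    est_set = {canonical_split(s, all_taxa) for s in estimated_splits}
    false_negatives = len(true_set - est_set)   # true edges missing from the estimate
    false_positives = len(est_set - true_set)   # estimated edges absent from the truth
    return false_negatives, false_positives

taxa = {"A", "B", "C", "D", "E"}

# True tree ((A,B),C,(D,E)): internal edges induce the splits AB|CDE and DE|ABC.
true_tree_splits = [{"A", "B"}, {"D", "E"}]

# Estimated tree ((A,C),B,(D,E)): internal edges induce AC|BDE and DE|ABC.
estimated_tree_splits = [{"A", "C"}, {"D", "E"}]

print(fn_fp(true_tree_splits, estimated_tree_splits, taxa))
```

For two fully resolved trees on the same taxa the FN and FP counts are always equal; they differ only when one tree contains multifurcations.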
24. Two masking approaches
- Construct many MSAs for the same set of sequences (varying alignment parameters), concatenate the alignments, and estimate trees (W. Wheeler)
  - Expectation is that uncertain regions of the alignment will cancel each other out, whereas signal from regions with consensus support will be magnified
- Delete noisy columns
  - Options include
    - Remove gappy columns only
    - Remove gappy and noisy columns
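The "remove gappy columns" option can be sketched in a few lines. The 50% gap threshold, the toy alignment, and the function name below are assumptions made for this example; the slides do not give a threshold, and identifying "noisy" columns would need a separate criterion that is not shown here.

```python
def remove_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment is a list of equal-length strings, one per sequence.
    """
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    keep = [
        col for col in range(n_cols)
        if sum(row[col] == gap for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]

# Invented toy alignment: the third column is 75% gaps.
gappy_alignment = ["AC-GT", "A--GT", "ACTG-", "A--GA"]

masked = remove_gappy_columns(gappy_alignment)
strict = remove_gappy_columns(gappy_alignment, max_gap_fraction=0.0)
print(masked)
print(strict)
```

Setting the threshold to 0.0 keeps only columns with no gaps at all, which is the most aggressive form of this masking.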
25. Rooting the tree (or not)
- Most tree methods produce unrooted trees (even if they appear to be rooted)
- In construction of species trees, use of an outgroup sequence is recommended (and relatively straightforward)
- Choosing an outgroup for protein superfamily reconstruction is not so straightforward
- Alternatives
  - Root the tree by finding the midpoint on the longest path between any pair of leaves (assumes molecular clock)
  - Root the tree by locating the root that requires the minimal number of mutations from the ancestor to current sequences
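Midpoint rooting, the first alternative above, can be sketched as follows: find the longest leaf-to-leaf path, then walk along it until half of its length has been covered. The tree, its edge lengths, and the function names are invented for illustration, and the brute-force search over all leaves is chosen for clarity, not efficiency.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (node_a, node_b, length) triples."""
    adj = {}
    for a, b, length in edges:
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    return adj

def paths_from(adj, start):
    """Return {node: (distance, path)} for every node, measured from start."""
    result = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = result[node]
        for nbr, length in adj[node].items():
            if nbr not in result:
                result[nbr] = (dist + length, path + [nbr])
                stack.append(nbr)
    return result

def midpoint_root(adj):
    """Return (node_a, node_b, offset): the root lies on edge a-b, offset from a."""
    leaf_nodes = [n for n in adj if len(adj[n]) == 1]
    best_dist, best_path = -1.0, None
    for leaf in leaf_nodes:
        for node, (dist, path) in paths_from(adj, leaf).items():
            if node in leaf_nodes and dist > best_dist:
                best_dist, best_path = dist, path
    half = best_dist / 2.0
    walked = 0.0
    for a, b in zip(best_path, best_path[1:]):
        length = adj[a][b]
        if walked + length >= half:
            return a, b, half - walked
        walked += length

# Hypothetical unrooted tree with internal nodes u and v.
rooting_tree = build_adjacency([
    ("human", "u", 1.0), ("mouse", "u", 1.5), ("u", "v", 2.0),
    ("fly", "v", 3.0), ("worm", "v", 4.0),
])
print(midpoint_root(rooting_tree))
```

In this toy tree the longest path joins mouse and worm, so the root falls on the branch leading to worm, a quarter of a unit from node v. If the worm lineage had simply evolved faster than the others, this placement would be misleading, which is why the method is said to assume a molecular clock.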
26. Rooting the tree using an outgroup sequence
- Choosing an outgroup
  - For species trees: choose a related organism that is more distant from every ingroup sequence than the ingroup sequences are from each other.
  - For protein family trees: choose a related protein that is evolutionarily more distant from every ingroup sequence than the ingroup sequences are from each other
    - (how to do this, since we don't know the evolutionary history??)
- Place the root somewhere between the outgroup and the rest (either on the node or in a branch)
27. Bootstrap Analysis
- Columns from the input alignment are resampled with replacement to create many bootstrap replicate data sets
  - Each resampled MSA will have the same number of columns -- some columns may be sampled multiple times, and others not at all
- A phylogenetic tree is constructed for each bootstrap replicate data set (e.g. with parsimony, distance, ML etc.)
- Agreement among the resulting trees is summarized with a majority-rule consensus tree (e.g., PHYLIP consense program)
- Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
- Bootstrap analysis does not actually tell you the expected accuracy of a subtree -- just the support for that subtree in the alignment
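The resampling step and the bootstrap proportion can be sketched as below. Tree building itself is omitted, so the clade sets passed to `bootstrap_proportion` are invented placeholders standing in for trees estimated from each replicate. All names here are hypothetical.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the column count."""
    n_cols = len(alignment[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[col] for col in chosen) for row in alignment]

def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate several replicate alignments from one seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]

def bootstrap_proportion(replicate_clades, clade):
    """Fraction of replicate trees that contain the given clade (group of taxa)."""
    clade = frozenset(clade)
    hits = sum(clade in clades for clades in replicate_clades)
    return hits / len(replicate_clades)

# Invented alignment: rows are sequences, columns are sites.
boot_alignment = ["ACGTAC", "ACGTTC", "AGGTAC", "AGCTAC"]
replicates = bootstrap_replicates(boot_alignment, n_replicates=100)

# Placeholder clade sets, standing in for trees built from four replicates.
example_clades = [
    {frozenset({"human", "mouse"})},
    {frozenset({"human", "mouse"})},
    {frozenset({"human", "fly"})},
    {frozenset({"human", "mouse"})},
]
print(bootstrap_proportion(example_clades, {"human", "mouse"}))
```

Each replicate keeps whole columns intact, so every sequence is resampled at the same sites and the rows stay aligned with one another.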
28. Phylogenetic uncertainty and subtree neighbors
If no known function(s) are available for a group of sequences (e.g., subtree A, in the figure above), function may be inferred based on data available for sequences in a neighboring subtree. (This idea is referred to as "subtree neighbors" by Zmasek and Eddy.) However, the coarse branching order between clades can be highly variable. If you build four trees (NJ, ML, MP, MrBayes) you may find four (or more!) different tree topologies.
Example: SH2 domains. SH2 domains bind phosphopeptides; the residues responsible for binding are generally conserved within subfamilies but vary across subfamilies. In cases where we know something about the function of members of a family, can we use this information to evaluate which of the different trees might be more reliable?
[Figure: Src homology 2 domain, 1SPSA]
29. Recommendations on phylogeny estimation
- Most important factor is the quality of the input data
- Look at the data from as many angles as possible. Does this sound familiar??
- Choice of outgroup taxa can have as much influence as the ingroup taxa.
  - Compute every analysis with several outgroups?
- Recognize that different programs can be influenced by the order of the analysis, i.e. the order in which the sequences appear in your file.
  - PHYLIP and PAUP offer a "jumble" option that reruns the analysis with different input orders.
http://biology.unm.edu/biology/maggieww/Public_Html/phylogeny.htm
30. Orthology prediction
31. Major papers on orthology prediction
- Fitch, W.M. (2000) Homology: a personal view on some of the problems. Trends Genet, 16, 227-231.
- Sonnhammer and Koonin, Orthology, paralogy and proposed classification for paralog subtypes. Trends in Genetics, 2002. (Introduces key terms such as inparalog)
- Remm, M., Storm, C.E. and Sonnhammer, E.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol, 314, 1041-1052. (InParanoid method)
- Christian M Zmasek and Sean R Eddy (2002) RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. (Introduces the terms super-ortholog and ultra-paralog)
- Tatusov, R.L. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.
- Chen, F., Mackey, A.J., Stoeckert, C.J., Jr. and Roos, D.S. (2006) OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res, 34, D363-368.
- Jensen, L.J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T. and Bork, P. (2008) eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res, 36, D250-254.
- Ruan, J., Li, H., Chen, Z., Coghlan, A., Coin, L.J., Guo, Y., Heriche, J.K., Hu, Y., Kristiansen, K., Li, R. et al. (2008) TreeFam: 2008 Update. Nucleic Acids Res, 36, D735-740. (Phylogenomic identification of orthologs, but restricted to animal genomes)
- Ruchira S. Datta, Christopher Meacham, Bushra Samad, Christoph Neyer and Kimmen Sjölander (2009) "Berkeley PHOG: PhyloFacts Orthology Group Prediction Web Server," Nucleic Acids Research (2009 Web server issue)
32. Other papers about orthologs
- Greg Petsko, "Homologuephobia", Genome Biology 2001
- Eugene Koonin, "An Apology for Orthologs - or brave new memes", Genome Biology 2001
- R A Jensen, "Orthologs and Paralogs - we need to get it right", Genome Biology 2001 (includes a discussion of conflicting opinions by Eugene Koonin and Greg Petsko on this subject)
- Iddo Friedberg, "Automated Function Prediction: the Genomic Challenge", Briefings in Bioinformatics, 2006 (includes a discussion of how function prediction methods are evaluated)
33. Why is orthology important?
34. Orthology definitions
Orthology (standard definition): the MRCA must correspond to a speciation event. (By this definition, the yeast sequence is orthologous to both H1 and H2, which are co-orthologs to the yeast sequence.)
Super-orthology is more restrictive than orthology: all nodes on a path between two leaves must correspond to a speciation event. (Zmasek & Eddy, 2002)
[Figure: gene tree with node labels S and D, a "Super-orthologs" label, and leaves H1 C1 M1 R1 F1 W1, H2 C2 M2 R2 F2 W2 and Yeast, where H, C, M, R, F, W = Human, Chimp, Mouse, Rat, Fly, Worm]
Advantages of super-orthology: it is transitive, and provides a secure basis for inference of function. Disadvantages: highly restrictive (low coverage)
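The two definitions can be checked mechanically on a gene tree whose internal nodes are already labeled as speciation or duplication events. The sketch below uses a reduced version of the example, with three species per duplicate instead of six. It assumes that "S" marks a speciation node and "D" a duplication node, and that the duplication sits below the split from yeast, as the text about H1, H2 and the yeast sequence implies. The nested-tuple encoding and the function names are inventions of this example.

```python
def path_to_leaf(tree, leaf, prefix=()):
    """Return the internal nodes from the root down to a leaf, or None if absent."""
    if isinstance(tree, str):
        return prefix if tree == leaf else None
    for child in tree[1:]:
        found = path_to_leaf(child, leaf, prefix + (tree,))
        if found is not None:
            return found
    return None

def events_between(tree, leaf_a, leaf_b):
    """Return (MRCA event, events of all internal nodes on the leaf-to-leaf path)."""
    path_a = path_to_leaf(tree, leaf_a)
    path_b = path_to_leaf(tree, leaf_b)
    if path_a is None or path_b is None:
        raise KeyError("leaf not found in tree")
    shared = 0
    while shared < min(len(path_a), len(path_b)) and path_a[shared] is path_b[shared]:
        shared += 1
    mrca = path_a[shared - 1]
    on_path = [mrca] + list(path_a[shared:]) + list(path_b[shared:])
    return mrca[0], [node[0] for node in on_path]

def is_ortholog(tree, leaf_a, leaf_b):
    """Standard definition: the MRCA of the two genes is a speciation node."""
    return events_between(tree, leaf_a, leaf_b)[0] == "S"

def is_super_ortholog(tree, leaf_a, leaf_b):
    """Super-orthology: every node on the path between the genes is a speciation."""
    return all(event == "S" for event in events_between(tree, leaf_a, leaf_b)[1])

# Each internal node is (event, child, child); leaves are gene names.
copy1 = ("S", ("S", "H1", "M1"), "F1")
copy2 = ("S", ("S", "H2", "M2"), "F2")
gene_tree = ("S", "Yeast", ("D", copy1, copy2))

print(is_ortholog(gene_tree, "Yeast", "H1"), is_super_ortholog(gene_tree, "Yeast", "H1"))
print(is_ortholog(gene_tree, "H1", "M1"), is_super_ortholog(gene_tree, "H1", "M1"))
print(is_ortholog(gene_tree, "H1", "H2"))
```

The yeast gene is an ortholog of H1 but not a super-ortholog, because the path between them passes through the duplication node. That restriction is what gives super-orthology its low coverage.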
35. New terms
- Co-ortholog
- Inparalog
- Outparalog
Sonnhammer and Koonin, Orthology, paralogy and
proposed classification for paralog subtypes,
Trends in Genetics, 2002
36. "Orthologs and paralogs - we need to get it right", Roy A Jensen, Genome Biology 2001
37. Gene tree-species tree reconciliation is used to identify orthologs
38. Ortholog prediction accuracy
- PHOG-O: Standard orthology definition
- PHOG-S: Super-orthologs
- PHOG-T: thresholded PHOGs
- PHOG-T(M): optimized for mouse
- PHOG-T(Z): optimized for zebrafish
- PHOG-T(F): optimized for fruit fly
Benchmark dataset: 100 (non-homologous) human sequences from TreeFam-A, filtered to remove homologs, and manually curated orthologs from mouse, zebrafish and fruit fly.
39. PHOG search results: genes can belong to multiple orthology groups based on regions selected, or time points at which sequences were clustered
40. How is PHOG different from InParanoid, OrthoMCL and TreeFam?
- PHOG is based on precalculated trees in PhyloFacts: 55K protein families (and expanding daily)
- PhyloFacts includes trees for individual domains as well as entire domain architectures
- PhyloFacts trees include sequences from across the Tree of Life (all species, not restricted to whole genomes)
- TreeFam is restricted to animal genomes
- TreeFam uses gene tree-species tree reconciliation (PHOG does not)
  - This should enable TreeFam to be more accurate, assuming the species tree is correct (note: this cannot always be assumed)
- Advantages of PHOG
  - Individual domains provide different evolutionary perspectives
  - Improved taxonomic sampling (known to be critical for tree topology accuracy)
  - Requiring global homology is often too restrictive and can cause topology errors
  - Allowing local matches avoids problems with rejecting sequences due to gene model errors (very common among eukaryotes)
41. Relevant questions
- What are the uses of orthologs?
- What is the (accepted/common) definition of ortholog and paralog?
- What is the most common method used to identify orthologs? Is this approach rigorous?
- How can you be sure you have an ortholog? (What evidence is required to assert two proteins are orthologs?)
- What is the actual significance of two proteins being orthologous?
- How is orthology accuracy assessed? Is there a standard?
- What is the expected accuracy of ortholog identification using different methods?