Introduction to phylogenetic/phylogenomic concepts and methods

Provided by: makanasta

1
Introduction to phylogenetic/phylogenomic
concepts and methods
  • With a brief discussion of phylogenomic methods
    of species phylogeny estimation and orthology
    prediction methods

2
  • "Nothing in Biology Makes Sense Except in the
    Light of Evolution"
  • Theodosius Dobzhansky (1973)

3
Revised tree of life, including horizontal transfer
[Figure labels: horizontal transfer; endosymbiosis]
4
Is Phylogenetic Analysis Truth?
  • No -- it's just a prediction, and like any other
    prediction method, it is prone to errors of
    different types.
  • It's important that you be aware of sources of
    possible errors in a phylogenetic tree
    reconstruction
  • However, phylogenetic reconstruction is one of
    the best tools available for understanding both
    how species evolve and how gene families evolve
    novel functions and structures.
  • As Winston Churchill said

5
Winston Churchill
http://www.winstonchurchill.org/
6
Workflow for phylogenomic analysis
Sjölander, "Phylogenomic inference of protein
molecular function: advances and challenges,"
Bioinformatics (2004) 20(2): 170-179.
7
Delsuc et al., "Phylogenomics and the
reconstruction of the tree of life," Nature
Reviews Genetics 6, 361-375 (May 2005)
8
Trees are a special type of graph
  • Graphs have nodes (vertices) and edges (branches)
  • Edges can be directed or undirected
  • Nodes can be internal or terminal
  • Terminal nodes in a phylogenetic tree are called
    leaves (or taxa)
  • The term taxon refers to (groups of) species,
    but is commonly used to describe genes in
    multi-gene families, even when the same species
    may be found in multiple copies in the tree
  • Trees are a special subtype of graph (acyclic
    connected graphs)
  • The valency (or degree) of a node equals the
    number of edges incident to it
  • A tree for which every internal node (except for
    the root) has degree 3 (one ancestor and two
    children) is called a bifurcating or binary
    tree.
  • Trees for which internal nodes can have >2
    children are called multifurcating trees
  • The diameter of a tree is the length of the
    longest path between two leaves (summing edge
    lengths, not simply counting edges)
  • Most phylogenetic trees are unrooted, and special
    methods must be used to infer the root.

9
Interpreting tree topologies
  • Many phylogenetic trees are not meant to be
    interpreted as rooted (more about this later)
  • Terminal nodes (leaves) represent contemporary
    taxa (organisms, genes, proteins, or other
    objects)
  • Internal nodes represent inferred ancestors
  • Edge lengths are supposed to be proportional to
    the evolutionary distance

10
[Figure: example tree with leaves human, mouse and
fruit fly. Labels: node; a clade; root?; taxa
(singular: taxon); terminal nodes (leaves)]
From Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins, edited by
Baxevanis & Ouellette
11
Finding an optimal tree topology is unlikely!
  • Number of (unrooted) binary trees on n leaves is
    (2n-5)!!
  • If each tree on 1000 taxa could be analyzed in
    0.001 seconds, we would find the best tree in
    2890 millennia
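The growth of (2n-5)!! can be checked directly. The following is a minimal sketch, not part of the original slides; the function names and the default cost of 0.001 seconds per tree are illustrative assumptions.

```python
def num_unrooted_binary_trees(n_leaves: int) -> int:
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves."""
    if n_leaves < 3:
        return 1  # with fewer than three leaves there is a single topology
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5).
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


def exhaustive_search_seconds(n_leaves: int, seconds_per_tree: float = 0.001) -> float:
    """Time to score every topology at an assumed fixed cost per tree."""
    return num_unrooted_binary_trees(n_leaves) * seconds_per_tree


# Tabulate the count for a few small leaf sets.
tree_counts = {n: num_unrooted_binary_trees(n) for n in (4, 5, 10, 20)}
```

Already at 20 leaves the count is about 2.2e20, which is why tree search in practice relies on heuristics rather than exhaustive enumeration.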

12
Morphological vs. Molecular
  • Classical phylogenetic analysis: morphological
    features
  • Presence of vertebrae, number of legs, fur vs.
    scales, etc.
  • Molecular evolution: using molecular features
    (DNA, RNA and proteins)
  • Analysis based on homologous sequences (e.g.,
    globins) in different species
  • Additional issues
  • Restriction to orthologs is critical when
    reconstructing species phylogenies.
  • If estimating a gene family phylogeny, you're
    better off gathering all homologs and figuring
    out orthologs from the tree topology.
  • Slow-evolving genes are best for reconstructing
    phylogenies of distant species; use protein
    sequences
  • Fast-evolving genes (and use of nucleotide data)
    may be more effective at reconstructing
    phylogenies of closely related species

13
Rooted vs unrooted trees
  • Most trees are unrooted -- you need to take
    additional actions to root the tree
  • Rooting a species tree: not so hard
  • Rooting a protein superfamily tree (with gene
    duplication): hard

14
Two main classes of tree inference methods
  • Distance-based
  • Input is a matrix of distances between species
  • Can be the fraction of residues they disagree on,
    or the negative alignment score between them, etc.
  • E.g., Neighbor-Joining, UPGMA, etc.
  • Character-based
  • Examine each character (e.g., residue) separately
  • E.g., Maximum Likelihood, Maximum Parsimony,
    MrBayes
  • Question to test understanding
  • Is SATCHMO a distance-based or character-based
    method? What about SCI-PHY?
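As an illustration of the distance-based input described above, here is a minimal sketch that builds a matrix of pairwise distances using the fraction of aligned residues that disagree. The function names, the choice to skip gap positions, and the toy alignment are assumptions made for the example, not something specified on the slide.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, non-gap positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: positions with a gap are ignored
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment: dict) -> dict:
    """Return a nested dict dist[a][b] of pairwise p-distances."""
    names = sorted(alignment)
    dist = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        dist[a][b] = dist[b][a] = d
    return dist


# Toy alignment (hypothetical sequences).
toy_alignment = {"human": "ACGTACGT", "mouse": "ACGTACGA", "fly": "ACCTAC-A"}
toy_distances = distance_matrix(toy_alignment)
```

A matrix like this is what Neighbor-Joining or UPGMA would take as input; character-based methods would instead work on the alignment columns themselves.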

15
Distance-based Phylogenetic Methods
T. Warnow
16
Maximum Parsimony
  • Input: set S of n aligned sequences of length k
  • Output:
  • A phylogenetic tree T leaf-labeled by sequences
    in S
  • Additional sequences of length k labeling the
    internal nodes of T
  • such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized.

E(T) is the set of edges in the tree T; H(i,j) is
the Hamming distance (minimum number of
substitutions) between the sequences labeling the
endpoints of edge (i,j).
T. Warnow
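For a fixed tree topology, the minimum number of substitutions can be computed column by column with Fitch's algorithm. The sketch below is an illustration under stated assumptions: the tree is given as nested pairs of leaf names (an arbitrary rooting of a binary tree), and the function names and toy alignment are hypothetical. It scores one given topology; it does not search over topologies.

```python
def fitch_column(tree, states):
    """Return (candidate state set, substitutions) for one alignment column."""
    if isinstance(tree, str):
        return {states[tree]}, 0  # a leaf contributes its observed state
    left, right = tree
    left_set, left_cost = fitch_column(left, states)
    right_set, right_cost = fitch_column(right, states)
    common = left_set & right_set
    if common:
        # The children can share a state: no substitution is needed here.
        return common, left_cost + right_cost
    # Otherwise one substitution is required on one of the two child edges.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch substitution counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for col in range(length):
        states = {name: seq[col] for name, seq in alignment.items()}
        total += fitch_column(tree, states)[1]
    return total


# Toy alignment (hypothetical sequences) scored on two candidate topologies.
toy = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCGT"}
score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), toy)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), toy)
```

Under the parsimony criterion, the topology with the lower score is preferred.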
17
The fundamental principle behind maximum parsimony
  • Occam's Razor
  • Entia non sunt multiplicanda praeter
    necessitatem. (Entities should not be
    multiplied beyond necessity.)
  • William of Occam (1300-1349)

The best tree is the one that requires the fewest
substitutions
18
Maximum Likelihood
  • Input: set S of n aligned sequences of length k,
    and a specified parametric model
  • Output:
  • A phylogenetic tree T leaf-labeled by sequences
    in S
  • With additional model parameters (e.g., edge
    lengths)
  • such that Pr[S | T, parameters] is maximized.
  • (Recall that ML methods seek to identify a model
    that maximizes the probability of the data.)

19
Maximum Likelihood (in English)
  • Require a model of evolution
  • Each substitution has an associated likelihood
    given a branch of a certain length
  • A function is derived to represent the likelihood
    of the data given the tree, branch-lengths and
    additional parameters
  • Find the tree that maximizes the likelihood
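The steps above can be illustrated on the smallest possible case: two aligned DNA sequences joined by a single branch, under the Jukes-Cantor (JC69) model. This is a sketch under assumptions, not the method of any particular program; the model choice, the function names and the grid search are all illustrative. A real ML program also searches over tree topologies and richer models.

```python
import math


def jc69_log_likelihood(n_same: int, n_diff: int, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by branch length d."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # Each site contributes (1/4) * P(x -> y | d); 1/4 is the base frequency.
    return (n_same * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_branch_length(n_same: int, n_diff: int,
                     step: float = 1e-4, max_d: float = 3.0) -> float:
    """Grid search for the branch length that maximizes the likelihood."""
    best_d = step
    best_ll = jc69_log_likelihood(n_same, n_diff, step)
    for k in range(2, int(max_d / step) + 1):
        d = k * step
        ll = jc69_log_likelihood(n_same, n_diff, d)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d


def jc69_closed_form(n_same: int, n_diff: int) -> float:
    """Analytic JC69 distance, d = -(3/4) ln(1 - 4p/3), for comparison."""
    p = n_diff / (n_same + n_diff)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For example, with 90 identical and 10 differing sites, the grid search and the closed-form estimate should agree to within the grid spacing.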

20
Molecular Clock
  • UPGMA implicitly assumes that all distances are
    clock-like

[Figure: tree diagrams with leaves labeled 1, 2, 3, 4]
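To make the clock assumption concrete, here is a minimal UPGMA sketch. It repeatedly merges the two closest clusters and places their ancestor at half the distance between them, which is only appropriate when distances are clock-like (ultrametric). The data layout (a list of names plus a symmetric matrix) and the function name are assumptions made for the example.

```python
def upgma(names, matrix):
    """Cluster leaves by UPGMA; return (nested-tuple tree, root height)."""
    # Active clusters: id -> (subtree, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        height = dist[(i, j)] / 2.0     # clock assumption: equal depth on both sides
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average keeps distances equal to leaf-pair means.
            new_dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, height)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


# Toy ultrametric distances (hypothetical values).
toy_names = ["A", "B", "C", "D"]
toy_matrix = [[0, 2, 8, 8],
              [2, 0, 8, 8],
              [8, 8, 0, 4],
              [8, 8, 4, 0]]
toy_tree, toy_height = upgma(toy_names, toy_matrix)
```

When rates differ strongly between lineages, the distances are no longer ultrametric and UPGMA can group fast- or slow-evolving sequences incorrectly.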
21
More about molecular clocks
  • The molecular clock (based on the molecular clock
    hypothesis (MCH)) is a technique in genetics to
    date when two species diverged. Elapsed time is
    deduced by applying a time scale to the number of
    molecular differences measured between the
    species' DNA sequences or proteins.

http://en.wikipedia.org/wiki/Molecular_clock
22
More about molecular clocks
  • The notion of the existence of a so-called
    "molecular clock" was first attributed to Emile
    Zuckerkandl and Linus Pauling who, in 1962,
    noticed that the number of amino acid differences
    in hemoglobin between lineages scales roughly
    with divergence times, as estimated from fossil
    evidence. They generalized this observation to
    assert that the rate of evolutionary change of
    any specified protein was approximately constant
    over time and over different lineages.
  • Later work by Allan Wilson and Vincent Sarich,
    and by Motoo Kimura, built on this observation
    and formalized the idea that rare spontaneous
    errors in DNA replication cause the mutations
    that drive molecular evolution, and that the
    accumulation of evolutionarily "neutral"
    differences between two sequences could be used
    to measure time, if the error rate of DNA
    replication could be calibrated.

http://en.wikipedia.org/wiki/Molecular_clock
23
Quantifying Error
FN = false negative (missing edge)
FP = false positive (incorrect edge)
Courtesy of T. Warnow
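One common way to count missing and incorrect edges is to compare the leaf bipartitions (splits) induced by the internal edges of the true and estimated trees. The sketch below assumes trees written as nested tuples of leaf names; the function names are hypothetical and the slide does not prescribe this particular representation.

```python
def bipartitions(tree):
    """Return the set of non-trivial leaf bipartitions of a nested-tuple tree."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only internal edges: both sides must hold at least two leaves.
        if len(below) > 1 and len(other) > 1:
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def edge_errors(true_tree, estimated_tree):
    """Return (false negatives, false positives) as counts of edges."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    false_negatives = true_splits - est_splits  # true edges that are missing
    false_positives = est_splits - true_splits  # estimated edges that are wrong
    return len(false_negatives), len(false_positives)


# Toy example (hypothetical trees on five leaves).
true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
fn_count, fp_count = edge_errors(true_tree, estimated_tree)
```

Storing each split as an unordered pair of leaf sets makes the comparison independent of where the nested-tuple representation happens to be rooted.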
24
Two masking approaches
  • Construct many MSAs for same set of sequences
    (varying alignment parameters), concatenate
    alignments, and estimate trees (W. Wheeler)
  • Expectation is that uncertain regions of
    alignment will cancel each other out, whereas
    signal from regions with consensus support will
    be magnified
  • Delete noisy columns
  • Options include
  • Remove gappy columns only
  • Remove gappy and noisy columns
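The "remove gappy columns" option can be sketched as follows. The 50% gap threshold, the function name and the toy alignment are assumptions for illustration; masking tools differ in how they define gappy and noisy columns.

```python
def mask_gappy_columns(alignment: dict, max_gap_fraction: float = 0.5) -> dict:
    """Drop alignment columns whose fraction of gaps exceeds the threshold."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = []
    for col in range(length):
        gaps = sum(1 for name in names if alignment[name][col] == "-")
        if gaps / len(names) <= max_gap_fraction:
            keep.append(col)
    # Rebuild every sequence from the retained columns only.
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


# Toy alignment (hypothetical sequences); the third column is mostly gaps.
toy_msa = {"s1": "AC-GT", "s2": "A--GT", "s3": "ACTG-"}
masked = mask_gappy_columns(toy_msa)
```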

25
Rooting the tree (or not)
  • Most tree methods produce unrooted trees (even if
    they appear to be rooted)
  • In construction of species trees, use of an
    outgroup sequence is recommended (and relatively
    straightforward)
  • Choosing an outgroup for protein superfamily
    reconstruction is not so straightforward
  • Alternatives
  • Root the tree by finding the midpoint on the
    longest path between any pair of leaves (assumes
    molecular clock)
  • Root the tree by locating the root that requires
    the minimal number of mutations from the ancestor
    to current sequences
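Midpoint rooting can be sketched on a tree stored as a list of weighted edges. The code finds the longest leaf-to-leaf path (the diameter) with two traversals and then walks back along it to the halfway point. The edge-list representation, the function names and the toy tree are assumptions made for the example.

```python
from collections import defaultdict


def farthest(adjacency, start):
    """Return (farthest node, distances from start, parent pointers)."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                parent[neighbor] = node
                stack.append(neighbor)
    return max(dist, key=dist.get), dist, parent


def midpoint_root(edges):
    """Return (u, v, offset): the root lies on edge (u, v), `offset` away from u."""
    adjacency = defaultdict(list)
    for u, v, length in edges:
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))
    # In a tree, the node farthest from any start is one end of the diameter.
    first, _, _ = farthest(adjacency, edges[0][0])
    second, dist, parent = farthest(adjacency, first)
    half = dist[second] / 2.0
    node = second
    # Walk from the far end toward `first` until the midpoint is crossed.
    while dist[parent[node]] > half:
        node = parent[node]
    return parent[node], node, half - dist[parent[node]]


# Toy unrooted tree (hypothetical edge lengths); x and y are internal nodes.
toy_edges = [("A", "x", 1.0), ("B", "x", 3.0), ("x", "y", 1.0),
             ("C", "y", 2.0), ("D", "y", 5.0)]
root_position = midpoint_root(toy_edges)
```

In this toy tree the longest path runs between B and D (length 9), so the root is placed 4.5 units from D on the edge leading to y.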

26
Rooting the tree using an outgroup sequence
  • Choosing an outgroup
  • For species trees: choose a related organism that
    is more distant from each of the ingroup
    sequences than they are from each other.
  • For protein family trees: choose a related
    protein that is evolutionarily more distant from
    each ingroup sequence than any are from each other
  • (How can this be done when we don't know the
    evolutionary history?)
  • Place the root somewhere between the outgroup and
    the rest (either on the node or in a branch)

27
Bootstrap Analysis
  • Columns from the input alignment are resampled
    with replacement to create many bootstrap
    replicate data sets
  • each resampled MSA will have the same number of
    columns -- some columns may be sampled multiple
    times, and others not at all
  • A phylogenetic tree is constructed for each
    bootstrap replicate data set (e.g. with
    parsimony, distance, ML etc.)
  • Agreement among the resulting trees is
    summarized with a majority-rule consensus tree
    (e.g., PHYLIP consense program)
  • Frequencies of occurrence of groups, bootstrap
    proportions (BPs), are a measure of support for
    those groups
  • Bootstrap analysis does not actually tell you the
    expected accuracy of that subtree -- just the
    support for that subtree in the alignment
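The resampling step can be sketched as below. Only the creation of bootstrap replicate alignments is shown; tree building and consensus would be done by whichever phylogeny program is in use. The function names, the fixed random seed and the dictionary layout of the alignment are assumptions for illustration.

```python
import random


def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping the same length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # The same sampled column indices are applied to every sequence.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_replicates(alignment: dict, n_replicates: int = 100, seed: int = 0) -> list:
    """Return a list of resampled alignments."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]
```

Each replicate would then be passed to the tree-building method, and the fraction of replicate trees containing a given group gives its bootstrap proportion.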

28
Phylogenetic uncertainty and subtree neighbors
If no known functions are available for a group
of sequences (e.g., subtree A, in the figure
above), function may be inferred based on data
available for sequences in a neighboring subtree.
(This idea is referred to as "subtree neighbors"
by Zmasek and Eddy.) However, the coarse
branching order between clades can be highly
variable. If you build four trees (NJ, ML, MP,
MrBayes) you may find four (or more!) different
tree topologies.
Example: SH2 domains. SH2 domains bind
phosphopeptides; the residues responsible for
binding are generally conserved within
subfamilies but vary across subfamilies.
In cases where we know something about the
function of members of a family, can we use this
information to evaluate which of the different
trees might be more reliable?
[Figure: Src homology 2 domain, 1SPSA]
29
Recommendations on phylogeny estimation
  • Most important factor is the quality of the input
    data
  • Look at the data from as many angles as possible.
    Does this sound familiar??
  • Choice of outgroup taxa can have as much
    influence as the ingroup taxa.
  • Compute every analysis with several outgroups?
  • Recognize that different programs can be
    influenced by the order of the analysis, i.e. the
    order in which the sequences appear in your file.
  • PHYLIP and PAUP offer a "jumble" option that
    reruns the analysis with different input orders.

http://biology.unm.edu/biology/maggieww/Public_Html/phylogeny.htm
30
Orthology prediction
31
Major papers on orthology prediction
  • Fitch, W.M. (2000) Homology: a personal view on
    some of the problems. Trends Genet, 16, 227-231.
  • Sonnhammer and Koonin, Orthology, paralogy and
    proposed classification for paralog subtypes,
    Trends in Genetics 2002 (introduces key terms
    such as inparalog)
  • Remm, M., Storm, C.E. and Sonnhammer, E.L. (2001)
    Automatic clustering of orthologs and in-paralogs
    from pairwise species comparisons. J Mol Biol,
    314, 1041-1052. (InParanoid method)
  • Christian M Zmasek and Sean R Eddy (2002) RIO:
    Analyzing proteomes by automated phylogenomics
    using resampled inference of orthologs, BMC
    Bioinformatics (introduces the terms
    super-ortholog and ultra-paralog)
  • Tatusov, R.L. et al. (2003) The COG database: an
    updated version includes eukaryotes. BMC
    Bioinformatics, 4, 41.
  • Chen, F., Mackey, A.J., Stoeckert, C.J., Jr. and
    Roos, D.S. (2006) OrthoMCL-DB: querying a
    comprehensive multi-species collection of
    ortholog groups. Nucleic Acids Res, 34, D363-368.
  • Jensen, L.J., Julien, P., Kuhn, M., von Mering,
    C., Muller, J., Doerks, T. and Bork, P. (2008)
    eggNOG: automated construction and annotation of
    orthologous groups of genes. Nucleic Acids Res,
    36, D250-254.
  • Ruan, J., Li, H., Chen, Z., Coghlan, A., Coin,
    L.J., Guo, Y., Heriche, J.K., Hu, Y.,
    Kristiansen, K., Li, R. et al. (2008) TreeFam:
    2008 Update. Nucleic Acids Res, 36, D735-740.
    (phylogenomic identification of orthologs, but
    restricted to animal genomes)
  • Ruchira S. Datta, Christopher Meacham, Bushra
    Samad, Christoph Neyer and Kimmen Sjölander
    (2009) "Berkeley PHOG: PhyloFacts Orthology Group
    Prediction Web Server," Nucleic Acids Research
    (2009 Web server issue)

32
Other papers about orthologs
  • Greg Petsko "Homologuephobia", Genome Biology
    2001
  • Eugene Koonin, "An Apology for Orthologs - or
    brave new memes", Genome Biology 2001
  • R A Jensen "Orthologs and Paralogs - we need to
    get it right", Genome Biology 2001 (includes a
    discussion of conflicting opinions by Eugene
    Koonin and Greg Petsko on this subject)
  • Iddo Friedberg, "Automated Function Prediction:
    the Genomic Challenge", Briefings in
    Bioinformatics, 2006 (includes a discussion of
    how function prediction methods are evaluated)

33
Why is orthology important?
34
Orthology definitions
Orthology (standard definition): the most recent
common ancestor (MRCA) of two sequences must
correspond to a speciation event. (By this
definition, the yeast sequence is orthologous to
both H1 and H2, which are co-orthologs to the
yeast sequence.)
Super-orthology is more restrictive than
orthology: all nodes on a path between two leaves
must correspond to a speciation event. (Zmasek &
Eddy, 2002)
[Figure labels: S (speciation); D (duplication);
super-orthologs. Leaves: H1 C1 M1 R1 F1 W1, H2 C2
M2 R2 F2 W2, Yeast. H = human, C = chimp, M =
mouse, R = rat, F = fly, W = worm]
Advantages of super-orthology: it is transitive,
and provides a secure basis for inference of
function.
Disadvantages: highly restrictive (low coverage)
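The two definitions can be applied mechanically once every internal node of a gene tree is labeled as a speciation (S) or duplication (D) event. The sketch below assumes such labels are already available and encodes internal nodes as tuples of the form (event, child, child); this encoding, the function names and the reduced toy tree are assumptions for illustration.

```python
def path_to(node, leaf):
    """Internal nodes from `node` down to `leaf`, or None if the leaf is absent."""
    if isinstance(node, str):
        return [] if node == leaf else None
    _event, *children = node
    for child in children:
        below = path_to(child, leaf)
        if below is not None:
            return [node] + below
    return None


def classify(tree, a, b):
    """Classify two distinct leaves of a labeled gene tree."""
    path_a, path_b = path_to(tree, a), path_to(tree, b)
    shared = 0
    while (shared < min(len(path_a), len(path_b))
           and path_a[shared] is path_b[shared]):
        shared += 1
    mrca = path_a[shared - 1]  # deepest node common to both paths
    on_path = [mrca] + path_a[shared:] + path_b[shared:]
    if mrca[0] != "S":
        return "paralogs"  # the MRCA is a duplication
    if all(node[0] == "S" for node in on_path):
        return "super-orthologs"  # every node on the path is a speciation
    return "orthologs"  # speciation at the MRCA, duplication elsewhere on the path


# Reduced version of the slide's example: yeast, then a duplication in animals.
gene_tree = ("S", "Yeast", ("D", ("S", "H1", "M1"), ("S", "H2", "M2")))
human_mouse = classify(gene_tree, "H1", "M1")
yeast_human = classify(gene_tree, "Yeast", "H1")
human_human = classify(gene_tree, "H1", "H2")
```

In this reduced tree, H1 and M1 come out as super-orthologs, Yeast and H1 as orthologs only (a duplication lies on the path between them), and H1 and H2 as paralogs.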
35
New terms
  • Co-ortholog
  • Inparalog
  • Outparalog

Sonnhammer and Koonin, Orthology, paralogy and
proposed classification for paralog subtypes,
Trends in Genetics, 2002
36
Orthologs and paralogs - we need to get it right,
Roy A Jensen, Genome Biology 2001
37
Gene tree-species tree reconciliation is used to
identify orthologs
38
Ortholog prediction accuracy
  • PHOG-O: standard orthology definition
  • PHOG-S: super-orthologs
  • PHOG-T: thresholded PHOGs
  • PHOG-T(M): optimized for mouse
  • PHOG-T(Z): optimized for zebrafish
  • PHOG-T(F): optimized for fruit fly

Benchmark dataset: 100 (non-homologous) human
sequences from TreeFam-A, filtered to remove
homologs, and manually curated orthologs from
mouse, zebrafish and fruit fly.
39
PHOG search results: genes can belong to
multiple orthology groups, based on regions
selected or time points at which sequences were
clustered
40
How is PHOG different from InParanoid, OrthoMCL
and TreeFam?
  • PHOG is based on precalculated trees in
    PhyloFacts (55K protein families, and expanding
    daily)
  • PhyloFacts includes trees for individual domains
    as well as entire domain architectures
  • PhyloFacts trees include sequences from across
    the Tree of Life (all species, not restricted to
    whole genomes)
  • TreeFam is restricted to animal genomes
  • TreeFam uses gene tree-species tree
    reconciliation (PHOG does not)
  • This should enable TreeFam to be more accurate,
    assuming the species tree is correct (note: this
    cannot always be assumed)
  • Advantages of PHOG
  • Individual domains provide different evolutionary
    perspectives
  • Improved taxonomic sampling (known to be critical
    for tree topology accuracy)
  • Requiring global homology is often too
    restrictive and can cause topology errors
  • Allowing local matches avoids problems with
    rejecting sequences due to gene model errors
    (very common among eukaryotes)

41
Relevant questions
  • What are the uses of orthologs?
  • What is the (accepted/common) definition of
    ortholog and paralog?
  • What is the most common method used to identify
    orthologs? Is this approach rigorous?
  • How can you be sure you have an ortholog? (What
    evidence is required to assert two proteins are
    orthologs?)
  • What is the actual significance of two proteins
    being orthologous?
  • How is orthology accuracy assessed? Is there a
    standard?
  • What is the expected accuracy of ortholog
    identification using different methods?