Building Phylogenetic Trees - PowerPoint PPT Presentation

Chapter 7 Building Phylogenetic Trees – PowerPoint PPT presentation

Slides: 76
Provided by: NHRI8
1
  • Chapter 7
  • Building Phylogenetic Trees

2
Contents
  • Phylogeny
  • Phylogenetic trees
  • How to make a phylogenetic tree from pairwise
    distances
  • UPGMA method (an example)
  • Neighbor-Joining method (an example)
  • Comparison of methods
  • Conclusion

3
Phylogeny
  • Phylogeny is the evolution of related
    species/genes
  • Phylogenetic tree: a diagram showing the evolutionary
    lineages of species/genes
  • The histories of genes and of species may be very
    different
  • Genes can be homologous or analogous and still
    resemble each other

4
Phylogeny
  • The similarity of molecular mechanisms of the
    organisms that have been studied strongly
    suggests that all organisms on Earth had a common
    ancestor
  • Any set of species is related, and this
    relationship is called a phylogeny
  • The relationship can be represented by a
    phylogenetic tree

5
Phylogeny
  • Traditionally, morphological characters (both
    from living and fossilized organisms) have been
    used for inferring phylogenies
  • Zuckerkandl and Pauling (1962) showed that
    molecular sequences provide sets of characters
    that can carry a large amount of information
  • If we have a set of sequences from different
    species, we may be able to use them to infer a
    likely phylogeny of the species in question
  • This assumes that the sequences have descended
    from some common ancestral gene in a common
    ancestral species

6
Phylogeny
  • The widespread occurrence of gene duplication
    means that the foregoing assumption needs to be
    checked carefully
  • The phylogenetic tree of a group of sequences does
    not necessarily reflect the phylogenetic tree of
    their host species, because gene duplication is
    another mechanism, in addition to speciation, by
    which two sequences can be separated and diverge
    from a common ancestor

7
Phylogeny
  • Genes which diverged because of speciation are
    called orthologues
  • Genes which diverged by gene duplication are
    called paralogues

8
Phylogeny
  • Homologous sequences can be divided into two
    groups
  • Orthologous sequences diverged by speciation
    from a common ancestor
  • Paralogous sequences evolved by gene duplication
    within a species
  • Analogous sequences may appear and function very
    similarly, but they do not have a common ancestor
  • WHEN WE WANT TO EXPLORE EVOLUTIONARY
    RELATIONSHIPS, WE NEED TO HANDLE ORTHOLOGOUS
    SEQUENCES

9
(No Transcript)
10
Orthologues / Paralogues
11
Orthology/paralogy
Orthologous genes are homologous (corresponding)
genes in different species (genomes).
Paralogous genes are homologous genes within the same
species (genome).
12
Phylogenetic Trees
  • WHY construct a phylogenetic tree?
  • to understand lineage of various species
  • to understand how various functions evolved
  • to inform multiple alignments
  • Trees can be rooted (a common ancestor is known)
    or unrooted
  • Leaves are the terminal nodes that correspond to
    the observed sequences of genes or species (A, B,
    C, D)
  • Internal nodes are hypothetical ancestral nodes
  • All trees will be assumed to be binary, meaning
    that an edge that branches splits into two
    daughter edges
  • Each edge has a certain amount of evolutionary
    divergence associated with it, defined by some
    measure of distance between sequences, or from a
    model of substitution of residues over the course
    of evolution

13
(No Transcript)
14
Phylogenetic Trees
  • We adopt the general term "length" or "edge
    length" here, and represent it by the lengths
    of the edges in the figures we draw
  • A true biological phylogeny has a root, or
    ultimate ancestor of all the sequences
  • The leaves of trees have names or numbers
  • A tree with a given labelling will be called a
    labelled branching pattern
  • We refer to this as the tree topology and denote
    it by the symbol T
  • The lengths of its edges are denoted by $t_i$ with a
    suitable numbering scheme for the $i$'s

15
Rooted / Unrooted Tree
16
Types of trees
Unrooted tree represents the same phylogeny
without the root node
Depending on the model, data from current day
species often does not distinguish between
different placements of the root.
17
Rooted versus unrooted trees
[Figure: unrooted tree with leaves a, b and c. It represents all three rooted trees.]
18
Rooting the tree
To root a tree mentally, imagine that the tree is
made of string. Grab the string at the root
and tug on it until the ends of the string (the
taxa) fall opposite the root
Unrooted tree
19
Counting Trees
20
Counting Trees
(2N - 5)!! unrooted trees for N taxa
(2N - 3)!! rooted trees for N taxa
21
How many trees?
  • Number of unrooted trees: $\frac{(2n-5)!}{2^{n-3}\,(n-3)!} = 3 \times 5 \times \cdots \times (2n-5)$
  • Number of rooted trees: $\frac{(2n-3)!}{2^{n-2}\,(n-2)!} = 3 \times 5 \times \cdots \times (2n-3)$

22
Combinatoric explosion
  sequences   unrooted trees   rooted trees
  2           1                1
  3           1                3
  4           3                15
  5           15               105
  6           105              945
  7           945              10,395
  8           10,395           135,135
  9           135,135          2,027,025
  10          2,027,025        34,459,425
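
The counts in the table follow from the product form 3 × 5 × ... given on the previous slide. A minimal Python sketch that reproduces them; the function names are illustrative and not part of the slides:

```python
def odd_product(m):
    """Return 1 * 3 * 5 * ... * m for odd m (1 when m < 3)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def unrooted_trees(n):
    """Number of unrooted binary trees on n labelled leaves: (2n - 5)!!"""
    return odd_product(2 * n - 5)


def rooted_trees(n):
    """Number of rooted binary trees on n labelled leaves: (2n - 3)!!"""
    return odd_product(2 * n - 3)


for n in range(2, 11):
    print(n, unrooted_trees(n), rooted_trees(n))
```

The number of rooted trees on n leaves equals the number of unrooted trees on n + 1 leaves, which is why each row's rooted count reappears as the next row's unrooted count.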

23
Phylogenetic trees
  • Different ways to represent a phylogenetic tree
    (illustrated by Treeview)

24
Making a tree from pairwise distances
  • Distances $d_{ij}$ between each pair of sequences i
    and j in the given dataset are calculated
  • There are different ways of defining distances
  • For nucleotide sequences:
  • Jukes-Cantor, Kimura 2-parameter (K2P), HKY
    (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general
    time-reversible model, general 12-parameter model
  • For amino acid sequences:
  • PAM matrices, BLOSUM matrices

A B C D
A 0 32 44 46
B 32 0 29 43
C 44 29 0 30
D 46 43 30 0
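
As one concrete instance of the nucleotide distances listed on this slide, the Jukes-Cantor distance corrects the observed fraction p of differing sites to d = -(3/4) ln(1 - 4p/3). A minimal sketch, assuming two aligned, gap-free sequences of equal length; the function names and the toy sequences are illustrative only:

```python
import math


def p_distance(a, b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(a, b):
    """Jukes-Cantor distance: expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy example: one difference in four sites, so p = 0.25
print(jukes_cantor("ACGT", "ACGA"))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.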
25
Distance matrix methods
  • UPGMA
  • Algorithm introduced by Sokal and Michener 1958
  • Neighbor-Joining
  • Algorithm introduced by Saitou and Nei 1987
  • Modified by Studier and Keppler 1988

26
Clustering method UPGMA
  • UPGMA: Unweighted Pair Group Method using
    Arithmetic averages
  • Simple method
  • It works by clustering the sequences, at each
    stage merging two clusters and
    creating a new node on the tree
  • The method assumes an equal rate of evolutionary change
    along all branches → molecular clock assumption

27
UPGMA
  • UPGMA produces a rooted tree
  • Branch lengths satisfy a molecular clock
  • → The divergence of sequences is assumed to occur
    at the same constant rate at all points in the
    tree
  • Trees that are clocklike are rooted, and the total
    branch length from the root to any leaf is
    equal
  • Such trees are often said to be ultrametric
  • Distances are ultrametric if, for any three sequences i, j and k,
    either all three distances are equal,
    $d_{ij} = d_{ik} = d_{jk}$, or two of them are equal and one
    is smaller, $d_{jk} < d_{ij} = d_{ik}$
  • → UPGMA is guaranteed to build the correct tree
    if the distances are ultrametric
  • The method can be used for reconstructing phylogenies
    if evolutionary rates are assumed to be the same in
    all lineages → criticised in the phylogeny
    literature
  • Suitable for closely related species
  • Running time $O(n^2)$

[Figure: rooted UPGMA tree with leaves A, B, C and D]
28
Algorithm UPGMA
Initialisation
  • Assign each sequence i in the dataset to its own cluster $C_i$
  • Define one leaf of T for each sequence, and place it at height
    zero
Iteration
  • Find the two clusters i and j for which $d_{ij}$ is the smallest
    (pick randomly if several distances are equal)
  • Define a new cluster ij by $C_{ij} = C_i \cup C_j$. Cluster ij has
    $n_{ij} = n_i + n_j$ members (initially $n_i = 1$)
  • Connect i and j on the tree to a new node v
  • The new node is placed at height $d_{ij}/2$, which fixes the branch
    lengths from v to i and j
29
Algorithm UPGMA (cont.)
  • Iteration (cont.)
  • Compute the distances between the new cluster
    and each remaining cluster k by using
    $d_{(ij)k} = \frac{n_i d_{ik} + n_j d_{jk}}{n_i + n_j}$
  • Add ij to the current clusters and remove i and
    j
  • Termination
  • When only two clusters i and j remain, place the
    root at height $d_{ij}/2$
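
The initialisation, iteration and termination steps can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: it assumes the size-weighted average update and a node height of half the merged distance (both consistent with the worked example later in the deck), and it breaks ties by taking the first minimum rather than picking randomly. The function name `upgma` and the merge-list return format are assumptions made for this sketch.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges as (cluster_i, cluster_j, node_height) in order."""
    # Copy the matrix into a dict of dicts keyed by cluster name.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    size = {a: 1 for a in names}      # n_i = 1 initially
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters (first minimum wins ties).
        i, j = min(combinations(clusters, 2), key=lambda p: d[p[0]][p[1]])
        new = i + j
        merges.append((i, j, d[i][j] / 2))   # new node at height d_ij / 2
        # Size-weighted average distance to every remaining cluster.
        d[new] = {}
        for k in clusters:
            if k not in (i, j):
                avg = (size[i] * d[i][k] + size[j] * d[j][k]) / (size[i] + size[j])
                d[new][k] = d[k][new] = avg
        size[new] = size[i] + size[j]
        clusters = [new] + [k for k in clusters if k not in (i, j)]
    return merges


matrix = [[0, 8, 7, 12],
          [8, 0, 9, 14],
          [7, 9, 0, 11],
          [12, 14, 11, 0]]
for left, right, height in upgma(["A", "B", "C", "D"], matrix):
    print(left, right, round(height, 2))
```

With the four-sequence matrix used in the example slides, the merges come out as A with C at height 3.5, AC with B at 4.25, and ACB with D at about 6.17.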

30
UPGMA -- Unweighted Pair Group Method with
Arithmetic mean
Simplest method: uses a sequential clustering
algorithm (assumes rate constancy among
lineages, which is often violated)
[Figure: distance matrix and tree after step 1 (A and B joined) and step 2 (AB joined with C)]
$d_{(AB)C} = (d_{AC} + d_{BC}) / 2$
31
UPGMA -- Illustrations
32
An example UPGMA (1)
  • Distance matrix (arbitrary)
  • for four items (sequences)
  • A, B, C and D
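  • These distances are not ultrametric: for a triple i, j, k the
    three distances are neither all equal ($d_{ij} = d_{ik} = d_{jk}$)
    nor two equal with the third smaller ($d_{jk} < d_{ij} = d_{ik}$)
  • For example, $d_{AB} = 8$, $d_{AC} = 7$ and $d_{BC} = 9$ are all
    different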

A B C D
A 0 8 7 12
B 8 0 9 14
C 7 9 0 11
D 12 14 11 0
Step 1. Find the smallest distance, $d_{ij}$, between
two clusters → A and C, where $d_{AC} = 7$
33
An example UPGMA (2)
  • Step 2. Define new cluster ij, which has
    $n_{ij} = n_i + n_j$
  • members (initially $n_i = 1$)
  • New cluster → A and C
  • $n_{AC} = n_A + n_C = 2$
  • Step 3. Connect A and C on the tree to a new
    node v1
  • Step 4. The branch lengths from new node v1 to A
    and C are $d_{AC}/2 = 3.5$

   A  B  C  D
A  0  8  7  12
B     0  9  14
C        0  11
D           0
[Figure: A and C joined at node v1, each with branch length 3.5]
34
An example UPGMA (3)
  • Step 5. Compute the distances between the new
    cluster AC and the remaining clusters (B and D)
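
The formula itself did not survive in the transcript. Assuming the size-weighted average used by UPGMA, and reading the values from the original matrix, the two new distances work out as follows (they agree with the reduced matrix shown on the next slide):

```latex
\begin{align*}
d_{(AC)B} &= \frac{d_{AB} + d_{BC}}{2} = \frac{8 + 9}{2} = 8.5 \\
d_{(AC)D} &= \frac{d_{AD} + d_{CD}}{2} = \frac{12 + 11}{2} = 11.5
\end{align*}
```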

35
An example UPGMA (4)
  • Step 6. Delete the columns and rows of the
    distance matrix that correspond to clusters A and
    C, and add a column and a row for cluster AC

    AC  B    D
AC  0   8.5  11.5
B       0    14
D            0
→ New distance matrix
36
An example UPGMA (5)
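  • 2nd iteration
  • Step 1. Find the two clusters i and j for which
    $d_{ij}$
  • is the smallest (pick randomly if several distances
    are equal)
  • → AC and B, where $d_{(AC)B} = 8.5$
  • Step 2. Define new cluster ij, which has
    $n_{ij} = n_i + n_j$
  • members (initially $n_i = 1$). New cluster → AC and
    B
  • $n_{ACB} = n_{AC} + n_B = 2 + 1 = 3$
  • Step 3. Connect AC and B on the tree to a new
    node v2
  • Step 4. Place v2 at height $d_{(AC)B}/2 = 4.25$: the branch
    from v2 to B has length 4.25, and the branch from v2 to v1
    has length $4.25 - 3.5 = 0.75$

    AC  B    D
AC  0   8.5  11.5
B       0    14
D            0
[Figure: A and C joined at v1 (branch lengths 3.5 each); v1 and B joined at v2 (height 4.25)]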
37
An example UPGMA (6)
  • Step 5. Compute the distances between the new
    cluster and the remaining cluster (D)
  • Step 6. Delete the columns and rows of the
    distance matrix that correspond to clusters AC
    and B, and add a column and a row for cluster ACB

     ACB  D
ACB  0    12.33
D         0
→ New distance matrix
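
Under the same size-weighted averaging assumption, the single remaining distance is obtained from the previous reduced matrix, with cluster AC counting twice because it holds two sequences:

```latex
\[
d_{(ACB)D} = \frac{n_{AC}\, d_{(AC)D} + n_B\, d_{BD}}{n_{AC} + n_B}
           = \frac{2 \times 11.5 + 1 \times 14}{3}
           = \frac{37}{3} \approx 12.33
\]
```

This equals the plain average of the three original distances from D, $(12 + 14 + 11)/3$, which is what "unweighted" refers to: every sequence contributes equally.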
38
An example UPGMA (7)
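Termination: only two clusters (ACB and D)
remain. Place the root at height $d_{(ACB)D}/2 = 12.33/2 = 6.17$
     ACB  D
ACB  0    12.33
D         0
Original distance matrix and final phylogenetic
tree (including the branch lengths)
   A  B  C  D
A  0  8  7  12
B     0  9  14
C        0  11
D           0
[Figure: final UPGMA tree. Branch lengths: A 3.5, C 3.5, B 4.25, D 6.17; internal branches 0.75 (v1 to v2) and 1.92 (v2 to root)]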
39
When UPGMA fails
40
When UPGMA fails
  • The closest leaves are not neighboring leaves:
    they do not have a common parent node
  • A test of whether the reconstruction is likely to be
    correct is the ultrametric condition
  • Distances are ultrametric if, for any three sequences i, j and k,
    either all three distances are equal,
    $d_{ij} = d_{ik} = d_{jk}$, or two of them are equal and one
    is smaller, $d_{jk} < d_{ij} = d_{ik}$
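
The ultrametric condition can be checked mechanically: in every triple of sequences, the two largest of the three distances must be equal. A minimal sketch, assuming a symmetric distance matrix given as a list of lists; the function name, the tolerance parameter and the second matrix are illustrative assumptions:

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point condition: the two largest distances of each triple agree."""
    for i, j, k in combinations(range(len(dist)), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if largest - middle > tol:
            return False
    return True


# Matrix from the UPGMA example slides (order A, B, C, D)
example = [[0, 8, 7, 12],
           [8, 0, 9, 14],
           [7, 9, 0, 11],
           [12, 14, 11, 0]]

# Hypothetical clocklike matrix: A and C split at 7, B at 8.5, D at 12
clocklike = [[0, 8.5, 7, 12],
             [8.5, 0, 8.5, 12],
             [7, 8.5, 0, 12],
             [12, 12, 12, 0]]

print(is_ultrametric(example), is_ultrametric(clocklike))
```

The example matrix fails the test already on the triple A, B, C (distances 8, 7, 9), so UPGMA is not guaranteed to recover the correct tree for it.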

41
Ultrametric Distances
Given three leaves, two distances are equal while
a third is smaller: $d(i,j) \le d(i,k) = d(j,k)$, i.e.
$a + a \le a + b = a + b$
Nodes i and j are at the same evolutionary distance
from k. The dendrogram will therefore have
aligned leaves, i.e. they are all at the same
distance from the root
[Figure: dendrogram with leaves i, j and k; branch lengths labelled a and b]
42
Evolutionary clock speeds
Uniform clock: ultrametric distances lead to
identical distances from root to leaves
Non-uniform evolutionary clock: leaves have
different distances to the root. An important
property is that of additive trees. These are
trees where the distance between any pair of
leaves is the sum of the lengths of the edges
connecting them. Such trees obey the so-called
4-point condition (next slide).
43
Additivity
  • Given a tree, its edge lengths are said to be
    additive if the distance between any pair of
    leaves is the sum of the lengths of the edges on
    the path connecting them
  • This property is built in automatically as the
    UPGMA tree is constructed
  • It is possible for the molecular clock property
    to fail but for additivity to hold, and in that
    case there are algorithms that can be used to
    reconstruct the tree correctly

44
Neighbor Joining
  • Very popular method
  • Does not make the molecular clock assumption:
    a modified distance matrix is constructed to adjust
    for differences in the rate of evolution of each taxon
  • Produces an unrooted tree
  • Assumes additivity: the distance between a pair of
    leaves is the sum of the lengths of the edges connecting them
  • Like UPGMA, constructs tree by sequentially
    joining subtrees

45
Neighbor Joining Once we know the correct (i,j)
pair
46
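  • $d_{im} = d_{ik} + d_{km}$
  • $d_{jm} = d_{jk} + d_{km}$
  • $d_{im} + d_{jm} = d_{ik} + d_{jk} + 2 d_{km} = d_{ij} + 2 d_{km}$
  • $d_{km} = (d_{im} + d_{jm} - d_{ij}) / 2$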

47
Neighbour Joining why not pick the smallest
(i,j) pair?
48
Neighbour Joining(3)
49
Neighbour Joining Algorithm
50
Neighbor-Joining Complexity
  • Each iteration performs a search taking time $O(n^2)$ and
    takes time $O(n^2)$ to update the distance matrix
  • This gives a total time complexity of $O(n^3)$ and a
    space complexity of $O(n^2)$

51
Neighbor-Joining
  • We can use neighbor-joining even if the lengths are
    not additive, but reconstruction of the correct
    tree is no longer guaranteed
  • We can test for additivity
  • For every set of four leaves i, j, k and l, two
    of the distances $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$ and $d_{il} + d_{jk}$
    must be equal and larger than the third
  • $d_{ij} + d_{kl} = d_{ik} + d_{jl} > d_{il} + d_{jk}$
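
The additivity test can be written directly from this four-leaf condition. A minimal sketch, assuming a symmetric matrix given as a list of lists; it accepts the case where all three sums are equal (a star-like quartet), which is a slightly looser reading than "larger than the third". The function name and tolerance are illustrative:

```python
from itertools import combinations


def is_additive(dist, tol=1e-9):
    """Four-point condition: the two largest pair sums of each quartet agree."""
    for i, j, k, l in combinations(range(len(dist)), 4):
        sums = sorted((dist[i][j] + dist[k][l],
                       dist[i][k] + dist[j][l],
                       dist[i][l] + dist[j][k]))
        if sums[2] - sums[1] > tol:
            return False
    return True


# Matrix used in the UPGMA and N-J example slides (order A, B, C, D)
example = [[0, 8, 7, 12],
           [8, 0, 9, 14],
           [7, 9, 0, 11],
           [12, 14, 11, 0]]

# Matrix shown on the "Making a tree from pairwise distances" slide
pairwise = [[0, 32, 44, 46],
            [32, 0, 29, 43],
            [44, 29, 0, 30],
            [46, 43, 30, 0]]

print(is_additive(example), is_additive(pairwise))
```

For the example matrix the three sums are 19, 21 and 21, so it is additive and neighbor-joining can recover its tree exactly; for the other matrix they are 62, 75 and 87, so it is not.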

52
Additivity
53
Additivity
  • Theorem: a set M of L objects is additive iff any
    subset of four objects can be labeled i, j, k, l so
    that
  • $d(i,k) + d(j,l) = d(i,l) + d(k,j) \ge d(i,j) + d(k,l)$

54
Additive trees
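All distances satisfy the 4-point condition. For all
leaves i, j, k, l:
$d(i,j) + d(k,l) \le d(i,k) + d(j,l) = d(i,l) + d(j,k)$
$(a + b) + (c + d) \le (a + m + c) + (b + m + d) = (a + m + d) + (b + m + c)$
[Figure: tree with leaves i, j, k, l on edges of length a, b, c, d, joined by an internal edge of length m]
Result: all pairwise distances are obtained by
traversing the tree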
55
An example N-J (1)
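   A   B   C   D    Step 1: r_i
A  0   8   7   12   (8 + 7 + 12)/(4 - 2) = 13.5
B  8   0   9   14   (8 + 9 + 14)/(4 - 2) = 15.5
C  7   9   0   11   (7 + 9 + 11)/(4 - 2) = 13.5
D  12  14  11  0    (12 + 14 + 11)/(4 - 2) = 18.5
Step 1. Compute $r_i = \frac{1}{n-2} \sum_k d_{ik}$ for each row in the distance
matrix
Step 2. Compute $d_{ij} - (r_i + r_j)$ (the lower-diagonal
matrix) and choose the smallest (most
negative)
   A                         B                         C                         D
A  0                         8                         7                         12
B  8 - (13.5 + 15.5) = -21   0                         9                         14
C  7 - (13.5 + 13.5) = -20   9 - (15.5 + 13.5) = -20   0                         11
D  12 - (13.5 + 18.5) = -20  14 - (15.5 + 18.5) = -20  11 - (13.5 + 18.5) = -21  0
The minimum, -21, is reached by both the pair A, B and the pair C, D; the example joins A and B.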
56
An example N-J (2)
Step 3. Join A and B together with a new node
v1. Compute the edge lengths from A to node v1
and from B to node v1
Step 4. Compute
distances between the new node v1 and the remaining
items (C and D)
[Figure: A and B joined at node v1; edge lengths A 3 and B 5]
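
The formulas for steps 3 and 4 did not survive in the transcript. Assuming the standard neighbor-joining update in the Studier and Keppler form, with $r_A = 13.5$ and $r_B = 15.5$ from step 1, the calculation reproduces the edge lengths 3 and 5 in the figure and the reduced distances 4 and 9 used on the next slide:

```latex
\begin{align*}
d_{A,v_1} &= \tfrac{1}{2}\,(d_{AB} + r_A - r_B) = \tfrac{1}{2}\,(8 + 13.5 - 15.5) = 3 \\
d_{B,v_1} &= d_{AB} - d_{A,v_1} = 8 - 3 = 5 \\
d_{v_1,C} &= \tfrac{1}{2}\,(d_{AC} + d_{BC} - d_{AB}) = \tfrac{1}{2}\,(7 + 9 - 8) = 4 \\
d_{v_1,D} &= \tfrac{1}{2}\,(d_{AD} + d_{BD} - d_{AB}) = \tfrac{1}{2}\,(12 + 14 - 8) = 9
\end{align*}
```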
57
An example N-J (3)
New reduced distance matrix
Step 5. Delete A and B from the distance matrix
and replace them by new item AB
Step 6. Continue from step 1, because more than two items
remain
Step 1. Compute $r_i$ for each row
in the distance matrix
Step 2. Compute $d_{ij} - (r_i + r_j)$ and choose
the smallest (the lower-diagonal matrix)
    AB  C   D    Step 1: r_i
AB  0   4   9    (4 + 9)/1 = 13
C   4   0   11   (4 + 11)/1 = 15
D   9   11  0    (9 + 11)/1 = 20
    AB                   C                     D
AB  0                    4                     9
C   4 - (13 + 15) = -24  0                     11
D   9 - (13 + 20) = -24  11 - (15 + 20) = -24  0
58
An example N-J (4)
    AB  C   D    Step 1: r_i
AB  0   4   9    (4 + 9)/1 = 13
C   4   0   11   (4 + 11)/1 = 15
D   9   11  0    (9 + 11)/1 = 20
Step 3. Join v1 and C together with a new node
v2. Compute the edge lengths from v1 to node v2
and from C to node v2
Step 4. Compute
distances between the new node v2 and the remaining
item (D)
[Figure: v1 and C joined at node v2; edge lengths A 3, B 5, v1 to v2 1, C 3]
59
An example N-J (5)
Step 5. Delete AB and C from the distance matrix
and replace them by ABC
Step 6. Only two nodes
remain → connect them
     ABC  D
ABC  0    8
D         0
Original distance matrix and final phylogenetic
tree (including the edge lengths)
   A  B  C  D
A  0  8  7  12
B     0  9  14
C        0  11
D           0
[Figure: final unrooted N-J tree. Edge lengths: A 3, B 5, C 3, D 8; internal edge v1 to v2 1]
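
The whole N-J example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the presenter's code: it uses the Studier and Keppler style update (r_i as the row sum divided by n - 2, the pair minimising d_ij - (r_i + r_j), and the usual edge-length and distance-update formulas), and it breaks ties by taking the first minimum, which happens to pick the same pairs as the slides. The function name and the edge-list return format are hypothetical.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the tree as a list of (node, neighbour, edge_length) triples."""
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: r_i = (sum of distances from i) / (n - 2)
        r = {a: sum(d[a][b] for b in nodes if b != a) / (n - 2) for a in nodes}
        # Step 2: pair minimising d_ij - (r_i + r_j); first minimum wins ties
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Step 3: edge lengths from i and j to the new node v
        v = i + j
        length_i = (d[i][j] + r[i] - r[j]) / 2
        edges += [(i, v, length_i), (j, v, d[i][j] - length_i)]
        # Step 4: distances from v to every remaining node
        d[v] = {}
        for k in nodes:
            if k not in (i, j):
                d[v][k] = d[k][v] = (d[i][k] + d[j][k] - d[i][j]) / 2
        # Step 5: replace i and j by v
        nodes = [v] + [k for k in nodes if k not in (i, j)]
    # Step 6: connect the last two nodes
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


matrix = [[0, 8, 7, 12],
          [8, 0, 9, 14],
          [7, 9, 0, 11],
          [12, 14, 11, 0]]
for edge in neighbor_joining(["A", "B", "C", "D"], matrix):
    print(edge)
```

On the example matrix this yields the edges A 3, B 5, v1 to v2 1, C 3 and D 8. Summing edge lengths along any path returns the original distance, for instance A to D: 3 + 1 + 8 = 12.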
60
Comparison
  • UPGMA
  • The total branch length from the root up to any
    leaf is equal
  • Produces a rooted tree, where the root is
    hypothesized ancestor of the sequences in the
    tree
  • Suitable for closely related sequences
  • Can be used to infer phylogenies if one can
    assume that evolutionary rates are the same in
    all lineages
  • Neighbor-joining
  • Unrooted tree, where the direction of evolution
    is unknown
  • Suitable for datasets with largely varying rates
    of evolution
  • Suitable for large datasets

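[Figure: the two trees for the example data. UPGMA (rooted): A 3.5, C 3.5, B 4.25, D 6.17. Neighbor-joining (unrooted): A 3, B 5, C 3, D 8, internal edge 1]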
61
Comparison
  • UPGMA method constructs a rooted phylogenetic
    tree correctly if there is a molecular clock with
    a constant rate of mutation
  • UPGMA method is rarely used, because the molecular
    clock assumption is not generally true: selection
    pressures vary across time periods, genes within
    organisms, organisms, and regions within a gene
  • N-J method produces an unrooted tree without
    molecular clock hypothesis
  • N-J method is one of the most popular and widely
    used by molecular evolutionists
  • Distance methods are strongly dependent on the
    model of evolution used
  • Sequence information is reduced when transforming
    sequence data into distances
  • Distance methods are computationally fast

62
Parsimony
  • Find the tree which can explain the observed
    sequences with a minimal number of substitutions
  • It assigns a cost to a tree, and it is necessary
    to search through all topologies, or to pursue a
    more efficient search strategy that achieves this
    effect, in order to identify the best tree

63
Parsimony
  • The computation of a cost for a given tree
  • A search through all trees, to find the overall
    minimum of this cost
  • Suppose we have the following four aligned
    nucleotide sequences
  • AAG
  • AAA
  • GGA
  • AGA

64
Parsimony
65
Cost of Evaluating Parsimony
  • Score is evaluated on each position independently.
    Scores are then summed over all positions.
  • If there are n nodes, m characters, and k
    possible values for each character, then
    complexity is O(nmk)
  • By keeping traceback information, we can
    reconstruct most parsimonious values at each
    ancestor node

66
Evaluating Parsimony Scores
  • How do we compute the Parsimony score for a given
    tree?
  • Traditional Parsimony
  • Each base change has a cost of 1
  • Weighted Parsimony
  • Each change is weighted by the score c(a,b)

67
Traditional Parsimony
  • Solved independently for each position
  • Linear time solution

[Figure: example tree with node labels a, a, {a, g} and a]
68
Traditional Parsimony
69
Traditional Parsimony
  • There is a traceback procedure for finding
    ancestral assignments in traditional parsimony
  • We choose a residue from $R_{2n-1}$, then proceed down
    the tree
  • Having chosen a residue from the set $R_k$, we pick
    the same residue from the daughter set $R_i$ if
    possible, and otherwise pick a residue at random
    from $R_i$
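
The slide that builds the residue sets has no transcript, so the following is a sketch of the standard set-based scoring for traditional parsimony (commonly attributed to Fitch), offered as an assumption about what the deck describes. Each node's set is the intersection of its daughters' sets if that is non-empty, otherwise their union at a cost of 1. The nested-tuple tree encoding, the function names and the choice of topologies are hypothetical; the four sequences are the ones given earlier in the deck.

```python
def site_sets(tree, site):
    """Return (residue_set, cost) for one alignment column on a binary tree."""
    if isinstance(tree, str):              # leaf: the observed residue
        return {tree[site]}, 0
    left, right = tree
    left_set, left_cost = site_sets(left, site)
    right_set, right_cost = site_sets(right, site)
    common = left_set & right_set
    if common:                             # no substitution needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Sum the per-column costs; columns are scored independently."""
    return sum(site_sets(tree, site)[1] for site in range(length))


# The three rooted pairings of the four sequences AAG, AAA, GGA, AGA
tree_1 = (("AAG", "AAA"), ("GGA", "AGA"))
tree_2 = (("AAG", "GGA"), ("AAA", "AGA"))
tree_3 = (("AAG", "AGA"), ("AAA", "GGA"))
for tree in (tree_1, tree_2, tree_3):
    print(tree, parsimony_score(tree, 3))
```

Pairing AAG with AAA and GGA with AGA needs 3 substitutions, while the other two pairings need 4, so the first is the most parsimonious of the three.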

70
Traditional Parsimony is not complete
71
Weighted Parsimony
72
Example
A CAGGTA
B CAGACA
C CGGGTA
D TGCACT
E TGCGTA
73
Parsimony / Distance
Parsimony
Sequences    1 2 3 4 5 6 7
Drosophila   t t a t t a a
fugu         a a t t t a a
mouse        a a a a a t a
human        a a a a a a t
[Figure: parsimony tree of Drosophila, fugu, mouse and human with the changes at sites 1 to 7 marked on its branches]
Distance
             human  mouse  fugu  Drosophila
human        x
mouse        2      x
fugu         4      4      x
Drosophila   5      5      3     x
[Figure: distance tree of human, mouse, fugu and Drosophila with branch lengths 2, 1, 2, 1, 1]
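
The distance matrix on this slide is the number of sites at which each pair of sequences differs. A minimal sketch that recomputes it from the seven-site alignment; the variable and function names are illustrative:

```python
from itertools import combinations

# The seven-site alignment from the slide
seqs = {
    "human":      "aaaaaat",
    "mouse":      "aaaaata",
    "fugu":       "aatttaa",
    "Drosophila": "ttattaa",
}


def differences(a, b):
    """Number of aligned sites at which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


distances = {frozenset(pair): differences(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(seqs, 2)}

for pair, value in distances.items():
    print(sorted(pair), value)
```

The distance method keeps only these six numbers, whereas parsimony works with the individual sites. This is the loss of sequence information noted on the comparison slide.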
74
How to assess confidence in tree
  • Distance method bootstrap
  • Select multiple alignment columns with
    replacement
  • Recalculate tree
  • Compare branches with original (target) tree
  • Repeat 100-1000 times, so calculate 100-1000
    different trees
  • How often is branching (point between 3 nodes)
    preserved for each internal node?
  • Uses samples of the data

75
The Bootstrap -- example
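Original alignment
1 2 3 4 5 6 7 8
- C V K V I Y S
M A V R - I F S
M C L R L L F T
Scrambled alignment (columns drawn with replacement)
3 4 3 8 6 6 8 6
V K V S I I S I
V R V S I I S I
L R L T L L T L
[Figure: trees for the original and the scrambled alignment (leaves 1, 2, 3), with labels "2x", "3x" and "Non-supportive"]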
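
The column resampling behind the bootstrap can be sketched as follows. The first call replays the column draw from the slide's example; the second draws a fresh replicate. The function names and the seed are arbitrary choices for illustration, and tree building for each replicate is left out:

```python
import random


def resample_columns(alignment, columns):
    """Build a pseudo-alignment from 1-based column indices (repeats allowed)."""
    return ["".join(row[c - 1] for c in columns) for row in alignment]


def draw_columns(length, rng):
    """Draw `length` column indices uniformly at random, with replacement."""
    return [rng.randint(1, length) for _ in range(length)]


alignment = ["-CVKVIYS", "MAVR-IFS", "MCLRLLFT"]

# Column draw shown in the slide's example
pseudo = resample_columns(alignment, [3, 4, 3, 8, 6, 6, 8, 6])
print(pseudo)

# One fresh bootstrap replicate; the seed is arbitrary
rng = random.Random(0)
replicate = resample_columns(alignment, draw_columns(len(alignment[0]), rng))
print(replicate)
```

In a full bootstrap, a tree would be rebuilt from each of the 100 to 1000 replicates and each internal branch of the original tree scored by how often it reappears.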