Title: Phylogenetic Trees Lecture 12
1Phylogenetic Trees: Lecture 12
Based on pages 160-176 in Durbin et al. (the black textbook).
This class has been edited from Nir Friedman's lecture, which was available at www.cs.huji.ac.il/~nir. Pictures from Tal Pupko's slides. Changes by Dan Geiger and Shlomo Moran.
2Evolution
- Evolution of new organisms is driven by
  - Diversity: different individuals carry different variants of the same basic blueprint
  - Mutations: the DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
  - Selection bias
3The Tree of Life
Source: Alberts et al.
4Tree of life: a better picture
After Ernst Haeckel, 1891
5Primate evolution
A phylogeny, also called a phylogenetic tree, is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species.
6Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features (number of legs, lengths of legs, etc.)
- Modern biological methods allow the use of molecular features
  - Gene sequences
  - Protein sequences
- Analysis based on homologous sequences (e.g., globins) in different species
- Important for many aspects of biology
  - Classification
  - Understanding biological mechanisms
7Morphological topology
(Based on McKenna and Bell, 1997)
[Figure: morphological topology; labeled groups include Archonta and Ungulata]
8From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g., mitochondrial vs. nuclear proteins).
9Mitochondrial topology
(Based on Pupko et al.,)
10Nuclear topology
(Based on Pupko et al. slide)
(tree by Madsen)
11Theory of Evolution
- Basic idea
  - Speciation events lead to the creation of different species.
  - Speciation is caused by physical separation into groups where different genetic variants become dominant
- Any two species share a (possibly distant) common ancestor
12Phylogenetic trees
- Leaves: current day species
- Nodes: hypothetical most recent common ancestors
- Edge lengths: time from one speciation to the next
13Dangers in Molecular Phylogenies
- Gene and protein sequences can be homologous for various reasons
  - Orthologs: sequences diverged after a speciation event. Indicative of a new species.
  - Paralogs: sequences diverged after a duplication event.
  - Xenologs: sequences diverged after a horizontal transfer (e.g., by virus).
14Gene Phylogenies
Phylogenies can be constructed to describe the evolution of genes.
Three species, termed 1, 2, 3. Two paralogous genes, A and B.
15Dangers of Paralogs
- If we happen to consider only the sequences 1A, 2B, and 3A, we get a wrong tree, one that does not represent the phylogeny of the host species of the given sequences, because duplication does not create new species.
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]
In the sequel we assume all given sequences are
orthologs.
16Types of Trees
- A natural model to consider is that of rooted
trees
Common Ancestor
17Types of trees
- Unrooted tree represents phylogeny without the
root node
Depending on the model, data from current day
species does not distinguish between different
placements of the root. In this example there
are seven possible ways to place a root.
18Rooted versus unrooted trees
[Figure: an unrooted tree and the three rooted trees it represents; labels a, b, c]
Slide by Tal Pupko
19Positioning Roots in Unrooted Trees
- We can estimate the position of the root by introducing an outgroup
  - a set of species that are definitely distant from all the species of interest
[Figure: tree over Falcon, Aardvark, Bison, Chimp, Dog and Elephant, with the proposed root marked]
20Type of Data
- Distance-based
  - Input is a matrix of distances between species
  - Can be the fraction of residues they disagree on, or the alignment score between them, or ...
- Character-based
  - Examine each character (e.g., residue) separately
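To make the "fraction of residues they disagree on" concrete, here is a minimal sketch that builds a distance matrix from the four aligned sequences shown on the "From sequences to a phylogenetic tree" slide. The function names `p_distance` and `distance_matrix` are illustrative choices, not part of the lecture, and the sketch assumes the sequences are already aligned to equal length.

```python
from itertools import combinations

# Aligned sequences from the "From sequences to a phylogenetic tree" slide.
sequences = {
    "Rat":     "QEPGGLVVPPTDA",
    "Rabbit":  "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat":     "REPGGLVVPPTEG",
}

def p_distance(s, t):
    """Fraction of aligned positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)

def distance_matrix(seqs):
    """Symmetric dict-of-dicts of pairwise p-distances."""
    d = {name: {name: 0.0} for name in seqs}
    for x, y in combinations(seqs, 2):
        d[x][y] = d[y][x] = p_distance(seqs[x], seqs[y])
    return d

matrix = distance_matrix(sequences)
for x, y in combinations(sequences, 2):
    print(f"{x:8s}{y:8s}{matrix[x][y]:.3f}")
```

On this short fragment Rat and Gorilla are identical, so their distance is 0; a realistic analysis would use much longer alignments.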
21Three Methods of Tree Construction
- Distance: a tree that recursively combines the two nodes with the smallest distance.
- Parsimony: a tree with a minimum total number of character changes between nodes.
- Maximum likelihood: finding the best Bayesian network of a tree shape. The method of choice nowadays. The best known and most useful software, called PHYLIP, uses this method: http://evolution.genetics.washington.edu/phylip.html
22Distance-Based (1st type Method)
- Input: distance matrix between species
- Outline
  - Cluster species together
  - Initially clusters are singletons
  - At each iteration combine the two closest clusters to get a new one
23UPGMA Clustering
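- Let $C_i$ and $C_j$ be clusters; define the distance between them to be the average distance between their members:
  $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
- When we combine two clusters, $C_i$ and $C_j$, to form a new cluster $C_k$, then for every other cluster $C_l$:
  $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$
- Define a node K with daughter nodes for $C_i$ and $C_j$, placed at height $d(C_i, C_j)/2$ above the leaves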
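The clustering steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the lecture's code: the function name `upgma`, the nested-tuple tree representation, and the example matrix are all hypothetical, and ties between equally close clusters are broken arbitrarily.

```python
def upgma(dist, names):
    """Return (tree, height): tree is nested tuples of leaf names.

    dist: symmetric list-of-lists of pairwise distances between leaves.
    names: leaf labels, in the same order as the rows of dist.
    """
    # Each active cluster: id -> (subtree, number of leaves).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    # Current inter-cluster distances, keyed by a frozenset of two ids.
    d = {frozenset((i, j)): dist[i][j]
         for i in clusters for j in clusters if i < j}
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        pair = min(d, key=d.get)            # the two closest clusters
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        height = d.pop(pair) / 2            # new node sits at d(Ci,Cj)/2
        # Size-weighted average gives the distance to every other cluster.
        for l in clusters:
            dil = d.pop(frozenset((i, l)))
            djl = d.pop(frozenset((j, l)))
            d[frozenset((next_id, l))] = (ni * dil + nj * djl) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, height

# Hypothetical clock-like distances: A,B join at height 1, C,D at height 2,
# and the root is at height 4.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 8, 8],
        [2, 0, 8, 8],
        [8, 8, 0, 4],
        [8, 8, 4, 0]]
tree, root_height = upgma(dist, names)
print(tree, root_height)
```

The size-weighted update keeps each cluster distance equal to the average over all leaf pairs, so the full sum never has to be recomputed.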
24Example
UPGMA construction on five objects. The length of an edge is its (vertical) height.
[Figure: UPGMA tree on leaves 1 to 5 with internal nodes 6 to 9; the heights 0.5d(2,3) and 0.5d(7,8) are marked]
25Molecular clock
This phylogenetic tree has all leaves at the same level. When this property holds, the phylogenetic tree is said to satisfy a molecular clock: the time from a speciation event to the formation of the current species is identical for all paths (an assumption that is wrong in reality).
26Molecular Clock
UPGMA constructs trees that satisfy a molecular
clock, even if the true tree does not satisfy a
molecular clock.
27Restrictive Correctness of UPGMA
Proposition: If the distance function is derived by adding edge distances in a tree T with a molecular clock, then UPGMA will reconstruct T.
28Additivity
- A tree with a molecular clock defines additive distances: the distance between any two objects equals the total length of the path between them in a tree
29Basic property of Additivity
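- Suppose the input distances are additive
- For any three leaves i, j, m there is a node k where the paths between them meet; writing a = d(i,k), b = d(j,k), c = d(m,k):
  $d(i,j) = a + b, \quad d(i,m) = a + c, \quad d(j,m) = b + c$
- Thus
  $d(k,m) = c = \frac{1}{2}\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$
[Figure: leaves i, j, m joined at node k by edges of lengths a, b, c]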
30Constructing additive trees: The neighbor finding problem
- Can we use this fact to construct trees assuming
only additivity (but not a molecular clock)?
Yes. The formula shows that if we knew that i
and j are neighboring leaves, then we can
construct their parent node k and compute the
distances of k to all other leaves m. We remove
nodes i,j and add k.
31Neighbor Finding
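- How can we find from distances alone that a pair of nodes i,j are neighboring leaves?
- Closest nodes aren't necessarily neighbors.
Next we show one way to find neighbors from additive distances. For a set L of leaves, let $r_i = \sum_{k \in L} d(i,k)$ and define $D(i,j) = (|L|-2)\,d(i,j) - (r_i + r_j)$.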
32Neighbor Finding
Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
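A small numeric illustration may help here. The sketch below assumes the criterion $D(i,j) = (|L|-2)\,d(i,j) - (r_i + r_j)$ with $r_i$ the sum of distances from leaf i to all other leaves, which is the form the proof's "Definition" step uses. The four-leaf tree is a made-up example: a and b hang off one internal node with edge lengths 1 and 4, c and d hang off another with edge lengths 1 and 4, and the internal edge has length 1.

```python
from itertools import combinations

leaves = ["a", "b", "c", "d"]
# Additive distances read off the hypothetical tree described above.
d = {
    frozenset("ab"): 5, frozenset("ac"): 3, frozenset("ad"): 6,
    frozenset("bc"): 6, frozenset("bd"): 9, frozenset("cd"): 5,
}

def nj_criterion(leaves, d):
    """D(i,j) = (|L|-2) d(i,j) - (r_i + r_j) for every pair of leaves."""
    n = len(leaves)
    r = {i: sum(d[frozenset((i, k))] for k in leaves if k != i)
         for i in leaves}
    return {frozenset((i, j)): (n - 2) * d[frozenset((i, j))] - r[i] - r[j]
            for i, j in combinations(leaves, 2)}

D = nj_criterion(leaves, d)
closest = min(d, key=d.get)                  # pair with the smallest d
best = min(D.values())
neighbors = {pair for pair, value in D.items() if value == best}

print("closest pair by d:", sorted(closest))
print("pairs minimising D:", sorted(sorted(p) for p in neighbors))
```

In this example the closest pair by raw distance is (a, c), which are not neighbors, while D is minimised exactly by the two true neighbor pairs (a, b) and (c, d).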
33Neighbor Joining Algorithm
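- Set L to contain all leaves
- Iteration:
  - Choose i,j such that D(i,j) is minimal
  - Create new node k, and set, for every remaining $m \in L$, $d(k,m) = \frac{1}{2}\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$
  - Remove i,j from L, and add k
- Terminate: when |L| = 2, connect the two remaining nodes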
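The iteration above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the selection criterion is taken to be $D(i,j) = (|L|-2)\,d(i,j) - (r_i + r_j)$ with $r_i$ the sum of distances from i to the other nodes in L, the internal node labels `k0`, `k1`, ... are hypothetical, and the edge-length formula for d(i,k) is an addition not shown on the slide (it is exact when the input distances are additive).

```python
from itertools import combinations

def neighbor_joining(leaves, d):
    """Return a list of (node, node, length) edges.

    leaves: list of leaf labels.
    d: dict mapping frozenset({x, y}) to the distance between x and y.
    """
    d = dict(d)                       # work on a copy
    active = list(leaves)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        r = {i: sum(d[frozenset((i, m))] for m in active if m != i)
             for i in active}
        # Choose the pair minimising D(i,j) = (|L|-2) d(i,j) - (r_i + r_j).
        i, j = min(combinations(active, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = f"k{next_id}"             # hypothetical label for the new node
        next_id += 1
        dij = d[frozenset((i, j))]
        # Edge lengths to the new node (assumption: additive input).
        dik = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, k, dik), (j, k, dij - dik)]
        active = [m for m in active if m not in (i, j)]
        for m in active:
            # d(k,m) = (d(i,m) + d(j,m) - d(i,j)) / 2
            d[frozenset((k, m))] = (d[frozenset((i, m))]
                                    + d[frozenset((j, m))] - dij) / 2
        active.append(k)
    x, y = active                     # connect the two remaining nodes
    edges.append((x, y, d[frozenset((x, y))]))
    return edges

# Hypothetical additive distances: a,b are neighbors, c,d are neighbors.
example = {
    frozenset("ab"): 5, frozenset("ac"): 3, frozenset("ad"): 6,
    frozenset("bc"): 6, frozenset("bd"): 9, frozenset("cd"): 5,
}
for edge in neighbor_joining(["a", "b", "c", "d"], example):
    print(edge)
```

Recomputing every r_i in each round keeps the sketch short at the cost of extra work; an implementation tuned for large inputs would update the sums incrementally.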
34Neighbor Finding
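Notations used in the proof:
- $p(i,j)$ = the path from vertex i to vertex j. Example: $p(D,C) = (e_1, e_2, e_3) = (D, E, F, C)$.
- For a vertex i and an edge e: $N_i(e) = |\{k : e \text{ is on } p(i,k)\}|$. Example: $N_D(e_1) = 3$, $N_D(e_2) = 2$, $N_D(e_3) = 1$, $N_C(e_1) = 1$.
[Figure: example tree containing the path D, E, F, C]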
35Neighbor Finding
Notation: For e = (i,m), we denote d(i,m) by d(e).
[Figure: tree T with vertices i, j, k, l and the remainder labeled "Rest of T"]
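One way this notation is typically put to work, and plausibly the lemma the proof refers to, is the following identity. It is a sketch under two assumptions not stated on this slide: $r_i$ denotes the sum of distances from leaf i to all other leaves, and k in the definition of $N_i(e)$ ranges over leaves. Each edge e then contributes d(e) to d(i,k) exactly when e lies on p(i,k), which gives:

```latex
r_i \;=\; \sum_{k} d(i,k) \;=\; \sum_{e} N_i(e)\, d(e),
\qquad\text{hence}\qquad
r_m - r_i \;=\; \sum_{e} \bigl(N_m(e) - N_i(e)\bigr)\, d(e).
```

Edges that lie off the path between i and m have $N_m(e) = N_i(e)$ and cancel, so only edges on p(i,m) contribute to the difference.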
36Neighbor Finding
Proof of Theorem: Assume, for contradiction, that D(i,j) is minimal for i,j which are not neighboring leaves. Let (i,l,...,k,j) be the path from i to j. Let T1 and T2 be the subtrees rooted at l and k. Let |T| denote the number of leaves in T.
37Neighbor Finding
Case 1: i or j has a neighboring leaf. WLOG j and m are such leaves.
A. $D(i,j) - D(m,j) = (|L|-2)(d(i,j) - d(j,m)) - (r_i + r_j) + (r_m + r_j)$ [Definition]
   $= (|L|-2)(d(i,k) - d(k,m)) + r_m - r_i$ [Figure]
B. $r_m - r_i \ge (|L|-2)(d(k,m) - d(i,l)) + (4-|L|)\,d(k,l)$ [Lemma + Figure]
   (since for each edge $e \in p(k,l)$, $N_m(e) \ge 2$ and $N_i(e) \le |L|-2$, so $N_m(e) - N_i(e) \ge 4-|L|$)
Substituting B in A:
$D(i,j) - D(m,j) \ge (|L|-2)(d(i,k) - d(i,l)) + (4-|L|)\,d(k,l) = 2\,d(k,l) > 0$,
contradicting the minimality assumption.
38Neighbor Finding
Case 2: Not case 1. Then both T1 and T2 contain 2 neighboring leaves. We show that if D(i,j) is minimal, then we must have both $|T_1| > |T_2|$ and $|T_2| > |T_1|$, which is a contradiction; hence D(i,j) is not minimal.
We prove that $|T_1| > |T_2|$ by assuming that $|T_1| \le |T_2|$ and reaching a contradiction. The proof that $|T_2| > |T_1|$ is similar. Let n,m be neighboring leaves in T1, and let p be their common parent.
39Neighbor Finding
A. $0 \le D(m,n) - D(i,j) = (|L|-2)(d(m,n) - d(i,j)) + (r_i + r_j) - (r_m + r_n)$
B. $r_j - r_m < (|L|-2)(d(j,k) - d(m,p)) + (|T_1| - |T_2|)\,d(k,p)$ (because $N_j(e) - N_m(e) < |T_1| - |T_2|$).
C. $r_i - r_n < (|L|-2)(d(i,k) - d(n,p)) + (|T_1| - |T_2|)\,d(l,p)$
Adding B and C, noting that $d(l,p) > d(k,p)$ and using the assumption $|T_1| - |T_2| \le 0$:
D. $(r_i + r_j) - (r_m + r_n) < (|L|-2)(d(i,j) - d(n,m)) + 2(|T_1| - |T_2|)\,d(k,p)$
Substituting D in the right hand side of A:
$0 \le D(m,n) - D(i,j) < 2(|T_1| - |T_2|)\,d(k,p)$, hence $|T_1| - |T_2| > 0$, a contradiction.