Title: Phylogenetic Analysis
1Phylogenetic Analysis
2Introduction
- Intention
- Use powerful algorithms to reconstruct the
evolutionary history of all known organisms.
- Phylogenetic tree
- It can help us understand the evolutionary
relationships among species of organisms.
- But we have to infer the evolutionary history of
current organisms.
3Campanulaceae (bluebell) family
Herpesviruses
4Common Phylogenetic Tree Terminology
- Terminal nodes (A, B, C, D, E): represent the taxa (genes, populations, species,
etc.) used to infer the phylogeny
- Branches or lineages
- Internal nodes or divergence points: represent
hypothetical ancestors of the taxa
- Ancestral node or root of the tree
5Three types of trees
- Cladogram: branch lengths have no meaning
- Phylogram: branch lengths represent genetic change
- Ultrametric tree: branch lengths represent time
[Figure: the same four taxa (A, B, C, D) drawn as a cladogram, a phylogram, and an ultrametric tree]
All show the same evolutionary relationships, or
branching orders, between the taxa.
6Phylogenetic trees diagram the evolutionary
relationships between the taxa
((A,(B,C)),(D,E)) is the above phylogeny written as
nested parentheses.
This notation says that B and C are more closely related
to each other than either is to A, and that A, B,
and C form a clade that is a sister group to the
clade composed of D and E. If the tree has a
time scale, then D and E are the most closely
related.
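The nested-parentheses notation can be read mechanically. The following is a minimal sketch, assuming taxon names contain no parentheses or commas and that there are no branch lengths; the function names `parse_nested` and `clades` are illustrative, not part of any library.

```python
def parse_nested(text):
    """Parse nested parentheses such as "((A,(B,C)),(D,E))" into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a taxon name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def clades(tree):
    """Return the set of taxa below every internal node (one set per clade)."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.append(members)
        return members

    leaves(tree)
    return found


tree = parse_nested("((A,(B,C)),(D,E))")
tree_clades = clades(tree)
```

Here `tree_clades` contains the groups {B, C}, {A, B, C}, and {D, E} described in the text, plus the set of all five taxa at the root.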
7The goal of phylogeny inference is to resolve
the branching orders of lineages in evolutionary
trees
Completely unresolved or "star" phylogeny
Partially resolved phylogeny
Fully resolved, bifurcating phylogeny
8There are three possible unrooted trees for four
taxa (A, B, C, D)
Phylogenetic tree building (or inference) methods
are aimed at discovering which of the possible
unrooted trees is "correct". We would like this
to be the true biological tree, that is, one
that accurately represents the evolutionary
history of the taxa. However, we must settle for
discovering the computationally correct or
optimal tree for the phylogenetic method of
choice.
9The number of unrooted trees increases in a
greater than exponential manner with number of
taxa
(2N - 5)!! unrooted trees for N taxa
(2N - 3)!! rooted trees for N taxa
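To get a feel for how fast these counts grow, the two double-factorial formulas can be evaluated directly. This is a small sketch; the function names are illustrative.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


def rooted_tree_count(n):
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


for n in (4, 5, 10, 20):
    print(n, unrooted_tree_count(n), rooted_tree_count(n))
```

For four taxa this gives 3 unrooted and 15 rooted trees; for ten taxa there are already 2,027,025 unrooted trees.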
10Introduction
- NP-Hard optimization problem
- Unrooted trees of n organisms: $TU(n)$
- Edges of an unrooted tree of n organisms: $E(n) = 2n - 3$
- $TU(n) = TU(n-1)\,E(n-1) = \prod_{i=2}^{n-1} E(i) = \prod_{i=3}^{n} (2i - 5)$ for $n > 2$
- Ex.
[Figure: a new taxon t added on each edge of a tree]
- Rooted trees of n organisms: $TR(n) = TU(n)\,E(n) = TU(n+1)$
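The counting argument (attach the new taxon to each edge of every smaller tree) can be checked by explicit enumeration. The sketch below is one possible encoding, not the slides' own: a rooted tree is a leaf name or a `frozenset` of two subtrees, and an unrooted tree on n taxa is represented by the rooted tree on the other n - 1 taxa that hangs off the first taxon, which mirrors TR(n) = TU(n+1).

```python
def insert_leaf(tree, leaf):
    """Yield every rooted tree obtained by attaching `leaf` to one edge of `tree`."""
    # Attach on the edge above this subtree: a new parent joins subtree and leaf.
    yield frozenset([tree, leaf])
    if isinstance(tree, frozenset):
        left, right = tuple(tree)
        for new_left in insert_leaf(left, leaf):
            yield frozenset([new_left, right])
        for new_right in insert_leaf(right, leaf):
            yield frozenset([left, new_right])


def rooted_trees(taxa):
    """Return the set of all rooted bifurcating trees on the given taxa."""
    trees = {taxa[0]}
    for leaf in taxa[1:]:
        trees = {new for tree in trees for new in insert_leaf(tree, leaf)}
    return trees


def unrooted_tree_count(taxa):
    """Count unrooted trees by fixing taxa[0] and rooting the rest at its edge."""
    return len(rooted_trees(taxa[1:]))


four = ["A", "B", "C", "D"]
five = ["A", "B", "C", "D", "E"]
print(len(rooted_trees(four)), unrooted_tree_count(four))  # rooted and unrooted counts
print(len(rooted_trees(five)), unrooted_tree_count(five))
```

Enumeration is only practical for a handful of taxa, which is the point of the slide: the search space outgrows exhaustive methods very quickly.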
11Inferring evolutionary relationships between the
taxa requires rooting the tree
To root a tree mentally, imagine that the tree is
made of string. Grab the string at the root
and tug on it until the ends of the string (the
taxa) fall opposite the root
Unrooted tree
12Now, try it again with the root at another
position
[Figure: an unrooted tree of taxa A, B, C, and D with the root position marked, and the rooted tree that results]
Note that in this rooted tree, taxon A is most
closely related to taxon B, and together they are
equally distantly related to taxa C and D.
13An unrooted, four-taxon tree theoretically can be
rooted in five different places to produce five
different rooted trees
[Figure: the unrooted tree 1, with taxa A, B, C, and D]
These trees show five different evolutionary
relationships among the taxa!
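The five rootings correspond to the five edges of the unrooted tree (2N - 3 = 5 for N = 4). The sketch below roots a tree on each edge in turn. The topology used, with A and B on one side of the internal edge and C and D on the other, and the internal node names `x` and `y` are assumptions made for illustration.

```python
def root_on_edge(edges, edge):
    """Return the rooted tree (nested tuples) obtained by placing the root on `edge`."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node  # a leaf (taxon)
        return tuple(subtree(child, node) for child in children)

    u, v = edge
    return (subtree(u, v), subtree(v, u))


# Assumed unrooted tree: A and B attach to internal node x, C and D to y.
EDGES = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]

rootings = [root_on_edge(EDGES, edge) for edge in EDGES]
for rooted in rootings:
    print(rooted)
```

Rooting on the internal edge gives ((A,B),(C,D)); rooting on the branch leading to A gives (A,(B,(C,D))), and similarly for the other three terminal branches.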
14All of these rearrangements show the same
evolutionary relationships between the taxa
[Figure: rearrangements of rooted tree 1a, with taxa A, B, C, and D]
15Molecular phylogenetic tree building methods
These are mathematical and/or statistical
methods for inferring the divergence order of
taxa, as well as the lengths of the branches that
connect them. There are many phylogenetic
methods available today, each having strengths
and weaknesses. Most can be classified as
follows:
16parsimony
- model complexity vs. sample size
- minimize Hamming distance summed over all edges
of the tree
- justification: minimum possible number of
evolutionary events
- subject of serious dispute by systematic
biologists
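When every node of a tree, internal nodes included, carries a sequence, the parsimony score is simply the Hamming distance summed over the edges. The sketch below assumes a small hypothetical labeled tree given as (parent sequence, child sequence) pairs; it is not taken from the slides' figure.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_cost(edges):
    """Sum the Hamming distance over all (parent, child) edges of a labeled tree."""
    return sum(hamming(parent, child) for parent, child in edges)


# Hypothetical tree: root AAA with internal children AAA and AGA,
# which in turn have the leaves AAA, AAG and AGA, GGA.
EDGES = [
    ("AAA", "AAA"), ("AAA", "AGA"),   # root to internal nodes
    ("AAA", "AAA"), ("AAA", "AAG"),   # first internal node to its leaves
    ("AGA", "AGA"), ("AGA", "GGA"),   # second internal node to its leaves
]

cost = tree_cost(EDGES)
print(cost)
```

Three edges each carry one substitution, so this labeling costs three evolutionary events in total.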
17Method
- Maximum parsimony (MP)
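- Seek the tree that minimizes the total number of
evolutionary events on the edges of the tree
- Ex.
- Requires two algorithms
- Search over tree topology
- The computation of a cost for a given tree
[Figure: example tree whose nodes are labeled with the sequences AAA, AGA, AAG, and GGA; each edge along which one substitution occurs is labeled 1]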
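For the second algorithm, computing the cost of a given tree when only the leaf sequences are known, one standard choice is Fitch's algorithm. The slides do not name a specific method, so treat this as an illustrative sketch; it assumes a rooted binary tree written as nested tuples with equal-length sequences at the leaves.

```python
def first_leaf(tree):
    """Return the sequence at the leftmost leaf of a nested-tuple tree."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree


def fitch_site(tree, site):
    """Return (candidate state set, substitution count) for one alignment column."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one substitution is needed at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree):
    """Minimum number of substitutions needed on `tree`, summed over all sites."""
    length = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(length))


tree_1 = (("AAA", "AAG"), ("AGA", "GGA"))
tree_2 = (("AAA", "AGA"), ("AAG", "GGA"))
print(parsimony_cost(tree_1), parsimony_cost(tree_2))
```

With these four leaf sequences the first topology needs 3 substitutions and the second needs 4, so a topology search would prefer the first.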
18maximum likelihood
- estimate probability that a specific evolutionary
model will produce a particular phylogeny
yielding the observed sequences
- many evolutionary models
19Method
- Maximum likelihood (ML)
- Seek the tree that maximizes the likelihood
P(data | tree)
- Ex.
- Compute the likelihood $P(x_1, x_2, x_3 \mid T, t_1, t_2, t_3, t_4)$
- x: a set of sequences
- T: a tree
- t: edge lengths of the tree
- Requires two algorithms
- Search over tree topology
- Search over all possible lengths of edges t to
compute the likelihood
[Figure: rooted tree with leaves X1, X2, X3, internal node X4, root X5, and edge lengths t1 to t4]
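A likelihood of this form can be computed for a three-leaf tree by summing over the unknown ancestral states. The sketch below makes several assumptions that the slides do not state: the Jukes-Cantor substitution model with edge lengths in expected substitutions per site, independent sites, a uniform prior at the root, and a topology in which x1 and x2 join at x4 (edges t1, t2) while x3 and x4 join at the root x5 (edges t3, t4).

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b along an edge of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(x1, x2, x3, t1, t2, t3, t4):
    """P(x1, x2, x3 | T, t1..t4) for one site, summing over ancestors x4 and x5."""
    total = 0.0
    for x5 in BASES:          # state at the root
        for x4 in BASES:      # state at the internal node
            total += (0.25 * jc_prob(x5, x4, t4) * jc_prob(x5, x3, t3)
                      * jc_prob(x4, x1, t1) * jc_prob(x4, x2, t2))
    return total


def likelihood(x1, x2, x3, t1, t2, t3, t4):
    """Likelihood of three aligned sequences, assuming independent sites."""
    result = 1.0
    for a, b, c in zip(x1, x2, x3):
        result *= site_likelihood(a, b, c, t1, t2, t3, t4)
    return result


print(likelihood("AAG", "AAA", "GGA", 0.1, 0.1, 0.3, 0.2))
```

The brute-force sum over ancestral states is fine for three leaves; larger trees need a dynamic-programming scheme, and the search over edge lengths t wraps a numerical optimizer around a function like `likelihood`.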
20Distance Matrix Methods
- produce a tree such that the path distance
between leaves i and j (sum of edge weights in
the path between i and j) equals Dij
- this is the additive property for a distance matrix;
of course, real distance matrices may not be
additive
- most methods use agglomerative clustering:
successively choosing pairs of nodes to combine
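Whether a distance matrix is additive can be tested with the four-point condition: for every four leaves, the two largest of the three pairwise sums must be equal. The sketch below applies it to the four-sequence matrix used in the neighbor-joining example of these slides and to a perturbed copy; the perturbed matrix is hypothetical.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:  # the two largest sums must agree
            return False
    return True


# Distances between S1..S4 from the neighbor-joining example.
D = [
    [0, 4, 4, 3],
    [4, 0, 6, 5],
    [4, 6, 0, 2],
    [3, 5, 2, 0],
]

# Hypothetical noisy version: d(S1, S4) changed from 3 to 4.
D_NOISY = [row[:] for row in D]
D_NOISY[0][3] = D_NOISY[3][0] = 4

print(is_additive(D), is_additive(D_NOISY))
```

The original matrix passes (the sums are 6, 9, and 9), while the perturbed one fails, which is the usual situation with measured distances.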
21Ultrametric trees
- path distance from the root to each leaf is the
same
- strong molecular clock assumption
- distance is proportional to evolutionary time
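The defining property can be checked directly on a tree with branch lengths. In the sketch below a leaf is a name and an internal node is a list of (child, branch length) pairs; this encoding and both example trees are hypothetical.

```python
def leaf_depths(node, depth=0.0):
    """Yield (leaf name, root-to-leaf path length) for every leaf below `node`."""
    if isinstance(node, str):
        yield node, depth
    else:
        for child, length in node:
            yield from leaf_depths(child, depth + length)


def is_ultrametric(tree, tol=1e-9):
    """True if every leaf is at the same path distance from the root."""
    depths = [depth for _, depth in leaf_depths(tree)]
    return max(depths) - min(depths) <= tol


# Clock-like tree: A and B split 1 unit ago, C split off 3 units ago.
CLOCK_TREE = [([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)]

# Same topology with unequal rates along the branches.
PHYLOGRAM = [([("A", 1.0), ("B", 3.0)], 1.0), ("C", 5.0)]

print(dict(leaf_depths(CLOCK_TREE)), is_ultrametric(CLOCK_TREE))
print(dict(leaf_depths(PHYLOGRAM)), is_ultrametric(PHYLOGRAM))
```

In the first tree every leaf sits 3 units from the root; in the second the depths are 2, 4, and 5, so the molecular clock assumption does not hold.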
22Example Tree and Additive Matrix
23Distance Matrix Methods
- UPGMA
- Neighbor Joining
- Fitch-Margoliash
- Quartet Puzzling
- Witness-Antiwitness
- Double Pivot
Many are not yet in use by the
systematic biology community
24Distance Measures
- DNA hybridization amounts
- immunological distances
- genetic distances
- sequence distances (DNA, RNA, protein)
25what distance?
- need a distance measure that reflects the actual
number of point mutations on the path between the
leaves
- particular problem with sequence data: Hamming
distance and assumption of no reversals
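The Hamming distance undercounts change because repeated and reverse mutations at the same site are invisible. One standard correction, not named in the slides and shown here only as an illustration, is the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3), where p is the fraction of differing sites. The example sequences are hypothetical.

```python
import math


def p_distance(a, b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor_distance(a, b):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    p = p_distance(a, b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq_1 = "ACGTACGTAC"
seq_2 = "ACGTACGTAA"   # one difference in ten sites
print(p_distance(seq_1, seq_2), jukes_cantor_distance(seq_1, seq_2))
```

The corrected distance is always at least as large as the observed fraction of differences, and the gap widens as the sequences diverge.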
26UPGMA
- Unweighted Pair-Group Method with Arithmetic mean
27UPGMA step 1: combine B and C
28UPGMA step 2: combine BC and D
(10 + 12)/2
(4 + 6)/2
29UPGMA step 3: combine A and E
30UPGMA step 4: combine AE and BCD
31UPGMA Result
3.5
32UPGMA Result
3.5
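The UPGMA steps can be sketched as a short agglomerative loop. The distance values below are hypothetical (the slides' matrix is not reproduced here); they were chosen so that the merge order matches the four steps above. Cluster labels are built as parenthesized strings, and each merge is reported with its node height, which is half the distance between the merged clusters.

```python
def upgma(names, dist):
    """Run UPGMA; return a list of (cluster_a, cluster_b, node_height) merges.

    `dist` maps pairs of taxon names to distances.
    """
    sizes = {name: 1 for name in names}              # cluster label -> number of taxa
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        closest = min(d, key=d.get)                  # pair of clusters with smallest distance
        a, b = sorted(closest)
        merged = f"({a},{b})"
        merges.append((a, b, d[closest] / 2.0))
        for other in sizes:
            if other in (a, b):
                continue
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            # Arithmetic mean over all taxa in the two merged clusters.
            d[frozenset((merged, other))] = (
                (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b])
            )
        del d[closest]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return merges


# Hypothetical distances for five taxa.
DIST = {
    ("B", "C"): 2, ("B", "D"): 4, ("C", "D"): 4, ("A", "E"): 6,
    ("A", "B"): 10, ("A", "C"): 10, ("A", "D"): 10,
    ("E", "B"): 10, ("E", "C"): 10, ("E", "D"): 10,
}

merges = upgma(["A", "B", "C", "D", "E"], DIST)
for a, b, height in merges:
    print(a, b, height)
```

Because the averaging weights each original taxon equally, a cluster of two taxa counts twice as much as a single taxon when distances are updated.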
33Method
- Phylogenetic reconstruction techniques
- NJ (neighbor-joining method)
- Starting from a star tree, branches are successively inserted
between a pair of closest neighbors and the
remaining terminals in the tree
- Characteristics
- One of the fastest reconstruction methods
- Poor accuracy when the distance matrix contains
large values
34Method
- Ex.
- The cost saved by pairing S1 and S2 = old connection cost (OC) - new
connection cost (NC) ≈ 2.33
- OC = average(S1) + average(S2) = 8.67
- NC = ½(average(S1) + average(S2) + d(S1,S2)) = 6.33
- Here average(Si) is the mean distance from Si to the other three sequences
- The largest cost saving, 2.67, comes from pairing S3 and S4.
Thus we pair S3 and S4
Distance matrix
     S1  S2  S3  S4
S1    0   4   4   3
S2        0   6   5
S3            0   2
S4                0
[Figure: star tree, and the tree after pairing S1 and S2]
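The pairing criterion in this example can be reproduced from the distance matrix. The sketch below follows the slide's OC and NC definitions as written, not the Q-matrix formulation found in many neighbor-joining descriptions; the function names are illustrative.

```python
from itertools import combinations

NAMES = ["S1", "S2", "S3", "S4"]
D = [
    [0, 4, 4, 3],
    [4, 0, 6, 5],
    [4, 6, 0, 2],
    [3, 5, 2, 0],
]


def average(i):
    """Mean distance from sequence i to all other sequences."""
    return sum(D[i][k] for k in range(len(D)) if k != i) / (len(D) - 1)


def cost_saved(i, j):
    """Old connection cost minus new connection cost for pairing i and j."""
    old_cost = average(i) + average(j)
    new_cost = 0.5 * (average(i) + average(j) + D[i][j])
    return old_cost - new_cost


savings = {(NAMES[i], NAMES[j]): cost_saved(i, j)
           for i, j in combinations(range(len(D)), 2)}
best_pair = max(savings, key=savings.get)

for pair, value in savings.items():
    print(pair, round(value, 2))
print("pair first:", best_pair)
```

Pairing S1 and S2 saves about 2.33, while pairing S3 and S4 saves about 2.67, so S3 and S4 are joined first.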
35Neighbor-Joining Result
36Genome Rearrangement
- Generalized Nadeau-Taylor (GNT) evolution model
- P(transposition) = α
- P(inverted transposition) = β
- P(inversion) = 1 - (α + β)
- Number of events on an edge follows a Poisson distribution:
$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$, $x = 1, 2, \ldots$
[Figure: genome rearrangement]
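The two ingredients of this model, the event-type probabilities and the Poisson-distributed number of events per edge, can be written down directly. In the sketch below the values of α, β, and λ are hypothetical, and the function names are illustrative.

```python
import math
import random


def poisson_pmf(x, lam):
    """Poisson probability of x events when the expected number is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def event_probabilities(alpha, beta):
    """GNT event-type probabilities for given alpha and beta."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("need alpha >= 0, beta >= 0 and alpha + beta <= 1")
    return {
        "transposition": alpha,
        "inverted transposition": beta,
        "inversion": 1.0 - (alpha + beta),
    }


def sample_events(count, alpha, beta, rng):
    """Draw `count` event types according to the GNT probabilities."""
    probs = event_probabilities(alpha, beta)
    return rng.choices(list(probs), weights=list(probs.values()), k=count)


rng = random.Random(0)                 # fixed seed so the run is repeatable
probs = event_probabilities(0.2, 0.3)  # hypothetical alpha and beta
events = sample_events(5, 0.2, 0.3, rng)
print(probs)
print(events)
print([round(poisson_pmf(x, 3.0), 4) for x in range(6)])
```

A full simulator would draw the number of events on each edge from the Poisson distribution and then apply each sampled event to a signed gene order.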
37Improving reconstruction algorithms
38Improving reconstruction algorithms
- Estimators of true evolutionary distance
- Exact-IEBP (inverting the expected breakpoint
distance): ML estimate of the breakpoint distance
after K rearrangements
- Approx-IEBP: approximates Exact-IEBP
- EDE (empirically derived estimator): empirical
estimate of the inversion distance after K
rearrangements
- produced a nonlinear regression formula that
computes the expected distance given K
random rearrangements
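These estimators all start from an observed distance such as the breakpoint distance. The sketch below computes it for two linear genomes written as signed permutations of the same genes, using the usual convention that an adjacency (a, b) is conserved if the other genome contains (a, b) or (-b, -a). The framing genes 0 and n + 1 and the example genomes are assumptions for illustration; the estimators themselves are not implemented here.

```python
def framed(genome):
    """Add fixed start (0) and end (n + 1) markers to a signed gene order."""
    return [0] + list(genome) + [len(genome) + 1]


def adjacencies(genome):
    """All adjacencies of a genome, in both reading directions."""
    order = framed(genome)
    result = set()
    for a, b in zip(order, order[1:]):
        result.add((a, b))
        result.add((-b, -a))  # the same adjacency read on the opposite strand
    return result


def breakpoint_distance(genome_1, genome_2):
    """Number of adjacencies of genome_1 that are not conserved in genome_2."""
    reference = adjacencies(genome_2)
    order = framed(genome_1)
    return sum((a, b) not in reference for a, b in zip(order, order[1:]))


identity = [1, 2, 3, 4]
one_inversion = [1, -3, -2, 4]   # genes 2..3 inverted
print(breakpoint_distance(one_inversion, identity))
```

A single inversion creates at most two breakpoints, so the breakpoint distance grows more slowly than the true number of events and eventually saturates, which is why corrected estimators are needed.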
39Conclusion
- New generation of phylogenetic software needs
- More sophisticated models of evolution
- Faster optimization algorithms
- High performance algorithm engineering
- Powerful modes of user interaction