Title: Solving Phylogenetic Trees
1Solving Phylogenetic Trees
- Benjamin Loyle
- March 16, 2004
- CSE 397 Intro to MBIO
2Table of Contents
- Problem and Term Definitions
- A DCM-NJ Solution
- Performance Measurements
- Possible Improvements
3Phylogeny
[Figure: phylogeny of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life website, University of Arizona]
4DNA Sequence Evolution
5Problem Definition
- The Tree of Life
- Connecting all living organisms
- All encompassing
- Find evolution from simple beginnings
- Even smaller relations are tough
- Impossible
- Infer possible ancestral history.
6So what.
- Genome sequencing provides an entire map of a species, so why link them?
- We can understand evolution
- Viable drug testing and design
- Predict the function of genes
- Influenza evolution
7Why is that a problem?
- Over 8 million organisms
- The optimization problems behind current solutions are NP-hard
- Computing a few hundred species takes years
- Error is a very large factor
8What do we want?
- Input
- A collection of nodes such as taxa or protein strings to compare in a tree
- Output
- A topological link to compare those nodes to each other
- When do we want it?
- FAST!
9Preparing the input
- Create a distance matrix
- Sum up all of the known distances into a matrix sized n x n
- n is the number of nodes or taxa
- Found with sequence comparison
10Distance Matrix
Take 5 separate DNA strings:
A GATCCATGA
B GATCTATGC
C GTCCCATTT
D AATCCGATC
E TCTCGATAG
Counting the positions at which two strings differ, the distance between A and B is 2 and the distance between A and C is 4. This is subjective, based on what your criteria are.
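The distances quoted on this slide match a simple mismatch count (Hamming distance) between equal-length strings. As a minimal sketch, assuming that criterion, the full matrix for the five strings can be built as follows; the function names `hamming` and `distance_matrix` are illustrative, not part of the original slides.

```python
from itertools import combinations

# The five example strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Symmetric n x n matrix stored as a dict keyed by (name, name)."""
    d = {(x, x): 0 for x in seqs}
    for x, y in combinations(seqs, 2):
        d[x, y] = d[y, x] = hamming(seqs[x], seqs[y])
    return d

d = distance_matrix(seqs)
print(d["A", "B"], d["A", "C"])  # 2 4
```

A different criterion (for example, one that corrects for multiple changes at the same site) would give different numbers for the same strings.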
11Distance Matrix
- Let's start with an example matrix
[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
12Let's make it simple (constrain the input)
- Let's keep the distance between nodes within a certain limit
- From F -> G
- F and G have the largest distance; they are the most dissimilar of any nodes.
- This is called the diameter of the tree
- Let's keep the length of the input (length of the strings) polynomial in the number of nodes.
13ERROR?!?!!?
- All trees are inferred, so how do you ever know if you're right?
- How accurate do we have to be?
- We can create data sets to test the trees that we create, and assume that a method that works on them will then work in the real world
14Data Sets
- JC Model
- Sites evolve independently
- Sites change with the same probability
- Changes are single character changes
- i.e. A -> G or T -> C
- The number of changes on an edge e is a Poisson variable with expectation λ(e)
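The JC assumptions above can be turned into a small simulator for one edge of a tree. This is a sketch under stated assumptions: each site draws its own Poisson number of changes with mean `lam` (standing in for λ(e)), and each change moves to one of the three other bases with equal probability. The helper names `poisson` and `evolve_jc` are hypothetical.

```python
import random

BASES = "ACGT"

def poisson(lam, rng):
    """Sample a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def evolve_jc(seq, lam, rng):
    """Evolve seq along one edge; every site is treated independently."""
    out = []
    for base in seq:
        for _ in range(poisson(lam, rng)):
            # A change always moves to one of the three other bases.
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(0)
ancestor = "GATCCATGA"
child = evolve_jc(ancestor, 0.3, rng)
print(ancestor, child)
```

Calling `evolve_jc` repeatedly down the edges of a chosen tree gives leaf sequences whose true history is known, which is what makes the simulated data usable as a test set.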
15More Data Sets
- K2P Model
- Based on JC Model
- Allows transitions to have a different probability than transversions
- It's more likely for A and G to switch and for C and T to switch (transitions) than for a purine and a pyrimidine to switch (transversions)
- Normally set to twice as likely
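A sketch of how the K2P bias could enter a simulator: when a change occurs, the transition partner (A<->G, C<->T) is drawn with a higher weight than either transversion partner. The parameter `ts_weight` and the function `k2p_change` are illustrative names, and reading "twice as likely" as a weight of 2 against 1 for each transversion is an assumption.

```python
import random

# Transition partners: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def k2p_change(base, ts_weight, rng):
    """Pick the base that replaces `base` when a change occurs.

    The single transition partner gets weight ts_weight; each of the two
    transversion partners gets weight 1.
    """
    transition = TRANSITION[base]
    transversions = [b for b in "ACGT" if b not in (base, transition)]
    return rng.choices([transition] + transversions,
                       weights=[ts_weight, 1, 1])[0]

rng = random.Random(0)
new_base = k2p_change("A", 2.0, rng)
print(new_base)
```

Setting `ts_weight` to 1 makes all three replacements equally likely, which recovers the JC behavior.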
16Data Use
- Using these data sets we can create our own evolution of data.
- Start with one ancestor and create evolutions
- Plug the evolutions back in and see if you get what you started with
17Aspects of Trees
- Topology
- The way in which nodes are connected to each other
- Are we really connected to apes directly, or just linked long before we could be considered mammals?
- Distance
- The sum of the weighted edges to reach one node from another
18What can distance tell us?
- The distance between nodes IS the evolutionary distance between the nodes
- The distance between an ancestor and a leaf (present day object) can be interpreted as an estimate of the number of evolutionary steps that occurred.
19Current Techniques
- Maximum Parsimony
- Minimize the total number of evolutionary events
- Find the tree that has a minimum number of changes from ancestors
- Maximum Likelihood
- Probability based
- Which tree is most probable to occur based on current data
20More Techniques
- Neighbor Joining
- Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
- It shrinks the distance matrix by considering two neighbors as one node
21Learning Neighbor Joining
- The reason will become apparent later on, but let's learn how to do Neighbor Joining (NJ)
[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
22NJ Part 1
- First start with a star tree
[Figure: star tree with leaves A, B, C, D, E]
23NJ Part 2
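- Combine the two closest nodes (from the distance matrix; full NJ first corrects each distance for how far the pair is from all other nodes)
- In our case it is nodes A and B at distance 3
[Figure: tree on leaves A, B, C, D, E with A and B joined]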
24NJ Part 3
- Repeat this until you have added n-2 internal nodes (3 in our case)
- n-2 internal nodes make it a binary tree, so we only have to include one more node.
[Figure: tree on leaves A, B, C, D, E]
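The three NJ steps can be sketched in code. This sketch uses the standard NJ selection rule, which picks the pair minimizing (n - 2) * d(i, j) - r(i) - r(j), where r is a node's total distance to all others, rather than the raw closest pair. The distance matrix below is a hypothetical additive example, not the matrix from the slides, and edge lengths are omitted so that only the topology is returned.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Join neighbors until three nodes remain.

    dist maps frozenset({x, y}) -> distance. Returns the three remaining
    nodes (nested tuples describe joined subtrees) and the list of joins.
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        # r[x] is the total distance from x to every other current node.
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Standard NJ criterion: minimize (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (a, b)
        joins.append(new)
        # Shrink the matrix: a and b are replaced by the single node `new`.
        for k in nodes:
            if k != a and k != b:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))]
                    - d[frozenset((a, b))])
        nodes = [k for k in nodes if k != a and k != b] + [new]
    return tuple(nodes), joins

# Hypothetical additive distances on five taxa.
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(k): v for k, v in pairs.items()}
final, joins = neighbor_joining(names, dist)
print(joins[0])  # ('a', 'b')
```

In this example d and e are the closest pair by raw distance (3), yet the NJ criterion joins a and b first, which shows why the correction for distance to all other nodes matters.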
25Are we done?
- ML and MP, even in heuristic form, take too long for large data sets
- NJ has poor topological accuracy, especially for large diameter trees
- We need something that works for large diameter trees and can be run fast.
26Here's what we want
- Our Goal
- An Absolute Fast Converging Method
- A method Φ is afc if, for all positive f, g, and ε, on the model M, there is a polynomial p such that, for all (T, {λ(e)}) in the set M_{f,g}, on a set S of n sequences of length at least p(n) generated on T, we have Pr[Φ(S) = T] > 1 - ε.
- Simply: let's get the true tree, within a degree of error, from sequences of polynomial length.
27A DCM - NJ Solution
- Two-phase construction of a final phylogenetic tree given a distance matrix d.
- Phase 1: Create a set of plausible trees for the distance matrix
- Phase 2: Find the best fitting tree
28Phase 1
- For each q in {dij}, compute a tree tq
- Let T = { tq : q in {dij} }
29Finding tq
- Step 1 Compute Thresh(d,q)
- Step 2 Triangulate Thresh(d,q)
- Step 3 Compute a NJ Tree for all maximal cliques
- Step 4 Merge the subtrees into a supertree
30What does that mean
- Breaking the problem up
- Create a threshold of diameters to break the problem into
- A bunch of smaller diameter trees (cliques)
- Apply NJ to those cliques
- Merge them back
31Finding tq (terms)
- Threshold Graph
- Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if dij ≤ q.
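The definition translates directly into code. A minimal sketch, assuming the distance matrix is stored as a dict keyed by unordered pairs; the matrix values below are hypothetical and are not the ones from the slides.

```python
def threshold_graph(names, d, q):
    """Adjacency sets of Thresh(d, q): (i, j) is an edge iff d(i, j) <= q."""
    adj = {x: set() for x in names}
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if d[frozenset((x, y))] <= q:
                adj[x].add(y)
                adj[y].add(x)
    return adj

# Hypothetical distances on five taxa.
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(k): v for k, v in pairs.items()}

adj = threshold_graph(names, d, 5)
print(adj["a"], adj["d"], adj["c"])  # {'b'} {'e'} set()
```

Raising q adds edges, so the largest distance in the matrix gives the complete graph and small thresholds give many small components.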
32Threshold
- Let's bring back our distance matrix and create a threshold with q equal to d15, the distance between A and E
- So q = 67
33Distance Matrix
[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
34With q = d15 = 67
[Figure: threshold graph on nodes A, B, C, D, E with edge weights 16, 47, 63, and 67]
35Triangulating
- A graph is triangulated if every cycle with four or more vertices has a chord
- That is, an edge joining two nonconsecutive vertices of the cycle.
- Our example is already triangulated, but let's look at another
36Triangulating
Let's say this is for q = 5
The edges of length 10 and 15 would not be in the graph
To triangulate this graph you add the edge of length 10.
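Whether a graph is already triangulated can be checked with a short routine. The sketch below relies on a known property of triangulated (chordal) graphs: a graph is chordal exactly when vertices can be removed one at a time such that each removed vertex's remaining neighbors form a clique. The function name `is_triangulated` and the example graphs are illustrative.

```python
def is_triangulated(adj):
    """True if the graph is chordal, by repeatedly removing simplicial vertices."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while adj:
        for v, ns in adj.items():
            # v is simplicial if its neighbors are pairwise adjacent.
            if all(b in adj[a] for a in ns for b in ns if a != b):
                break
        else:
            return False  # no simplicial vertex: a chordless cycle exists
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return True

# A 4-cycle has no chord, so it is not triangulated.
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
# Adding the chord a-c triangulates it.
fixed = {"a": {"b", "c", "d"}, "b": {"a", "c"},
         "c": {"a", "b", "d"}, "d": {"a", "c"}}
print(is_triangulated(square), is_triangulated(fixed))  # False True
```

This only tests for triangulation. Choosing which edges to add, as in the greedy heuristic mentioned later, is a separate step.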
37Maximal Cliques
- A clique that cannot be enlarged by the addition of another vertex.
- Recall our original threshold graph, which is triangulated
38Triangulated Threshold Graph
[Figure: threshold graph on nodes A, B, C, D, E with edge weights 16, 47, 63, and 67]
39Clique
- Our maximal cliques would be
- A, B, E
- C, D
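The maximal cliques can be enumerated with the basic Bron-Kerbosch recursion. This is a generic sketch, not the specialized polynomial-time routine for triangulated graphs. The edge set below (A-B, A-E, B-E, C-D) is an assumption chosen to be consistent with the two cliques named on this slide.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Assumed edges of the example threshold graph.
adj = {
    "A": {"B", "E"}, "B": {"A", "E"}, "E": {"A", "B"},
    "C": {"D"}, "D": {"C"},
}
cliques = maximal_cliques(adj)
print(sorted(sorted(c) for c in cliques))  # [['A', 'B', 'E'], ['C', 'D']]
```

Each returned clique is then handed to NJ as its own small problem.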
40Create Trees for the Cliques
- We have two maximal cliques, so we make two trees: A, B, E and C, D
- How do we make these trees?
- Remember NJ?
41Tree A, B, E and C,D
[Figure: NJ trees for the cliques A, B, E and C, D]
42Merge your separate trees together.
- Create one Supertree
- This is done by creating a minimum set of edges in the trees and calling that the backbone
- This is its own doctoral thesis, so let's do a little hand waving
43That sounds like NP-hard!
- Computing the threshold graph is polynomial
- Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
- Finding maximal cliques is polynomial only if the input graph is triangulated (which it is!).
- If all previous steps are done, creating a supertree can be done in polynomial time as well.
44Where are we now?
- We now have a finalized phylogeny created from smaller trees in our matrix joined together
- Remember we started from all possible sizes of smaller trees.
45Phase 2
- Which one is right?
- Found using the SQS (Short Quartet Support) method
- Let T be a tree in S (made from phase 1)
- Break the data into sets of four taxa
- {A, B, C, D}, {A, C, D, E}, {A, B, D, E}, etc.
- Reduce the larger tree to only hold one set
- These are called Quartets
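Enumerating the four-taxon sets is a one-liner, and each set can be given a topology directly from the distance matrix. The sketch below uses the four-point method (pick the pairing with the smallest sum of within-pair distances), which is one common way to resolve a quartet from distances. The slides do not say how quartets are resolved, so this choice is an assumption, as are the matrix values.

```python
from itertools import combinations

def quartet_split(d, a, b, c, e):
    """Four-point method: return the pairing with the smallest distance sum."""
    def dist(x, y):
        return d[frozenset((x, y))]
    options = [
        (dist(a, b) + dist(c, e), ((a, b), (c, e))),
        (dist(a, c) + dist(b, e), ((a, c), (b, e))),
        (dist(a, e) + dist(b, c), ((a, e), (b, c))),
    ]
    return min(options, key=lambda o: o[0])[1]

# Hypothetical distances on five taxa.
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(k): v for k, v in pairs.items()}

quartets = list(combinations(names, 4))  # every set of four taxa
splits = {q: quartet_split(d, *q) for q in quartets}
print(len(quartets))                   # 5
print(splits[("a", "b", "c", "d")])    # (('a', 'b'), ('c', 'd'))
```

With n taxa there are n choose 4 quartets, which is why restricting attention to short (small diameter) quartets keeps the check manageable.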
46SQS - A Guide
- Q(T) is the set of trees induced by T on each set of four leaves.
- Let Qw (a different Q) be the set of quartets with diameter less than or equal to w
- Find the maximum w where the quartets are inclusive of the nodes of the tree
- This w is the support of that tree
47SQS - Rephrased
- Qw is the set of quartet trees which have a diameter ≤ w
- Support of T is the max w where Qw is a subset of Q(T)
- Support is our quality measure
- What exactly are we measuring?
48Qw
[Figure: quartet trees on four-leaf subsets of A, B, C, D, E]
49SQS Method
- Return the tree whose support is the maximum.
- If more than one such tree exists, return the tree found first.
- This is the tree with the smallest original diameter (remember from phase 1)
50How do we know we're right?
- Compare it to the data set we created
- Look at Robinson-Foulds accuracy
- Remove one edge in the tree we've created.
- We now have two trees
- Is there any way to create the same two sets of leaves by removing one edge in the tree from our data set?
- If no, add a point of error.
- Repeat this for all edges
- When the value is not zero, the trees are not identical
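The edge-by-edge comparison can be sketched by turning each internal edge into the split of the leaf set it induces and comparing the two sets of splits. Trees are written here as nested tuples of leaf names, which is an assumed representation. The slide's procedure counts splits of the inferred tree that are missing from the true tree; the usual Robinson-Foulds distance, computed below, also adds the count in the other direction.

```python
def leaves(tree):
    """Set of leaf names under a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(c) for c in tree))

def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by internal edges."""
    all_leaves = leaves(tree)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        # Ignore edges that cut off a single leaf or nothing at all.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            walk(child)

    walk(tree)
    return found

def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    s1, s2 = splits(t1), splits(t2)
    return len(s1 - s2) + len(s2 - s1)

true_tree = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))
same_shape = ("A", "B", ("C", ("D", "E")))
print(robinson_foulds(true_tree, inferred))    # 2
print(robinson_foulds(true_tree, same_shape))  # 0
```

The second comparison gives zero because the two tuples describe the same unrooted tree written from different starting points.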
51Performance of DCM - NJ
- Outperforms NJ method at sequence lengths above 4000 and with more taxa.
[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]
52Improvements
- Improvement possibilities lie in Phase 2
- Include test of Maximum Parsimony (MP)
- Try and minimize the overall size of the tree
- Test using statistical evidence
- Maximum Likelihood (ML)
53Performance gains
- Simply changing Phase 2 gives massive gains in accuracy!
- DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets greater than 4000 and are NOT NP-hard.
- DCM-NJ+MP finished its analysis on a 107 taxon tree in under three minutes.
54Comparing Improvements
[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP]