Solving Phylogenetic Trees presentation

1
Solving Phylogenetic Trees

Benjamin Loyle
March 16, 2004
CSE 397 Intro to MBIO

2
Table of Contents

Problem Term Definitions
A DCM-NJ Solution
Performance Measurements
Possible Improvements

3
Phylogeny
[Figure: phylogeny of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life Website, University of Arizona]
4
DNA Sequence Evolution
5
Problem Definition

The Tree of Life
Connecting all living organisms
All encompassing
Find evolution from simple beginnings
Even smaller relations are tough
Impossible
Infer possible ancestral history.

6
So what?

Genome sequencing provides an entire map of a
species, so why link them?
We can understand evolution
Viable drug testing and design
Predict the function of genes
Influenza evolution

7
Why is that a problem?

Over 8 million organisms
Current formulations (e.g., maximum parsimony) are NP-hard to solve
Computing a few hundred species takes years
Error is a very large factor

8
What do we want?

Input
A collection of nodes such as taxa or protein
strings to compare in a tree
Output
A topological link to compare those nodes to each
other
When do we want it?
FAST!

9
Preparing the input

Create a distance matrix
Sum up all of the known distances into a matrix
sized n x n
N is the number of nodes or taxa
Found with sequence comparison

10
Distance Matrix
Take 5 separate DNA strings:
A: GATCCATGA
B: GATCTATGC
C: GTCCCATTT
D: AATCCGATC
E: TCTCGATAG
The distance between A and B is 2.
The distance between A and C is 4.
This is subjective, based on what your criteria are.
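The two distances quoted on this slide match a simple count of mismatched positions (Hamming distance) between equal-length strings. A minimal sketch under that assumption follows; the function names hamming and distance_matrix are illustrative, and other criteria (for example, model-corrected distances) would give different numbers.

```python
def hamming(a, b):
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Return (names, n x n matrix) of pairwise Hamming distances."""
    names = sorted(seqs)
    matrix = [[hamming(seqs[r], seqs[c]) for c in names] for r in names]
    return names, matrix


# The five strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}
names, d = distance_matrix(seqs)
for name, row in zip(names, d):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, which is the n x n input shape the earlier slide asks for.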
11
Distance Matrix

Let's start with an example matrix

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
12
Let's make it simple (constrain the input)

Let's keep the distance between nodes within a
certain limit
From F -> G
F and G have the largest distance; they are the
most dissimilar of any nodes.
This is called the diameter of the tree
Let's keep the length of the input (length of the
strings) polynomial.

13
ERROR?!?!!?

All trees are inferred, so how do you ever know if
you're right?
How accurate do we have to be?
We can create data sets to test trees that we
create and assume that it will then work in the
real world

14
Data Sets

JC Model
Sites evolve independently
Sites change with the same probability
Changes are single character changes
i.e., A -> G or T -> C
The number of changes on an edge e is a Poisson
variable with expectation λ(e)

15
More Data Sets

K2P Model
Based on JC Model
Allows a different probability for transitions than
for transversions
It's more likely for A and G to switch and for C and
T to switch (transitions)
Normally set to twice as likely

16
Data Use

Using these data sets we can create our own
evolution of data.
Start with one ancestor and create evolutions
Plug the evolutions back and see if you get what
you started with
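As a rough illustration of creating our own evolution of data, here is a minimal sketch of evolving one sequence along a single edge under the JC assumptions listed earlier. It assumes each site independently receives a Poisson-distributed number of substitutions with mean lam, and that each substitution picks one of the other three bases uniformly. The names poisson, evolve, and lam are illustrative, not taken from the presentation.

```python
import math
import random

BASES = "ACGT"


def poisson(lam, rng):
    """Draw a Poisson(lam) count using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def evolve(seq, lam, rng):
    """Evolve seq along one edge; lam is the expected changes per site."""
    out = []
    for base in seq:
        # Sites evolve independently, all with the same rate.
        for _ in range(poisson(lam, rng)):
            # Each change is a single-character substitution.
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


rng = random.Random(0)
ancestor = "GATCCATGA"
child = evolve(ancestor, 0.5, rng)
print(ancestor, "->", child)
```

Applying evolve repeatedly down a known tree gives leaf sequences whose true history is known, so an inferred tree can be checked against it.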

17
Aspects of Trees

Topology
The method in which nodes are connected to each
other
Are we really connected to apes directly, or
just linked long before we could be considered
mammals?
Distance
The sum of the weighted edges to reach one node
from another

18
What can distance tell us?

The distance between nodes IS the evolutionary
distance between the nodes
The distance between an ancestor and a leaf
(present-day object) can be interpreted as an
estimate of the number of evolutionary steps
that occurred.

19
Current Techniques

Maximum Parsimony
Minimize the total number of evolutionary events
Find the tree that has the minimum number of
changes from ancestors
Maximum Likelihood
Probability based
Which tree is most probable to occur based on
current data

20
More Techniques

Neighbor Joining
Repeatedly joins pairs of leaves (or subtrees) by
rules of numerical optimization
It shrinks the distance matrix by considering two
neighbors as one node

21
Learning Neighbor Joining

It will become apparent why later on, but let's learn
how to do Neighbor Joining (NJ)

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
22
NJ Part 1

First start with a star tree

[Figure: star tree with leaves A, B, C, D, E]
23
NJ Part 2

Combine the closest two nodes (from distance
matrix)
In our case it is node A and B at distance 3

[Figure: tree on A, B, C, D, E after joining A and B]
24
NJ Part 3

Repeat this until you have added n-2 nodes (3)
n-2 will make it a binary tree, so we only have
to include one more node.

[Figure: tree on A, B, C, D, E after further joins]
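The three NJ slides can be sketched in code. One caveat: the slides pick the closest pair by raw distance, while the standard NJ algorithm (Saitou and Nei) picks the pair minimizing a corrected criterion Q(a,b) = (n-2) d(a,b) - r(a) - r(b), where r is a row sum. The sketch below uses the standard criterion. The function name, the internal node labels U0, U1, ..., and the 5 x 5 matrix are all illustrative; the matrix is a made-up additive example, not the one from the slides.

```python
def neighbor_joining(names, dist):
    """Return the edges (node, node, length) of an unrooted NJ tree."""
    # Working distances as a dict of dicts so new nodes can be added.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair minimizing Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = "U%d" % next_id
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Shrink the matrix: a and b are replaced by the single node u.
        d[u] = {u: 0.0}
        for c in nodes:
            if c != a and c != b:
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical additive distances on five taxa.
taxa = ["a", "b", "c", "d", "e"]
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree_edges = neighbor_joining(taxa, example)
for edge in tree_edges:
    print(edge)
```

With n leaves the loop adds n-2 internal nodes and the result has 2n-3 edges, which is the binary tree the slide describes.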
25
Are we done?

ML and MP, even in heuristic form, take too long
for large data sets
NJ has poor topological accuracy, especially for
large diameter trees
We need something that works for large diameter
trees and can be run fast.

26
Here's what we want

Our Goal
An Absolute Fast Converging Method
A method Φ is afc if, for all positive f, g, ε, on the
model M, there is a polynomial p such that, for
all (T, {λ(e)}) in the set M_f,g, on a set S of n
sequences of length at least p(n) generated on T,
we have Pr[Φ(S) = T] > 1 - ε.
Simply: let's do it in polynomial time, within a
chosen margin of error.

27
A DCM - NJ Solution

Two-phase construction of a final phylogenetic tree
given a distance matrix d.
Phase 1: Create a set of plausible trees for the
distance matrix
Phase 2: Find the best fitting tree

28
Phase 1

For each q in {d_ij}, compute a tree t_q
Let T = {t_q : q in {d_ij}}

29
Finding tq

Step 1: Compute Thresh(d,q)
Step 2: Triangulate Thresh(d,q)
Step 3: Compute an NJ tree for each maximal clique
Step 4: Merge the subtrees into a supertree

30
What does that mean?

Breaking the problem up
Set a distance threshold to break the
problem into
a bunch of smaller-diameter subproblems (cliques)
Apply NJ to those cliques
Merge them back

31
Finding tq (terms)

Threshold Graph
Thresh(d,q) is the threshold graph where (i,j) is
an edge if and only if d_ij ≤ q.

32
Threshold

Let's bring back our distance matrix and create a
threshold with q equal to d15, the distance
between A and E
So q = 67
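Building the threshold graph for a given q takes only a few lines. The sketch below keeps an edge when the distance is at most q, so the pair that defines q (here A and E) is itself an edge. Apart from d(A,E) = 67, the matrix entries are invented for illustration and are not the values from the slides; the function name threshold_graph is also illustrative.

```python
def threshold_graph(names, d, q):
    """Return the edge set of Thresh(d, q): pairs with distance <= q."""
    edges = set()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if d[i][j] <= q:
                edges.add(frozenset((names[i], names[j])))
    return edges


# Hypothetical distances; only d(A, E) = 67 comes from the slides.
labels = ["A", "B", "C", "D", "E"]
dmat = [
    [0, 63, 90, 85, 67],
    [63, 0, 88, 92, 16],
    [90, 88, 0, 47, 95],
    [85, 92, 47, 0, 81],
    [67, 16, 95, 81, 0],
]
q = dmat[0][4]  # d15, the distance between A and E
graph_edges = threshold_graph(labels, dmat, q)
for edge in sorted(sorted(e) for e in graph_edges):
    print(edge)
```

Phase 1 repeats this for every distinct value q in the matrix, so small q gives many small pieces and large q gives one dense graph.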

33
Distance Matrix

Our old example matrix

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
34
With q = d15 = 67
[Figure: threshold graph on A, B, C, D, E with edge weights 47, 67, 63, and 16]
35
Triangulating

A graph is triangulated if any cycle with four
or more vertices has a chord
That is, an edge joining two nonconsecutive
vertices of the cycle.
Our example is already triangulated, but let's
look at another

36
Triangulating
Let's say this is for q = 5
Edges of length 10 and 15 would not be in the graph
To triangulate this graph you add the edge of
length 10.
37
Maximal Cliques

A clique that cannot be enlarged by the addition
of another vertex.
Recall our original threshold graph which is
triangulated

38
Triangulated Threshold Graph

Our old graph

[Figure: threshold graph on A, B, C, D, E with edge weights 47, 67, 63, and 16]
39
Clique

Our maximal cliques would be
{A, B, E}
{C, D}
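To check the two cliques by machine, here is a minimal sketch that enumerates maximal cliques with the basic Bron-Kerbosch recursion. This is a general-purpose method chosen for brevity, not necessarily what DCM-NJ uses; on a triangulated graph, faster approaches based on a perfect elimination ordering exist. The adjacency below encodes the edges implied by the cliques on this slide (A-B, A-E, B-E, C-D).

```python
def maximal_cliques(adj):
    """List all maximal cliques of a graph given as {vertex: set(neighbors)}."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


adjacency = {
    "A": {"B", "E"},
    "B": {"A", "E"},
    "C": {"D"},
    "D": {"C"},
    "E": {"A", "B"},
}
found = maximal_cliques(adjacency)
for clique in sorted(sorted(c) for c in found):
    print(clique)
```

Each clique returned here becomes one small NJ problem in the next step.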

40
Create Trees for the Cliques

We have two maximal cliques, so we make two
trees: {A, B, E} and {C, D}
How do we make these trees?
Remember NJ?

41
Trees {A, B, E} and {C, D}
[Figure: one tree on A, B, E and one tree on C, D]
42
Merge your separate trees together.

Create one Supertree
This is done by creating a minimum set of edges
in the trees and calling that the backbone
This is its own doctoral thesis, so let's do a
little hand waving

43
That sounds like NP-hard!

Computing the threshold graph is polynomial
Minimally triangulating is NP-hard, but a triangulation can be
obtained in polynomial time using a greedy
heuristic without too much loss in performance.
Finding maximal cliques is polynomial only if the input
graph is triangulated (which it is!).
Once all of the previous steps are done, creating a supertree
can be done in polynomial time as well.

44
Where are we now?

We now have a finalized phylogeny created
from smaller trees in our matrix joined together
Remember we started from all possible sizes of
smaller trees.

45
Phase 2

Which one is right?
Found using the SQS (Short Quartet Support)
method
Let T be a tree in S (made in Phase 1)
Break the data into sets of four taxa
{A, B, C, D}, {A, C, D, E}, {A, B, D, E}, etc.
Reduce the larger tree to only hold one set
These are called Quartets

46
SQS - A Guide

Q(T) is the set of trees induced by T on each set
of four leaves.
Let Qw (different Q) be a set of quartets with
diameter less than or equal to w
Find the maximum w where the quartets are
inclusive of the nodes of the tree
This w is the support of that tree

47
SQS - Rephrased

Qw is the set of quartet trees which have a
diameter ≤ w
Support of T is the max w where Qw is a subset of
Q(T)
Support is our quality measure
What exactly are we measuring?

48
Qw
[Figure: quartet trees over the taxa A, B, C, D, E]
49
SQS Method

Return the tree in which the support of that tree
is the maximum.
If more than one such tree exists return the tree
found first.
This is the tree with the smallest original
diameter (remember from phase 1)

50
How do we know we're right?

Compare it to the data set we created
Look at Robinson-Foulds accuracy
Remove one edge in the tree we've created.
We now have two trees
Is there any way to create the same sets of leaves
by removing one edge in the tree from our data set?
If no, add a point of error.
Repeat this for all edges
When the value is not zero, the trees are not
identical
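The edge-by-edge comparison above amounts to comparing the leaf splits (bipartitions) that each internal edge induces. A minimal sketch follows, assuming trees are written as nested tuples of leaf names; that representation, the function names, and the two example trees are illustrative. The function returns two counts: splits found only in the inferred tree, and splits found only in the reference tree.

```python
def bipartitions(tree):
    """Return the nontrivial leaf splits induced by the edges of a tree."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(c) for c in node))

    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(c) for c in node))
        other = all_leaves - below
        # Skip edges that only cut off a single leaf (or nothing).
        if len(below) > 1 and len(other) > 1:
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def rf_errors(inferred, reference):
    """Count splits only in inferred, and splits only in reference."""
    bi, br = bipartitions(inferred), bipartitions(reference)
    return len(bi - br), len(br - bi)


# Hypothetical trees on the five example taxa.
reference_tree = ((("A", "B"), "E"), ("C", "D"))
inferred_tree = ((("A", "E"), "B"), ("C", "D"))
print(rf_errors(inferred_tree, reference_tree))
print(rf_errors(reference_tree, reference_tree))
```

Both counts are zero exactly when the two trees have the same set of splits; dividing by the number of internal edges gives an error rate.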

51
Performance of DCM - NJ

Outperforms NJ method at sequence lengths above
4000 and with more taxa.

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]
52
Improvements

Improvement possibilities lie in Phase 2
Include test of Maximum Parsimony (MP)
Try and minimize the overall size of the tree
Test using statistical evidence
Maximum Likelihood (ML)

53
Performance gains

Simply changing Phase 2 has massive gains in
accuracy!
DCM-NJ+MP and DCM-NJ+ML are VERY accurate
for data sets greater than 4000 and are NOT
NP-hard.
DCM-NJ+MP finished its analysis on a 107-taxon
tree in under three minutes.

54
Comparing Improvements
[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP]
