Solving Phylogenetic Trees PowerPoint PPT Presentation

Title: Solving Phylogenetic Trees


1
Solving Phylogenetic Trees
  • Benjamin Loyle
  • March 16, 2004
  • Cse 397 Intro to MBIO

2
Table of Contents
  • Problem Term Definitions
  • A DCM-NJ Solution
  • Performance Measurements
  • Possible Improvements

3
Phylogeny
[Figure: phylogeny of Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life website, University of Arizona]
4
DNA Sequence Evolution
5
Problem Definition
  • The Tree of Life
  • Connecting all living organisms
  • All encompassing
  • Find evolution from simple beginnings
  • Even smaller relations are tough
  • Impossible
  • Infer possible ancestral history.

6
So what?
  • Genome sequencing provides entire map of a
    species, why link them?
  • We can understand evolution
  • Viable drug testing and design
  • Predict the function of genes
  • Influenza evolution

7
Why is that a problem?
  • Over 8 million organisms
  • Current solutions are NP-hard
  • Computing a few hundred species takes years
  • Error is a very large factor

8
What do we want?
  • Input
  • A collection of nodes such as taxa or protein
    strings to compare in a tree
  • Output
  • A topological link to compare those nodes to each
    other
  • When do we want it?
  • FAST!

9
Preparing the input
  • Create a distance matrix
  • Sum up all of the known distances into a matrix
    sized n x n
  • N is the number of nodes or taxa
  • Found with sequence comparison

10
Distance Matrix
Take 5 separate DNA strings:
  A  GATCCATGA
  B  GATCTATGC
  C  GTCCCATTT
  D  AATCCGATC
  E  TCTCGATAG
The distance between A and B is 2.
The distance between A and C is 4.
This is subjective, based on what your criteria are.
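The two distances quoted on this slide match a simple count of mismatched positions (Hamming distance). The sketch below assumes that criterion and builds the full n x n matrix from the five strings; the function names `hamming` and `distance_matrix` are illustrative, and other criteria (for example, model-corrected distances) would give different numbers.

```python
from itertools import combinations


def hamming(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))


def distance_matrix(seqs):
    """Return a symmetric dict {(x, y): distance} over all names in seqs."""
    d = {(x, x): 0 for x in seqs}
    for x, y in combinations(seqs, 2):
        d[x, y] = d[y, x] = hamming(seqs[x], seqs[y])
    return d


seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}
d = distance_matrix(seqs)
print(d["A", "B"], d["A", "C"])  # 2 4
```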
11
Distance Matrix
  • Let's start with an example matrix

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
12
Let's make it simple (constrain the input)
  • Let's keep the distance between nodes within a
    certain limit
  • From F -> G
  • F and G have the largest distance; they are the
    most dissimilar of any nodes.
  • This is called the diameter of the tree
  • Let's keep the length of the input (length of the
    strings) polynomial.

13
ERROR?!?!!?
  • All trees are inferred, so how do you ever know if
    you're right?
  • How accurate do we have to be?
  • We can create data sets to test trees that we
    create and assume that it will then work in the
    real world

14
Data Sets
  • JC Model
  • Sites evolve independently
  • Sites change with the same probability
  • Changes are single character changes
  • i.e., A -> G or T -> C
  • The number of changes on an edge e is a Poisson
    variable with expectation λ(e)

15
More Data Sets
  • K2P Model
  • Based on JC Model
  • Allows different probabilities for transitions
    and transversions
  • It's more likely for A and G to switch and for C
    and T to switch (transitions)
  • Normally set to twice as likely

16
Data Use
  • Using these data sets we can create our own
    evolution of data.
  • Start with one ancestor and create evolutions
  • Plug the evolutions back and see if you get what
    you started with
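The idea of evolving data from one ancestor can be sketched as follows. This is a deliberately simplified Jukes-Cantor-style simulation, not the generator used for the presentation's experiments: it assumes every branch has the same per-site change probability `p_change`, and the names `evolve` and `simulate` and the nested-tuple tree encoding are hypothetical.

```python
import random

BASES = "ACGT"


def evolve(seq, p_change, rng):
    """Copy seq; each site independently changes with probability p_change
    to one of the three other bases, chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(seq, tree, p_change, rng):
    """tree is a leaf name or a tuple of subtrees.
    Returns {leaf name: sequence evolved down to that leaf}."""
    if isinstance(tree, str):
        return {tree: seq}
    leaves = {}
    for child in tree:
        child_seq = evolve(seq, p_change, rng)  # one branch of evolution
        leaves.update(simulate(child_seq, child, p_change, rng))
    return leaves


rng = random.Random(0)  # fixed seed so runs are repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(50))
true_tree = (("A", "B"), ("C", ("D", "E")))
leaves = simulate(ancestor, true_tree, 0.1, rng)
```

Feeding `leaves` to a reconstruction method and comparing the result with `true_tree` is the "plug the evolutions back" test described above.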

17
Aspects of Trees
  • Topology
  • The method in which nodes are connected to each
    other
  • Are we really connected to apes directly, or
    just linked long before we could be considered
    mammals?
  • Distance
  • The sum of the weighted edges to reach one node
    from another

18
What can distance tell us?
  • The distance between nodes IS the evolutionary
    distance between the nodes
  • The distance between an ancestor and a
    leaf (present-day object) can be interpreted as an
    estimate of the number of evolutionary steps
    that occurred.

19
Current Techniques
  • Maximum Parsimony
  • Minimize the total number of evolutionary events
  • Find the tree that has a minimum number of
    changes from ancestors
  • Maximum Likelihood
  • Probability based
  • Which tree is most probable to occur based on
    current data

20
More Techniques
  • Neighbor Joining
  • Repeatedly joins pairs of leaves (or subtrees) by
    rules of numerical optimization
  • It shrinks the distance matrix by considering two
    neighbors as one node

21
Learning Neighbor Joining
  • It will become apparent later on, but let's learn
    how to do Neighbor Joining (NJ)

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
22
NJ Part 1
  • First start with a star tree

[Figure: star tree with leaves A, B, C, D, E]
23
NJ Part 2
  • Combine the closest two nodes (from distance
    matrix)
  • In our case it is node A and B at distance 3

[Figure: tree on A, B, C, D, E after joining A and B]
24
NJ Part 3
  • Repeat this until you have added n-2 nodes (3)
  • Adding n-2 internal nodes makes it a binary tree, so
    we only have to include one more node.

[Figure: finished binary tree on A, B, C, D, E]
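The joining steps in NJ Parts 1 to 3 can be sketched in Python. This is a minimal sketch of the standard neighbor-joining procedure, not the presenter's code. It assumes the distances arrive as a symmetric dict keyed by pairs of names. Note that standard NJ picks the pair to join by the Q criterion, which corrects each distance by the row sums, rather than by the raw smallest distance that the slides use as a simplification. The internal node names `u0`, `u1`, ... and the example distances are hypothetical.

```python
def neighbor_joining(names, d):
    """names: list of taxon labels; d: dict with d[x, y] == d[y, x] for x != y.
    Returns a list of (node, node, branch_length) edges of an unrooted tree."""
    d = dict(d)          # work on a copy; new internal nodes get added to it
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {x: sum(d[x, y] for y in nodes if y != x) for x in nodes}
        # Choose the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j).
        i, j = min(
            ((x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]),
            key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]],
        )
        u = f"u{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        li = 0.5 * d[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i, j] - li
        edges += [(i, u, li), (j, u, lj)]
        # Shrink the matrix: i and j are now treated as the single node u.
        for k in nodes:
            if k not in (i, j):
                d[u, k] = d[k, u] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a, b]))
    return edges


# Hypothetical additive distances for the tree ((A:1, B:2):3, C:4, D:5).
upper = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
dist = {**upper, **{(y, x): v for (x, y), v in upper.items()}}
tree_edges = neighbor_joining(["A", "B", "C", "D"], dist)
```

With four taxa the loop creates n-2 = 2 internal nodes, matching the count on the slide.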
25
Are we done?
  • ML and MP, even in heuristic form, take too long
    for large data sets
  • NJ has poor topological accuracy, especially for
    large diameter trees
  • We need something that works for large diameter
    trees and can be run fast.

26
Here's what we want
  • Our Goal
  • An Absolute Fast Converging Method
  • Φ is afc if, for all positive f, g, ε, on the
    model M, there is a polynomial p such that, for
    all (T, {λ(e)}) in the set M_{f,g}, on a set S of n
    sequences of length at least p(n) generated on T,
    we have Pr[Φ(S) = T] > 1 - ε.
  • Simply: let's make it in polynomial time within a
    degree of error.

27
A DCM-NJ Solution
  • Two-phase construction of a final phylogenetic tree
    given a distance matrix d.
  • Phase 1: Create a set of plausible trees for the
    distance matrix
  • Phase 2: Find the best fitting tree

28
Phase 1
  • For each q in {d_ij}, compute a tree t_q
  • Let T = { t_q : q in {d_ij} }

29
Finding tq
  • Step 1: Compute Thresh(d,q)
  • Step 2: Triangulate Thresh(d,q)
  • Step 3: Compute an NJ tree for each maximal clique
  • Step 4: Merge the subtrees into a supertree

30
What does that mean
  • Breaking the problem up
  • Create a threshold of diameters to break the
    problem into
  • A bunch of smaller diameter trees (cliques)
  • Apply NJ to those cliques
  • Merge them back

31
Finding tq (terms)
  • Threshold Graph
  • Thresh(d,q) is the threshold graph where (i,j) is
    an edge if and only if d_ij ≤ q.
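A threshold graph is straightforward to build from the matrix. The sketch below assumes the comparison is inclusive (d_ij <= q), which is what the example with q = d15 on the next slide implies, since the edge of length 67 itself appears in that graph. The function names and the four-taxon distances are hypothetical and are not the matrix from the slides.

```python
def threshold_graph(names, d, q):
    """Adjacency sets of Thresh(d, q): x and y are joined when d[x, y] <= q."""
    adj = {x: set() for x in names}
    for a, x in enumerate(names):
        for y in names[a + 1:]:
            if d[x, y] <= q:
                adj[x].add(y)
                adj[y].add(x)
    return adj


def candidate_thresholds(names, d):
    """Every distinct off-diagonal distance, ascending: the q values of Phase 1."""
    return sorted({d[x, y] for a, x in enumerate(names) for y in names[a + 1:]})


# Hypothetical distances, chosen only to illustrate the construction.
names = ["P", "Q", "R", "S"]
upper = {("P", "Q"): 2, ("P", "R"): 7, ("P", "S"): 9,
         ("Q", "R"): 6, ("Q", "S"): 8, ("R", "S"): 3}
d = {**upper, **{(y, x): v for (x, y), v in upper.items()}}

g = threshold_graph(names, d, q=6)
qs = candidate_thresholds(names, d)
```

Phase 1 would call `threshold_graph` once for each value in `qs`, producing one candidate tree per threshold.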

32
Threshold
  • Let's bring back our distance matrix and create a
    threshold with q equal to d15, the distance
    between A and E
  • So q = 67

33
Distance Matrix
  • Our old example matrix

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E]
34
With q = d15 = 67
[Figure: threshold graph on nodes A, B, C, D, E with edge labels 47, 67, 63, and 16]
35
Triangulating
  • A graph is triangulated if any cycle with four
    or more vertices has a chord
  • That is, an edge joining two nonconsecutive
    vertices of the cycle.
  • Our example is already triangulated, but let's
    look at another

36
Triangulating
Let's say this is for q = 5.
Edges of length 10 and 15 would not be in the graph.
To triangulate this graph you add the edge of
length 10.
[Figure: example graph with candidate edges of length 10 and 15]
37
Maximal Cliques
  • A clique that cannot be enlarged by the addition
    of another vertex.
  • Recall our original threshold graph which is
    triangulated

38
Triangulated Threshold Graph
  • Our old graph

[Figure: threshold graph on nodes A, B, C, D, E with edge labels 47, 67, 63, and 16]
39
Clique
  • Our maximal cliques would be
  • A, B, E
  • C, D

40
Create Trees for the Cliques
  • We have two maximal cliques, so we make two
    trees A, B, E and C, D
  • How do we make these trees?
  • Remember NJ?

41
Trees {A, B, E} and {C, D}
[Figure: one tree on A, B, E and one tree on C, D]
42
Merge your separate trees together.
  • Create one Supertree
  • This is done by creating a minimum set of edges
    in the trees and calling that the backbone
  • This is its own doctoral thesis, so let's do a
    little hand waving

43
That sounds like NP-hard!
  • Computing the threshold graph is polynomial
  • Minimally triangulating is NP-hard, but a triangulation
    can be obtained in polynomial time using a greedy
    heuristic without too much loss in performance.
  • Finding maximal cliques is polynomial only if the
    input graph is triangulated (which it is!).
  • If all previous steps are done, creating a supertree
    can be done in polynomial time as well.
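One way to see why triangulation and clique-finding are cheap is the elimination game. The slides do not say which greedy heuristic is used, so the sketch below assumes a minimum-degree rule: repeatedly remove the vertex with the fewest remaining neighbors and connect those neighbors to each other. The added edges triangulate the graph, and the maximal cliques are among the "vertex plus remaining neighbors" sets seen along the way. The function name and the four-cycle example are hypothetical.

```python
def triangulate_and_cliques(adj):
    """Greedy minimum-degree elimination on an undirected graph.
    adj maps each vertex to the set of its neighbours.
    Returns (fill_edges, maximal_cliques_of_the_triangulated_graph)."""
    g = {v: set(ns) for v, ns in adj.items()}  # working copy
    fill, candidates = [], []
    while g:
        # Pick the vertex of smallest degree (ties broken by name).
        v = min(g, key=lambda x: (len(g[x]), x))
        nbrs = g.pop(v)
        # Make the remaining neighbours of v pairwise adjacent.
        for a in nbrs:
            for b in nbrs:
                if a < b and b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill.append((a, b))
        for a in nbrs:
            g[a].discard(v)
        candidates.append(frozenset(nbrs | {v}))
    # Keep only candidates that are not strictly inside another candidate.
    cliques = [c for c in candidates if not any(c < o for o in candidates)]
    return fill, cliques


# A four-cycle a-b-c-d-a has no chord, so one fill edge is needed.
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
fill, cliques = triangulate_and_cliques(square)
```

Each loop iteration does polynomial work, so the whole pass is polynomial, though the fill it produces is not guaranteed to be minimal.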

44
Where are we now?
  • We now have a finalized phylogeny created
    from smaller trees in our matrix joined together
  • Remember we started from all possible sizes of
    smaller trees.

45
Phase 2
  • Which one is right?
  • Found using the SQS (Short Quartet Support)
    method
  • Let T be a tree in S (made from Phase 1)
  • Break the data into sets of four taxa
  • {A, B, C, D}, {A, C, D, E}, {A, B, D, E}, etc.
  • Reduce the larger tree to only hold one set
  • These are called Quartets

46
SQS - A Guide
  • Q(T) is the set of trees induced by T on each set
    of four leaves.
  • Let Qw (different Q) be a set of quartets with
    diameter less than or equal to w
  • Find the maximum w where the quartets are
    inclusive of the nodes of the tree
  • This w is the support of that tree

47
SQS - Rephrased
  • Qw is the set of quartet trees which have a
    diameter ≤ w
  • Support of T is the max w where Qw is a subset of
    Q(T)
  • Support is our quality measure
  • What exactly are we measuring?
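The support computation can be sketched as below. The slides define Qw and support but do not say how each quartet tree is inferred from the input distances; this sketch assumes the four-point rule (the pairing with the smallest sum of distances is taken as the split). Here Q(T) is obtained by applying the same rule to the path distances of the candidate tree T. All function names and the example distances are hypothetical.

```python
from itertools import combinations


def quartet_split(d, a, b, c, e):
    """Four-point rule: return the pairing with the smallest distance sum."""
    sums = {
        frozenset((frozenset((a, b)), frozenset((c, e)))): d[a, b] + d[c, e],
        frozenset((frozenset((a, c)), frozenset((b, e)))): d[a, c] + d[b, e],
        frozenset((frozenset((a, e)), frozenset((b, c)))): d[a, e] + d[b, c],
    }
    return min(sums, key=sums.get)


def short_quartets(taxa, d, w):
    """Q_w: the split of every quartet whose largest pairwise distance is <= w."""
    result = {}
    for quad in combinations(taxa, 4):
        if max(d[x, y] for x, y in combinations(quad, 2)) <= w:
            result[frozenset(quad)] = quartet_split(d, *quad)
    return result


def support(taxa, d, tree_quartets):
    """Largest input distance w such that Q_w agrees with tree_quartets."""
    best = None
    for w in sorted({d[x, y] for x, y in combinations(taxa, 2)}):
        qw = short_quartets(taxa, d, w)
        if all(tree_quartets.get(k) == v for k, v in qw.items()):
            best = w
        else:
            break  # Q_w only grows with w, so larger w cannot succeed
    return best


def symmetric(upper):
    return {**upper, **{(y, x): v for (x, y), v in upper.items()}}


taxa = ["A", "B", "C", "D", "E"]
# Path distances on a candidate tree ((A,B),C,(D,E)) with unit edge lengths.
tree_d = symmetric({("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
                    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 4,
                    ("C", "D"): 3, ("C", "E"): 3, ("D", "E"): 2})
q_of_t = short_quartets(taxa, tree_d, float("inf"))  # Q(T)

# Hypothetical input distances that disagree with the tree on one pair.
noisy = dict(tree_d)
noisy["A", "D"] = noisy["D", "A"] = 1

s_clean = support(taxa, tree_d, q_of_t)
s_noisy = support(taxa, noisy, q_of_t)
```

A higher support means the tree agrees with the input on quartets of larger diameter, which is the quantity Phase 2 maximizes.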

48
Qw
[Figure: quartet trees on four-leaf subsets of A, B, C, D, E]
49
SQS Method
  • Return the tree in which the support of that tree
    is the maximum.
  • If more than one such tree exists return the tree
    found first.
  • This is the tree with the smallest original
    diameter (remember from phase 1)

50
How do we know we're right?
  • Compare it to the data set we created
  • Look at Robinson-Foulds accuracy
  • Remove one edge in the tree we've created.
  • We now have two trees
  • Is there any way to create the same two sets of leaves
    by removing one edge in the true tree from our data set?
  • If no, add a point of error.
  • Repeat this for all edges
  • When the value is not zero, the trees are not
    identical
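The edge-removal test above can be sketched by comparing the leaf splits of the two trees. This sketch assumes unrooted trees given as lists of (node, node) edges over the same leaf set. `false_positive_edges` follows the one-sided count described on the slide, while `rf_distance` is the usual symmetric version. The function names, the internal node names, and the two example trees are hypothetical.

```python
def bipartitions(edges, leaves):
    """Non-trivial leaf splits obtained by removing each edge in turn.
    Each split is stored as the side that does not contain the smallest leaf."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    leaves = frozenset(leaves)
    anchor = min(leaves)
    splits = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing edge (u, v).
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and not (x == u and y == v):
                    seen.add(y)
                    stack.append(y)
        side = frozenset(seen & leaves)
        if anchor in side:
            side = leaves - side
        # Skip splits that only cut off a single leaf.
        if 1 < len(side) < len(leaves) - 1:
            splits.add(side)
    return splits


def false_positive_edges(inferred, true, leaves):
    """Edges of the inferred tree whose split does not occur in the true tree."""
    return len(bipartitions(inferred, leaves) - bipartitions(true, leaves))


def rf_distance(t1, t2, leaves):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(t1, leaves) ^ bipartitions(t2, leaves))


leaves = ["A", "B", "C", "D", "E"]
true_tree = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"),
             ("v", "w"), ("D", "w"), ("E", "w")]          # ((A,B),C,(D,E))
inferred = [("A", "u"), ("C", "u"), ("u", "v"), ("B", "v"),
            ("v", "w"), ("D", "w"), ("E", "w")]           # ((A,C),B,(D,E))

errors = false_positive_edges(inferred, true_tree, leaves)
rf = rf_distance(inferred, true_tree, leaves)
```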

51
Performance of DCM-NJ
  • Outperforms NJ method at sequence lengths above
    4000 and with more taxa.

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ]
52
Improvements
  • Improvement possibilities lie in Phase 2
  • Include test of Maximum Parsimony (MP)
  • Try and minimize the overall size of the tree
  • Test using statistical evidence
  • Maximum Likelihood (ML)

53
Performance gains
  • Simply changing Phase 2 has massive gains in
    accuracy!
  • DCM-NJ+MP and DCM-NJ+ML are VERY accurate
    for data sets greater than 4000 and are NOT
    NP-hard.
  • DCM-NJ+MP finished its analysis on a 107
    taxon tree in under three minutes.

54
Comparing Improvements
[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, HGT-FP, DCM-NJ+SQS, and DCM-NJ+MP]