Title: PHYLOGENY RECONSTRUCTION
1PHYLOGENY RECONSTRUCTION FROM QUARTETS
Gang Wu
Department of Computing Science, University of Alberta
2Outline
- Introduction
- Research Methods
- Computational Results and Analysis
3Common Phylogenetic Tree Terminology
Phylogeny: the pattern of historical relationships among species.
Tree: a mathematical structure used to depict the evolutionary history of a group of species.
Leaf nodes: represent the species (genes, populations, etc.) used to infer the phylogeny.
Internal nodes: represent hypothetical ancestors of the species.
Root of the tree: the common ancestor of all species.
[Figure: example tree on leaves A, B, C, D, E with the root, internal nodes, leaf nodes, and branches (edges) labeled]
4Phylogeny Example for Mammal
5Rooted and Unrooted tree
[Figure: an unrooted tree on leaves A, B, C, D, shown alongside rooted trees on the same leaves with the root positions marked]
6General Process of Phylogeny Construction
Input: a set of (DNA or protein) sequences for the species.
Output: an evolutionary tree (phylogeny) whose leaf nodes are the input species.
Methods: Maximum Parsimony (MP), Maximum Likelihood (ML), etc.
Solved exactly, these methods are not suitable for large trees (over 20 species). Current software relies on heuristics to reduce the computation time.
7Quartet Based Phylogeny Construction
- There is only one unrooted tree for one, two, or three species.
- There are three possible unrooted trees for four species (A, B, C, D).
- Quartets are the smallest informative unrooted trees.
- MP or ML can be solved exactly on quartets.
The three quartet topologies: AB|CD, AC|BD, AD|BC
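The three topologies can be enumerated mechanically. The sketch below is only an illustration: the function names `quartet_topologies` and `show` are made up for this example, and each topology is stored as a pair of two-element sets.

```python
def quartet_topologies(a, b, c, d):
    """Return the three possible unrooted topologies on four species."""
    return [
        (frozenset((a, b)), frozenset((c, d))),  # a and b on one side
        (frozenset((a, c)), frozenset((b, d))),  # a and c on one side
        (frozenset((a, d)), frozenset((b, c))),  # a and d on one side
    ]

def show(topology):
    """Format a topology in the 'AB|CD' style used on the slide."""
    left, right = topology
    return "".join(sorted(left)) + "|" + "".join(sorted(right))

names = [show(t) for t in quartet_topologies("A", "B", "C", "D")]
print(names)
```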
8Process of Quartet Based Phylogeny Construction
9Definitions
A quartet ab|cd is consistent with a phylogeny T (equivalently, T satisfies ab|cd) if and only if a, b, c, d are all leaves of T and the path from a to b does not share any node with the path from c to d.
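The definition translates directly into a path test. The sketch below assumes the tree is given as an adjacency map built from an edge list. The example tree (three cherries a,b / c,d / e,f joined at a center node x) and the node names u, v, w, x are hypothetical, chosen only for illustration.

```python
def build_tree(edges):
    """Build an undirected adjacency map from a list of edges."""
    tree = {}
    for x, y in edges:
        tree.setdefault(x, set()).add(y)
        tree.setdefault(y, set()).add(x)
    return tree

def find_path(tree, start, goal):
    """Return the unique path between two nodes of a tree (iterative DFS)."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in tree[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None

def is_consistent(tree, a, b, c, d):
    """True if quartet ab|cd is consistent with the tree."""
    leaves = {n for n, nbrs in tree.items() if len(nbrs) == 1}
    if len({a, b, c, d}) < 4 or not {a, b, c, d} <= leaves:
        return False
    # The a-b path and the c-d path must have no node in common.
    return not set(find_path(tree, a, b)) & set(find_path(tree, c, d))

# Hypothetical tree: cherries (a,b), (c,d), (e,f) attached to a center x.
tree = build_tree([("a", "u"), ("b", "u"), ("c", "v"), ("d", "v"),
                   ("e", "w"), ("f", "w"), ("u", "x"), ("v", "x"), ("w", "x")])
print(is_consistent(tree, "a", "e", "c", "d"))  # ae|cd
print(is_consistent(tree, "a", "c", "e", "d"))  # ac|ed
```

On this assumed tree the first call reports a consistent quartet and the second an inconsistent one, because the a-c and e-d paths both pass through x and v.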
10Quartet Set Q
ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
[Figure: Phylogeny T]
Quartet ae|cd is consistent with T, or T satisfies ae|cd.
11Definitions
Given a set of quartets Q on a set S of species,
Q is compatible, if and only if there is a
phylogeny on S which satisfies all the quartets
in Q.
A set Q of quartet topologies is complete if Q contains a quartet topology for every set of four labels from the label set S.
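Completeness is easy to test mechanically, whereas compatibility needs a search over trees. The sketch below checks completeness only. It assumes that the four-letter quartet strings on the slides are read as two pairs and writes them with an explicit `|`. The helper names `parse` and `is_complete` are made up for this example, and "contains a quartet topology" is interpreted as "at least one".

```python
from itertools import combinations
from math import comb

def parse(text):
    """Turn 'ae|cd' into (('a', 'e'), ('c', 'd'))."""
    left, right = text.split("|")
    return tuple(left), tuple(right)

def is_complete(quartets, species):
    """True if every four-species subset has at least one quartet in the set."""
    covered = {frozenset(left) | frozenset(right) for left, right in quartets}
    needed = {frozenset(s) for s in combinations(species, 4)}
    return needed <= covered

# The fifteen quartets on six species a..f from the slides (assumed reading).
Q = [parse(q) for q in ("ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
                        "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()]

print(len(Q), comb(6, 4), is_complete(Q, "abcdef"))
```

A complete set on n species therefore has at least C(n, 4) quartets, which is 15 for n = 6.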
12Quartet Set Q
ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
[Figure: Phylogeny T]
- The quartet set Q is compatible
- The quartet set Q is complete
13Problem Descriptions
In practice, the given quartet set Q usually contains errors and is thus incompatible.
Quartet Compatibility Problem (QCP)
Input: a set Q of quartets on S.
Question: is Q compatible? Equivalently, is there a phylogeny T on S such that all quartets in Q are satisfied?
Maximum Quartet Consistency Problem (MQC)
Input: a set Q of quartets on S.
Goal: find a phylogeny T on S such that the number of consistent quartets in Q is maximized.
Minimum Quartet Inconsistency Problem (MQI)
Input: a set Q of quartets on S.
Goal: find a phylogeny T on S such that the number of inconsistent quartets in Q is minimized.
14Input Quartet Set Q
ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
Quartet Compatibility Problem (QCP)? No.
MQC or MQI? Only ac|ed is not satisfied.
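The MQC and MQI objectives for a fixed candidate tree are just counts. The sketch below scores this input set against one candidate. The tree is an assumption (cherries a,b / c,d / e,f joined at a center), since the slide's figure is not in the transcript. It is represented by its nontrivial splits, and the quartet strings are again read as two pairs. A quartet xy|zw is satisfied when some split puts x, y on one side and z, w on the other.

```python
def parse(text):
    """Turn 'ac|ed' into (('a', 'c'), ('e', 'd'))."""
    left, right = text.split("|")
    return tuple(left), tuple(right)

def satisfies(splits, leaves, quartet):
    """True if some split of the tree separates the two pairs of the quartet."""
    (a, b), (c, d) = quartet
    for side in splits:
        other = leaves - side
        if ({a, b} <= side and {c, d} <= other) or \
           ({a, b} <= other and {c, d} <= side):
            return True
    return False

leaves = set("abcdef")
# Assumed candidate tree, given by one side of each nontrivial split.
splits = [set("ab"), set("cd"), set("ef")]

Q = [parse(q) for q in ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
                        "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()]

violated = [q for q in Q if not satisfies(splits, leaves, q)]
print("consistent (MQC objective):", len(Q) - len(violated))
print("inconsistent (MQI objective):", len(violated), violated)
```

This only evaluates one tree. Solving MQC or MQI means maximizing or minimizing these counts over all phylogenies on S.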
15Known Results
The Quartet Compatibility Problem (QCP) can be solved in polynomial time if the given quartet set Q is complete, but it is NP-complete if Q is incomplete.
The Maximum Quartet Consistency Problem (MQC) and the Minimum Quartet Inconsistency Problem (MQI) are NP-complete even if Q is complete.
Exact algorithms: guaranteed to find the optimal ("best") tree.
Heuristic algorithms: approximate or quick-and-dirty methods that attempt to find the optimal tree but cannot guarantee to do so.
16Known Results
Many heuristics exist. The best known approximation algorithm is Quartet Cleaning, with an approximation ratio of [formula not transcribed] for MQI, where n is the number of species.
There are only two exact algorithms in the literature.
Dynamic programming has a complexity of [formula not transcribed], where m is the number of input quartets and n is the number of input species. It is a general algorithm.
The fixed-parameter algorithm has a complexity of [formula not transcribed], where k is the largest number of quartet errors and n is the number of input species. It is good if k is very small compared to the total number of quartets, and worse than dynamic programming if k is relatively large.
Dynamic programming can solve the MQC problem with 20 species in 6 days on a 300 MHz computer. The fixed-parameter algorithm can solve the MQI problem with 50 species when k = 100 in 40 minutes on a 750 MHz computer.
17Research Objectives
- Exact algorithm for MQC
- Quartet set Q is complete
- Faster
- Can solve problem with more species
18Ultrametric Tree and Matrix
Ultrametric Tree: We label each internal node with a number. If, along any root-to-leaf path, the labels of the internal nodes on the path are strictly decreasing, then the tree with its labels is called an ultrametric tree.
Ultrametric Matrix: Each entry is the label of the least common ancestor of the two leaf nodes. It is
- symmetric, with M(i, i) = 0, and
- for every triplet (i, j, k), two of the values M(i, j), M(j, k), and M(i, k) are equal and greater than the third.
e.g., i = 1, j = 3, k = 4: M(1, 3) = M(3, 4) > M(1, 4)
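A small sketch may help connect the two definitions. It assumes a rooted tree written as nested tuples `(label, child, child)` with leaf names as strings. This representation, the example tree, and the function names are all illustrative choices rather than anything from the slides. The matrix is stored as a dictionary keyed by unordered leaf pairs, so symmetry is built in and the zero diagonal is left implicit.

```python
from itertools import combinations

def leaves_of(node):
    """List the leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, *children = node
    return [leaf for child in children for leaf in leaves_of(child)]

def ultrametric_matrix(node, M=None):
    """M[{x, y}] = label of the least common ancestor of leaves x and y."""
    if M is None:
        M = {}
    if isinstance(node, str):
        return M
    label, *children = node
    # Leaves in different child subtrees meet exactly at this node.
    for left, right in combinations(children, 2):
        for x in leaves_of(left):
            for y in leaves_of(right):
                M[frozenset((x, y))] = label
    for child in children:
        ultrametric_matrix(child, M)
    return M

def is_ultrametric(M, species):
    """Check the triplet condition: two equal values, both greater than the third."""
    for i, j, k in combinations(species, 3):
        lo, mid, hi = sorted((M[frozenset((i, j))],
                              M[frozenset((j, k))],
                              M[frozenset((i, k))]))
        if not (mid == hi and hi > lo):
            return False
    return True

# Hypothetical labeled tree: labels decrease strictly from the root (3) down.
tree = (3, (2, (1, "a", "b"), "c"), "d")
M = ultrametric_matrix(tree)
print(M[frozenset("ab")], M[frozenset("ac")], M[frozenset("cd")])
print(is_ultrametric(M, "abcd"))
```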
19Theorem 1: A quartet ab|cd is consistent with a phylogeny T if and only if any ultrametric labeling scheme M of T satisfies min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
20Theorem 1: A quartet ab|cd is consistent with a phylogeny T if and only if any ultrametric labeling scheme M of T satisfies min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
Quartet s1 s5 | s2 s3 is consistent with the tree and its corresponding matrix: min{M(1, 2), M(5, 3)} = 4 > min{M(1, 5), M(2, 3)} = 1. Condition satisfied!
21Theorem 1: A quartet ab|cd is consistent with a phylogeny T if and only if any ultrametric labeling scheme M of T satisfies min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
Quartet s1 s4 | s2 s5 is NOT consistent with the tree and its corresponding matrix: min{M(1, 2), M(4, 5)} = min{M(1, 4), M(2, 5)} = 3. Condition not satisfied!
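The condition in Theorem 1 is a one-line test on the matrix. The slides' own tree and matrix are not in the transcript, so the sketch below uses a hypothetical four-leaf ultrametric tree, (((a, b), c), d) with internal labels 1, 2, 3 from the bottom up, and writes its matrix by hand.

```python
def m(M, x, y):
    """Look up a symmetric matrix entry stored under an unordered pair."""
    return M[frozenset((x, y))]

def satisfies(M, a, b, c, d):
    """Theorem 1 test for quartet ab|cd on an ultrametric matrix M."""
    return min(m(M, a, c), m(M, b, d)) > min(m(M, a, b), m(M, c, d))

# Hypothetical matrix for the tree (((a, b), c), d) with labels 1 < 2 < 3.
M = {
    frozenset("ab"): 1,
    frozenset("ac"): 2, frozenset("bc"): 2,
    frozenset("ad"): 3, frozenset("bd"): 3, frozenset("cd"): 3,
}

print(satisfies(M, "a", "b", "c", "d"))  # ab|cd
print(satisfies(M, "a", "c", "b", "d"))  # ac|bd
print(satisfies(M, "a", "d", "b", "c"))  # ad|bc
```

Only ab|cd passes here: min{M(a, c), M(b, d)} = 2 exceeds min{M(a, b), M(c, d)} = 1, while the other two topologies give 1 on the left against 2 on the right.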
22Theorem 2: Given a set Q of quartets on a set of species S and an ultrametric phylogeny T on S, T satisfies the maximum number of quartets in Q if and only if the corresponding ultrametric matrix M on S satisfies the maximum number of quartets in Q.
We transform the original MQC problem into an ultrametric matrix search problem.
23(No Transcript)
24Formulation in Answer Set Programming
Domain
1 { m(1, 2, 1), m(1, 2, 2), m(1, 2, 3), m(1, 2, 4), m(1, 2, 5) } 1
Matrix entry (1, 2) takes exactly one value in the domain [1, 5].
Ultrametric Constraints
For any three matrix values m(i, j), m(j, k), and m(i, k), two of them are equal and greater than the third.
Quartet Constraints
If min{m(i, k), m(j, l)} > min{m(i, j), m(k, l)}, then quartet i,j|k,l is satisfied.
Objective
Maximize the number of satisfied quartets q(i, j, k, l).
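The four ingredients (domain, ultrametric constraints, quartet constraints, objective) can be mimicked on a toy instance without an answer set solver. The sketch below is not the presented ASP encoding. It swaps in a plain backtracking search over matrix entries, assumes a domain of 1 to n - 1 for n species, and uses a made-up five-species quartet set. All function names are illustrative.

```python
from itertools import combinations

def parse(text):
    left, right = text.split("|")
    return tuple(left), tuple(right)

def three_point_ok(x, y, z):
    """Ultrametric constraint: two equal values, both greater than the third."""
    lo, mid, hi = sorted((x, y, z))
    return mid == hi and hi > lo

def quartet_ok(M, quartet):
    """Quartet constraint from Theorem 1 for quartet ab|cd."""
    (a, b), (c, d) = quartet
    m = lambda x, y: M[frozenset((x, y))]
    return min(m(a, c), m(b, d)) > min(m(a, b), m(c, d))

def best_matrix(species, quartets):
    """Search all ultrametric matrices and keep one satisfying most quartets."""
    pairs = [frozenset(p) for p in combinations(species, 2)]
    triples = [tuple(frozenset(p) for p in combinations(t, 2))
               for t in combinations(species, 3)]
    M, best = {}, {"score": -1, "matrix": None}

    def partial_ok():
        # Check only triples whose three entries are already assigned.
        return all(three_point_ok(*(M[k] for k in keys))
                   for keys in triples if all(k in M for k in keys))

    def extend(idx):
        if idx == len(pairs):
            score = sum(quartet_ok(M, q) for q in quartets)  # objective
            if score > best["score"]:
                best["score"], best["matrix"] = score, dict(M)
            return
        for value in range(1, len(species)):  # domain: one value per entry
            M[pairs[idx]] = value
            if partial_ok():
                extend(idx + 1)
            del M[pairs[idx]]

    extend(0)
    return best["score"], best["matrix"]

# Made-up instances on five species: one compatible set, one with an error.
clean = [parse(q) for q in "ab|cd ab|ce ab|de ac|de bc|de".split()]
noisy = [parse(q) for q in "ac|bd ab|ce ab|de ac|de bc|de".split()]
clean_score, _ = best_matrix("abcde", clean)
noisy_score, _ = best_matrix("abcde", noisy)
print(clean_score, noisy_score)
```

Exhaustive search like this is only feasible for a handful of species. The point of the declarative formulation is to hand the same constraints to a solver.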
25Optimizations
26Experiment Results
n: number of species
p: percentage of quartet errors
27Phylogenetic Analysis on Prokaryote Dataset
- 20 species, 4845 quartets in total.
- One phylogeny is generated by PHYLIP using Neighbor Joining. It satisfies 3968 quartets.
- The other is generated by our program. It satisfies 3984 quartets and is more accurate with respect to Bergey's code.
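The figures above can be cross-checked with a few lines of arithmetic. The total of 4845 matches the number of four-species subsets of 20 species, and the two satisfied counts can be turned into fractions. The variable names below are illustrative.

```python
from math import comb

total = comb(20, 4)          # one quartet per four-species subset
nj_satisfied = 3968          # Neighbor Joining tree (from the slide)
quartet_satisfied = 3984     # quartet-based tree (from the slide)

print(total)
print(f"Neighbor Joining: {nj_satisfied / total:.1%}")
print(f"Quartet-based:    {quartet_satisfied / total:.1%}")
print("difference:", quartet_satisfied - nj_satisfied)
```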
28Phylogenetic Analysis of SARS
- Severe Acute Respiratory Syndrome (SARS) is caused by a virus recognized as a coronavirus (SARS-Cov).
- The coronaviruses are currently divided into three groups.
- The representative viruses from each group are shown as follows: (No Transcript)
29Phylogeny Construction Procedure
- Get the whole genome data and protein data for each virus from NCBI.
- Compute a distance matrix M for these viruses using a measure proposed by Xiaomeng.
- Use the quartet-based algorithm to generate a phylogeny from M.
- Use the Neighbor Joining algorithm in the PHYLIP package to generate another phylogeny from M.
- Compute the average distance between SARS-Cov and the Group 1 (D1), Group 2 (D2), and Group 3 (D3) viruses, respectively.
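The last step of the procedure is a plain average over each group. The sketch below shows one way to compute D1, D2, and D3 from a distance matrix. The virus names v1 to v6 and all distance values are placeholders invented for the example, not data from the study, and the distance measure itself is not reproduced here.

```python
def average_distance(dist, target, group):
    """Mean distance from the target virus to the members of one group."""
    return sum(dist[frozenset((target, v))] for v in group) / len(group)

# Placeholder groups and distances (illustrative values only).
groups = {
    "D1": ["v1", "v2"],  # stands in for Group 1 viruses
    "D2": ["v3", "v4"],  # stands in for Group 2 viruses
    "D3": ["v5", "v6"],  # stands in for Group 3 viruses
}
dist = {
    frozenset(("SARS-Cov", "v1")): 4.0, frozenset(("SARS-Cov", "v2")): 6.0,
    frozenset(("SARS-Cov", "v3")): 3.0, frozenset(("SARS-Cov", "v4")): 3.0,
    frozenset(("SARS-Cov", "v5")): 2.0, frozenset(("SARS-Cov", "v6")): 4.0,
}

averages = {name: average_distance(dist, "SARS-Cov", members)
            for name, members in groups.items()}
print(averages)
```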
30Phylogeny on Protein Data without Outgroup
Both Neighbor Joining and Quartet-based methods
generate the same phylogeny
D1 = 466.3, D2 = 459.8, D3 = 460.6
31Phylogeny on Protein Data with Outgroup
By Neighbor Joining, the relation of SARS-Cov to
Group 2 and Group 3 varies from tree to tree. The
following is a phylogeny, where SARS-Cov lies in
Group 3.
D1 = 464.4, D2 = 460.3, D3 = 459
32Phylogeny on Protein Data with Outgroup
By Neighbor Joining, the relation of SARS-Cov to
Group 2 and Group 3 varies from tree to tree.
The following is another tree where SARS-Cov lies
in Group 2.
D1 = 464.5, D2 = 458.8, D3 = 459.7
33Phylogeny on Protein Data with Outgroup
The quartet-based method produces a consistent phylogeny across various outgroups.
D1 = 464.4, D2 = 460.3, D3 = 459
34Phylogeny on Genome Data with Outgroup
Both Neighbor Joining and Quartet-based methods
generate the same phylogeny
D1 = 457.4, D2 = 456.1, D3 = 455.2
35Summary
- Our phylogeny construction method can successfully identify the three groups of coronaviruses.
- SARS-Cov is closer to Groups 2 and 3 than to Group 1. The average distances from SARS-Cov to the Group 2 and Group 3 viruses are approximately the same.
- Based on whole protein data, our quartet-based method consistently generates the same phylogeny with various outgroups. This phylogeny suggests that SARS-Cov more likely lies in Group 3.