Title: PHYLOGENY RECONSTRUCTION
1PHYLOGENY RECONSTRUCTION FROM QUARTETS
Jia-Huai You, Department of Computing Science,
University of Alberta
2Outline
- Introduction
- Research Methods
- Computational Results and Analysis
3Common Evolutionary Tree Terminology
Phylogeny: pattern of historical relationships among species.
Tree: mathematical structure used to depict the evolutionary history of a group of species.
Leaf nodes: represent the species (genes, populations, etc.) used to infer the phylogeny.
Branches or edges
Internal nodes: represent hypothetical ancestors of the species.
Root of the tree: common ancestor of all species.
[Figure: example tree with leaves A, B, C, D and E]
4Phylogeny Example for Mammal
5Rooted and Unrooted tree
6General Process of Phylogeny Construction
Input: A set of (DNA or protein) sequences for the species.
Output: An evolutionary tree (phylogeny) whose leaf nodes are the input species.
Methods: Maximum Parsimony (MP), Maximum Likelihood (ML), etc.
These methods are not suitable for large trees (over 20 species). Current software all uses heuristics to reduce the computation time.
7Quartet Based Phylogeny Construction
- There is only one unrooted tree for one, two or three species.
- There are three possible unrooted trees for four species (A, B, C, D): AB|CD, AC|BD and AD|BC.
- Quartets are the smallest informative unrooted trees.
- MP or ML can be solved exactly on quartets.
8Process of Quartet Based Phylogeny Construction
9Definitions
A quartet ab|cd is consistent with a phylogeny T,
or a phylogeny T satisfies a quartet ab|cd, if
and only if a, b, c, d are all leaves of T and the
path from a to b does not share any nodes with
the path from c to d.
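The definition can be checked directly by computing the two paths. Below is a minimal Python sketch, assuming the tree is given as an adjacency dictionary; the function names `path` and `satisfies` and the 6-leaf example tree (internal nodes u, v, w, x) are hypothetical and chosen only for illustration.

```python
from collections import deque

def path(adj, src, dst):
    """Return the unique path from src to dst in a tree (adjacency dict)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    nodes = []
    node = dst
    while node is not None:          # walk back from dst to src
        nodes.append(node)
        node = parent[node]
    return nodes[::-1]

def satisfies(adj, a, b, c, d):
    """True if the tree satisfies the quartet that pairs a with b and c with d."""
    leaves = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    if not {a, b, c, d} <= leaves:
        return False
    # The quartet holds when the a-b path and the c-d path share no node.
    return not set(path(adj, a, b)) & set(path(adj, c, d))

# Hypothetical example tree: leaf pairs (a, b), (c, d), (e, f) hang from
# internal nodes u, v, w, which all meet at a central node x.
edges = [("a", "u"), ("b", "u"), ("c", "v"), ("d", "v"),
         ("e", "w"), ("f", "w"), ("u", "x"), ("v", "x"), ("w", "x")]
adj = {}
for p, q in edges:
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)

print(satisfies(adj, "a", "e", "c", "d"))  # True: a-e and c-d paths are disjoint
print(satisfies(adj, "a", "c", "e", "d"))  # False: both paths pass through x and v
```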
10Quartet Set Q: ae|cd, ab|cd, ab|ce, ab|cf, ab|de, ab|df, ab|ef,
af|cd, ac|ef, ad|ef, be|cd, bf|cd, bc|ef,
bd|ef, cd|ef
[Figure: phylogeny T]
Quartet ae|cd is consistent with T, or T
satisfies ae|cd.
11Definitions
Given a set of quartets Q on a set S of species,
Q is compatible, if and only if there is a
phylogeny on S which satisfies all the quartets
in Q.
A set Q of quartet topologies is complete if Q
contains a quartet topology for every four labels
in the label set S.
12Quartet Set Q: ae|cd, ab|cd, ab|ce, ab|cf, ab|de, ab|df, ab|ef,
af|cd, ac|ef, ad|ef, be|cd, bf|cd, bc|ef,
bd|ef, cd|ef
[Figure: phylogeny T]
- The quartet set Q is compatible
- The quartet set Q is complete
13Problem Descriptions
In practice, the given quartet set Q usually
contains errors and thus is incompatible.
Quartet Compatibility Problem (QCP)
Input: A set Q of quartets on S.
Question: Is Q compatible? Equivalently, is there a phylogeny T
on S such that all quartets in Q are satisfied?
Maximum Quartet Consistency Problem (MQC)
Input: A set Q of quartets on S.
Goal: Find a phylogeny T on S such that the number of consistent
quartets in Q is maximized.
Minimum Quartet Inconsistency Problem (MQI)
Input: A set Q of quartets on S.
Goal: Find a phylogeny T on S such that the number of
inconsistent quartets in Q is minimized.
14Input Quartet Set Q: ac|ed, ab|cd, ab|ce, ab|cf, ab|de, ab|df, ab|ef,
af|cd, ac|ef, ad|ef, be|cd, bf|cd, bc|ef,
bd|ef, cd|ef
Quartet Compatibility Problem (QCP)?
No
MQC or MQI?
Only ac|ed is not satisfied
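The count on this slide can be reproduced mechanically. The following Python sketch assumes that T is the tree in which (a, b), (c, d) and (e, f) are sibling pairs; the figure is not part of the transcript, so this tree is inferred from the quartet list. T is represented by one side of the leaf split induced by each internal edge, and a quartet is satisfied when some edge separates its two pairs. The helper names are hypothetical.

```python
def parse(quartet):
    """Turn a string such as 'ab|cd' into the two leaf pairs."""
    left, right = quartet.split("|")
    return set(left), set(right)

def satisfied(splits, quartet):
    """True if some edge of the tree separates the two pairs of the quartet."""
    left, right = parse(quartet)
    for side in splits:
        if left <= side and not (right & side):
            return True
        if right <= side and not (left & side):
            return True
    return False

# Assumed tree T: one side of each internal edge's split of the leaf set.
T = [set("ab"), set("cd"), set("ef")]

Q = ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
     "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

consistent = [q for q in Q if satisfied(T, q)]
inconsistent = [q for q in Q if not satisfied(T, q)]
print(len(consistent), len(inconsistent), inconsistent)  # 14 1 ['ac|ed']
```

Because every quartet in Q is either consistent or inconsistent with a given tree, the two counts always sum to |Q|, so a tree that is optimal for MQC is also optimal for MQI.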
15Known Results
Quartet Compatibility Problem(QCP) can be solved
in polynomial time if the given quartet set Q is
complete. But it is NP-Complete if Q is
incomplete.
Maximum Quartet Consistency Problem (MQC) and
Minimum Quartet Inconsistency Problem (MQI) are
NP-Complete even if Q is complete.
Exact algorithms: guaranteed to find the
optimal ("best") tree.
Heuristic algorithms: approximate or quick-and-dirty methods that
attempt to find the optimal tree, but cannot
guarantee to do so.
16Known Results
Lots of Heuristics. Best known approximation
algorithm is Quartet Cleaning, with approximation
ratio of for MQI, where n is number of
species
There are only two exact algorithms in
literature. Dynamic programming has the
complexity of , where m is the
number of input quartets and n is number of input
species. It is a general algorithm. Fixed
Parameter Algorithm has the complexity of
, where k is the largest number of
quartet errors and n is the number of input
species. Good if k is very small compared to the
total number of quartets. Worse than Dynamic
Programming if k is relatively large.
Dynamic programming can solve the MQC problem with 20
species in 6 days on a 300 MHz computer. The Fixed
Parameter Algorithm can solve the MQI problem with 50
species when k 100 in 40 minutes on a 750 MHz
computer.
17Research Objectives
- Exact algorithm for MQC
- Quartet set Q is complete
- Faster
- Can solve problem with more species
18Ultrametric Tree and Matrix
Ultrametric Tree: We label each internal node
with a number. If, along every root-to-leaf path,
the labels of the internal nodes on the path are
strictly decreasing, then the tree with its
labels is called an ultrametric tree.
Ultrametric Matrix: Each entry M(i, j) is the label
of the least common ancestor of leaf nodes i and j.
The matrix is
- symmetric, with M(i, i) = 0, and
- for every triplet (i, j, k), two of the values M(i, j), M(j, k) and M(i, k) are equal, and they are
greater than the third value.
e.g. for i = 1, j = 3, k = 4: M(1, 3) = M(3, 4) > M(1, 4)
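The two matrix conditions are easy to test in code. Here is a minimal Python sketch, assuming the matrix is a list of lists indexed from 0; the function name `is_ultrametric` and both example matrices are hypothetical, with the first built by hand from the tree ((0, 1), (2, 3)) with internal labels 1, 2 and root label 3.

```python
from itertools import combinations

def is_ultrametric(M):
    """Check symmetry, zero diagonal, and the triplet condition."""
    n = len(M)
    for i in range(n):
        if M[i][i] != 0:
            return False
        for j in range(n):
            if M[i][j] != M[j][i]:
                return False
    for i, j, k in combinations(range(n), 3):
        x, y, z = sorted((M[i][j], M[j][k], M[i][k]))
        # Two values must be equal and strictly greater than the third.
        if not (x < y == z):
            return False
    return True

# Hypothetical example: leaves 0 and 1 meet at label 1, leaves 2 and 3 at
# label 2, and the two pairs meet at the root with label 3.
good = [[0, 1, 3, 3],
        [1, 0, 3, 3],
        [3, 3, 0, 2],
        [3, 3, 2, 0]]

# Same matrix with entry (0, 2) changed, which breaks triplet (0, 1, 2).
bad = [[0, 1, 2, 3],
       [1, 0, 3, 3],
       [2, 3, 0, 2],
       [3, 3, 2, 0]]

print(is_ultrametric(good))  # True
print(is_ultrametric(bad))   # False
```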
19Theorem 1 A quartet ab|cd is consistent with a
phylogeny T if and only if any ultrametric
labeling scheme M of T satisfies
min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
20Theorem 1 A quartet ab|cd is consistent with a
phylogeny T if and only if any ultrametric
labeling scheme M of T satisfies
min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
s1 s5|s2 s3 is consistent with the tree and
its corresponding matrix: min{M(1, 2), M(5, 3)} = 4 >
min{M(1, 5), M(2, 3)} = 1. Condition
satisfied!
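Theorem 1 turns the consistency test into a comparison of four matrix entries. The Python sketch below applies it to a small hand-built matrix; the matrix (from the tree ((0, 1), (2, 3)) with labels 1, 2 and root label 3) and the function name `quartet_satisfied` are assumptions for illustration, since the slide's own tree and matrix are not in the transcript.

```python
def quartet_satisfied(M, a, b, c, d):
    """Theorem 1 test for the quartet that pairs a with b and c with d."""
    return min(M[a][c], M[b][d]) > min(M[a][b], M[c][d])

# Hypothetical ultrametric matrix for the tree ((0, 1), (2, 3)).
M = [[0, 1, 3, 3],
     [1, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]

print(quartet_satisfied(M, 0, 1, 2, 3))  # True:  01|23, min(3, 3) > min(1, 2)
print(quartet_satisfied(M, 0, 2, 1, 3))  # False: 02|13, min(1, 2) > min(3, 3) fails
print(quartet_satisfied(M, 0, 3, 1, 2))  # False: 03|12, min(1, 2) > min(3, 3) fails
```

Exactly one of the three possible topologies on the four leaves passes the test, which matches the fact that the tree induces a single quartet topology on them.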
21Theorem 2 Given a set Q of quartets on a set of
species S and an ultrametric phylogeny T on S, T
satisfies the maximum number of quartets in Q if
and only if the corresponding ultrametric matrix
M on S satisfies the maximum number of quartets
in Q.
We transform the original MQC problem into an
ultrametric matrix search problem.
23Formulation in Answer Set Programming
Domain
1 { m(1, 2, 1), m(1, 2, 2), m(1, 2, 3), m(1, 2, 4), m(1, 2, 5) } 1
Matrix entry (1, 2) takes exactly one value in the domain [1, 5].
Ultrametric Constraints
For any three matrix values m(i, j), m(j, k) and
m(i, k), two of them are equal and greater than
the third one.
Quartet Constraints
If min{m(i, k), m(j, l)} > min{m(i, j), m(k, l)}, then
quartet ij|kl is satisfied.
Objective
maximize q(i, j, k, l)
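The same search can be illustrated without an answer set solver. The Python sketch below is not the ASP encoding from the slides; it is a plain brute-force enumeration that follows the same four parts (domain, ultrametric constraints, quartet constraints, objective) and is only practical for very small inputs. The function names, the choice of domain 1..n-1 and the 5-species quartet set with one deliberate error are all assumptions made for the example.

```python
from itertools import combinations, product

def ultrametric(M, n):
    """Ultrametric constraint: in every triplet, two entries are equal
    and strictly greater than the third."""
    for i, j, k in combinations(range(n), 3):
        x, y, z = sorted((M[i, j], M[j, k], M[i, k]))
        if not (x < y == z):
            return False
    return True

def entry(M, x, y):
    """Look up a symmetric entry stored under the key (smaller, larger)."""
    return M[min(x, y), max(x, y)]

def satisfied(M, quartet):
    """Quartet constraint for ab|cd, given as the tuple (a, b, c, d)."""
    a, b, c, d = quartet
    return (min(entry(M, a, c), entry(M, b, d))
            > min(entry(M, a, b), entry(M, c, d)))

def best_matrix(n, quartets):
    """Enumerate every assignment of labels 1..n-1 to the matrix entries
    and keep an ultrametric one that satisfies the most quartets."""
    pairs = list(combinations(range(n), 2))
    best, best_score = None, -1
    for values in product(range(1, n), repeat=len(pairs)):  # domain
        M = dict(zip(pairs, values))
        if not ultrametric(M, n):
            continue
        score = sum(satisfied(M, q) for q in quartets)       # objective
        if score > best_score:
            best, best_score = M, score
    return best, best_score

# Hypothetical complete quartet set on species 0..4. Four quartets agree
# with the tree (((0, 1), 2), (3, 4)); the first one, 02|13, is an error.
Q = [(0, 2, 1, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]
best, score = best_matrix(5, Q)
print(score)  # 4
```

The enumeration visits 4^10 candidate matrices for five species, so it grows far too quickly to be used beyond toy sizes; it is meant only to make the structure of the formulation concrete.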
24Optimizations
25Experiment Results
n: number of species
p: percentage of quartet errors
26Phylogenetic Analysis of SARS
- Severe Acute Respiratory Syndrome (SARS) is recognized as being caused by a coronavirus.
- The coronaviruses are currently divided into three groups.
- The representative viruses from each group are shown as:
27Phylogeny Construction Procedure
- Get the whole genome data and protein data for each virus from the NCBI website.
- Compute a distance matrix M for these viruses.
- Use the quartet-based algorithm to generate a phylogeny from M.
- Compute the average distance between SARS-Cov and the Group 1 (D1), Group 2 (D2), and Group 3 (D3) viruses, respectively.
28Phylogeny on Protein Data with Outgroup
The following is a phylogeny, where SARS-Cov lies
in Group 3.
[Figure: phylogeny on protein data with an out-group and branch lengths; leaves MHV, TGEV, HCov-229E, BCov, HCov-OC43, HCov-NL63, PEDV, IBV and SARS-Cov, clustered into Group 1, Group 2 and Group 3]
D1 = 464.4, D2 = 460.3, D3 = 459
29Phylogeny on Genome Data with Outgroup
Based on genome data, SARS-Cov lies
in a group of its own, but somewhat closer to
Groups 2 and 3.
D1 = 457.4, D2 = 456.1, D3 = 455.2
30Summary
- Our phylogeny construction method can successfully identify the three groups of coronaviruses.
- SARS-Cov lies closer to Groups 2 and 3 than to Group 1. The average distances from SARS-Cov to the Group 2 and Group 3 viruses are approximately the same.
- Our quartet-based method consistently generates the same phylogeny with various outgroups. This phylogeny suggests that SARS-Cov more likely lies in Group 3.