Phylogenetic Networks with Constrained Recombination - PowerPoint PPT Presentation

About This Presentation
Title:

Phylogenetic Networks with Constrained Recombination

Description:

Proof attempt by splitting blob B into B' and B'. The catch is the possibility of a recombination node in B that is needed in both B' and B' ... – PowerPoint PPT presentation

Number of Views:45
Avg rating:3.0/5.0
Slides: 66
Provided by: DanGus8
Category:

less

Transcript and Presenter's Notes

Title: Phylogenetic Networks with Constrained Recombination


1
(No Transcript)
2
Optimal Phylogenetic Networks with Constrained
and Unconstrained Recombination
  • Dan Gusfield
  • UC Davis

Different parts of this work are joint with
Satish Eddhu, Charles Langley, Dean Hickerson,
Yufeng Wu.
3
Reconstructing the Evolution of Binary
Bio-Sequences
  • Perfect Phylogeny (tree) model
  • Phylogenetic Networks (DAG) with recombination
  • Phylogenetic Networks with disjoint cycles
    Galled-Trees
  • Phylogenetic Networks with unconstrained cycles
    Blobbed-Trees
  • Combinatorial Structure and Efficient Algorithms
  • Efficiently Computed Lower Bounds on the number
    of recombinations needed

4
Geneological or Phylogenetic Networks
  • The major biological motivation comes from
    genetics and attempts to reconstruct the history
    of recombination in populations.
  • Also relates to phylogenetic-based haplotyping.
  • Some of the algorithmic and mathematical results
    also have phylogenetic applications, for example
    in hybrid speciation, lateral gene transfer.

5
The Perfect Phylogeny Model for binary sequences

sites
12345
00000
Ancestral sequence
1
4
Site mutations on edges
3
00010
The tree derives the set M 10100 10000 01011 0101
0 00010
2
10100
5
10000
01010
01011
Extant sequences at the leaves
6
Why SNPs?
SNPs imply that the sequences are binary, and
that the order of the sites is fixed (on a
chromosome). This is in contrast to a set of
taxonomic characters, where the order is
arbitrary.
7
The converse problem
Given a set of sequences M we want to find, if
possible, a perfect phylogeny that derives M.
Remember that each site can change state from 0
to 1 only once.
n will denote the number of sequences in M, and m
will denote the length of each sequence in M.
m
01101001 11100101 10101011
M
n
8
When can a set of sequences be derived on a
perfect phylogenywith the all-0 root?
  • Classic NASC Arrange the sequences in a matrix.
    Then (with no duplicate columns), the sequences
    can be generated on a unique perfect phylogeny if
    and only if no two columns (sites) contain all
    four pairs
  • 0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test
9
A richer model

10100 10000 01011 01010 00010 10101 added
12345
00000
1
4
3
00010
2
10100
5
pair 4, 5 fails the three gamete-test. The sites
4, 5 conflict.
10000
01010
01011
Real sequence histories often involve
recombination.
10
Sequence Recombination
01011
10100
S
P
5
Single crossover recombination
10101
A recombination of P and S at recombination point
5.
The first 4 sites come from P (Prefix) and the
sites from 5 onward come from S (Suffix).
11
Network with Recombination

10100 10000 01011 01010 00010 10101 new
12345
00000
1
4
3
00010
2
10100
5
10000
P
01010
The previous tree with one recombination event
now derives all the sequences.
01011
5
S
10101
12
Multiple Crossover Recombination
4-crossovers
2-crossovers gene conversion
13
Elements of a Phylogenetic Network (single
crossover recombination)
  • Directed acyclic graph.
  • Integers from 1 to m written on the edges. Each
    integer written only once. These represent
    mutations.
  • A choice of ancestral sequence at the root.
  • Every non-root node is labeled by a sequence
    obtained from its parent(s) and any edge label on
    the edge into it.
  • A node with two edges into it is a
    recombination node, with a recombination point
    r. One parent is P and one is S.
  • The network derives the sequences that label the
    leaves.

14
A Phylogenetic Network
00000
4
00010
a00010
3
1
10010
00100
5
00101
2
01100
S
b10010
4
S
P
01101
p
c00100
g00101
3
d10100
f01101
e01100
15
Which Phylogenetic Networks are meaningful?
  • Given M we want a phylogenetic network that
    derives M, but which one?

A A perfect phylogeny (tree) if possible. As
little deviation from a tree, if a tree is not
possible. Use as little recombination or
gene-conversion as possible.
16
Minimizing recombinations
  • Any set M of sequences can be generated by a
    phylogenetic network with enough recombinations,
    and one mutation per site. This is not
    interesting or useful.
  • However, the number of (observable)
    recombinations is small in realistic sets of
    sequences. Observable depends on n and m
    relative to the number of recombinations.
  • Two algorithmic problems given a set of
    sequences M, find a phylogenetic network
    generating M, minimizing the number of
    recombinations (Heins problem). Find a network
    generating M that has some biologically-motivated
    structural properties.

17
Minimization is NP-hard
  • The problem of finding a phylogenetic network
    that creates a given set of sequences M, and
    minimizes the number of recombinations, is
    NP-hard. (Wang et al 2000) (Semple 2004)
  • Wang et al. explored the problem of finding a
    phylogenetic network where the recombination
    cycles are required to be node disjoint, if
    possible.
  • They gave a sufficient but not a necessary
    condition to recognize cases when this is
    possible. O(nm n4) time.

18
Recombination Cycles
  • In a Phylogenetic Network, with a recombination
    node x, if we trace two paths backwards from x,
    then the paths will eventually meet.
  • The cycle specified by those two paths is called
    a recombination cycle.

19
Galled-Trees
  • A recombination cycle in a phylogenetic network
    is called a gall if it shares no node with any
    other recombination cycle.
  • A phylogenetic network is called a galled-tree
    if every recombination cycle is a gall.

20
A galled-tree generating the sequences
generated by the prior network.
4
3
1
s
p
a 00010
3
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
21
(No Transcript)
22
Sales pitch for Galled-Trees
  • Galled-trees represent a small deviation from
    true trees.
  • There are sufficient applications where it is
    plausible that a galled tree
  • exists that generates the sequences.
  • Observable recombinations tend to be recent
    block structure of human DNA recombination is
    sparse, so the true history of observable
  • recombinations may be a galled-tree.

The number of recombinations is never more than
m/2. Moreover, when M can be derived on a
galled-tree, the number of recombinations used
is the minimum number over any phylogenetic
network, even if multiple cross-overs at a
recombination event are counted as a single
recombination. A galled-tree for M is almost
unique - implications for reconstructing
the correct history.
23
Old (Aug. 2003) Results
  • O(nm n3)-time algorithm to determine whether
    or not M can be derived on a galled-tree with
    all-0 ancestral sequence.
  • Proof that the galled-tree produced by the
    algorithm is a nearly-unique solution.
  • Proof that the galled-tree (if one exists)
    produced by the algorithm minimizes the number of
    recombinations used, over all phylogenetic-network
    s with all-0 ancestral sequence.

24
New work
  • We derive the galled-tree results in a more
  • general setting that addresses unconstrained
    recombination cycles and multiple
  • crossover recombination. This also solves the
  • problem of finding the most tree-like
  • network when a perfect phylogeny is not
  • possible. In this algorithm, no ancestral
    sequence is known in advance.

25
Blobbed-trees generalizing galled-trees
  • In a phylogenetic network a maximal set of
    intersecting cycles is called a blob.
  • Contracting each blob results in a directed,
    rooted tree, otherwise one of the blobs was not
    maximal.
  • So every phylogenetic network can be viewed as a
    directed tree of blobs - a blobbed-tree.
  • The blobs are the non-tree-like parts of the
    network.

26
Every network is a tree of blobs. How do the
tree parts and the blobs relate?
How can we exploit this relationship?
Ugly tangled network inside the blob.
27
The start of technical stuff
28
Incompatible Sites
  • A pair of sites (columns) of M that fail the
  • 4-gametes test are said to be incompatible.
  • A site that is not in such a pair is compatible.

29
1 2 3 4 5
Incompatibility Graph
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
M
1
3
2
5
Two nodes are connected iff the pair of sites are
incompatible, i.e, fail the 4-gamete test.
THE MAIN TOOL We represent the pairwise
incompatibilities in a incompatibility graph.
30
Simple Fact
  • If sites two sites i and j are incompatible,
    then the sites must be together on some
    recombination cycle whose recombination point is
    between the two sites i and j.
  • (This is a general fact for all phylogenetic
    networks.)

Ex In the prior example, sites 1, 3 are
incompatible, as are 1, 4 as are 2, 5.
31
A Phylogenetic Network
00000
4
00010
a00010
3
1
10010
00100
5
00101
2
01100
S
b10010
4
S
P
01101
p
c00100
g00101
3
d10100
f01101
e01100
32
Simple Consequence of the simple fact
  • All sites on the same (non-trivial) connected
    component of the incompatibility graph
  • must be on the same blob in any blobbed-tree.
  • Follows by transitivity.
  • So we cant subdivide a blob into a tree-like
    structure if it only contains sites from a single
    connected component of the incompatibility graph.

33
Key Result about Galls For galls, the converse
of the simple consequence is also true.
  • Two sites that are in different (non-trivial)
    connected
  • components cannot be placed on the same gall in
  • any phylogenetic network for M.
  • Hence, in a galled-tree T for M each gall
    contains all and only the sites of one
    (non-trivial) connected component of the
    incompatibility graph. All compatible sites can
    be put on edges outside of the galls.

This was the key to the galled-tree solution.
34
Incompatibility Graph
A galled-tree generating the sequences
generated by the prior network.
4
4
3
1
3
2
5
1
s
p
a 00010
2
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
35
Reduced Galled-Trees
  • A galled-tree is called reduced if every gall
    only contains conflicted sites.
  • Theorem If M can be derived on a galled-tree, it
    can be derived on a reduced galled-tree.
  • The number of recombination nodes in a reduced
    galled-tree equals the number of connected
    components of the conflict graph, which is the
    minimum number of recombinations possible in any
    galled-tree.

36
  • A reduced galled-tree for M (if one exists)
    minimizes the number of recombinations used over
    any phylogenetic network for M. The initial proof
    is based on a lower bound method.
  • Another proof (together with Dean Hickerson) The
    minimum number of recombinations needed in any
    phylogenetic network for M is at least the number
    of non-trivial connected components of the
    conflict graph for M.

37
The conflict graph G and a blobbed tree N
  • A conflicted pair of sites must be together on
    some single cycle in N (as before).
  • By transitivity, all the sites in a connected
    component of the conflict graph must be together
    on some single blob of N.
  • Does the 1 - 1 correspondence between connected
    components of G and galls in a galled-tree
    generalize to blobbed trees?

38
The main new result The Partition Theorem
  • For any set of sequences M, there is a
    blobbed-tree T(M) that derives M, where each
    blob contains all and only the sites in one
    non-trivial connected component of the
    incompatibility graph. The compatible sites can
    always be put on edges outside of any blob.
    Moreover, the tree part of T(M) is unique. And it
    is easy to find the tree part.

This is weaker than the result for galled-trees
it replaces must with can.
39
My proof is direct and self-contained, but the
result may also be derivable from
well-established results about splits, for
example, from facts known about Bunemann graphs.
Thanks to Mike Steel for pointing this out. Can
also be derived from earlier work on splits by
Bandelt and Dress.
40
What does T(M) look like? How do we construct it?
How do we exploit it?
  • Let T(M) be the tree where each blob in T(M) has
    been contracted to a single node.
  • Finding T(M) is easy from M. Then we expand
    each node in T(M) to get T(M).

41
How to find T(M)
  • For a connected component C in the conflict
  • graph, let MC be M restricted to the sites in
    C.
  • Consider each distinct sequence S in MC as
    defining a binary character (split) partitioning
    those rows of MC that contain S, and those that
    do not.
  • Let M be the binary matrix obtained from these
    characters, over all connected components in the
    conflict graph.

42
Fact M satisfies the 4-gamete test and so has
a unique undirected perfect phylogeny, which is
T(M).
Then we need to expand each node in T(M) that
corresponds to a blob for connected component C
to some network B so that the external nodes in
B have the labels in MC, when restricted to C.
43
T(M)
v
sites on B from C
Leaf set Tv
B
w
Key point
For any leaf in Tv, and only at those leaves, the
states of the sites in C are the same as at v.
B blob C component of the conflict graph whose
sites are on B
44
Example
4
3
134 010
v
1
s
p
a 00010
3
c 00100
b 10010
d 10100
2
5
s
4
p
g 00101
e 01100
f 01101
45
More Formally Let MC be the matrix M
restricted to the sites in C. Let SC be a
sequence S restricted to the sites in C. Key
point Each distinct (non-zero) sequence in MC
must be the sequence SC for some sequence S
labeling a branching node v on B. So, although
we dont know much about the interior of B, or
the arrangement of the exterior nodes (braching
off of B), we know precisely their number and the
sequences (restricted to sites of C) that label
the exterior nodes on B. And we know that the
states of the non-C sites are identical at each
node in B.
46
1 2 3 4 5
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
C
M
1
3
000
4
B
3
001
1 3 4
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g
01 0
010
1
s
101
p
a
Matrix MC is Matrix M restricted to the columns
in C.
2
110
b
c, e, f, g
d
47
Punch Line
  • Each distinct non-zero sequence SC in MC
    corresponds to an edge e branching off of B, and
    exactly defines the set of sequences in M (and
    hence leaves in T(M)) that must be below e in
    T(M).
  • If SC is the all-zero sequence, then the
  • leaf labeled S must connect to B through the
    coalesenct node of B.
  • These two facts allow the construction of the
    tree part of T(M) in O(nm) time, starting from M.
    It
  • also defines the complete label of each exterior
    node.

48
1 2 3 4 5
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
C2
C1
M
1
3
2
5
B1
B2
1 3 4
2 5
a b c d e f g
0 0 0 0 0 0 0 0 1 0 1 1 0 1
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g
01 0
So the paths to every leaf pass through the blob
B1, but only the paths to e, f, g pass
through gall B2. The path from B1 to B2 exits B1
at the node whose C1-restricted label is 010.
MC1
MC2
49
1 2 3 4 5
a b c d e f g
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
C2
C1
M
1
3
2
5
1 2 3 4 5 6 7 8
1 3 4
2 5
a b c d e f g
1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0
1 0 0 0 0 1
a b c d e f g
0 0 0 0 0 0 0 0 1 0 1 1 0 1
5 5 5 5 6 7 8
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g
01 0
1 2 3 4 3 3 3
M
MC1
MC2
50
Algorithmically
  • Finding the tree part of the blobbed-tree is
    easy.
  • Determining the sequences labeling the exterior
    nodes on any blob is easy.
  • Determining a good structure inside a blob B is
    the problem of generating the sequences of the
    exterior nodes of B.
  • It is easy to test whether the exterior sequences
    on B can be generated with only a single
    (possibly multiple-crossover) recombination. The
    original galled-tree problem is now just the
    problem of testing whether one single-crossover
    recombination is sufficient for each blob.

51
The main open question
  • Is there always a blobbed-tree where each blob
    has all and only the sites of a single connected
    component of the conflict graph, which minimizes
    the number of recombinations over all possible
    phylogenetic networks for M?
  • Proof attempt by splitting blob B into B and B.
    The catch is the possibility of a recombination
    node in B that is needed in both B and B.

52
Can prove that if B has at most three
recombination nodes, then this cannot happen.
Also, a recombination node cannot be in both B
and B if it has no recombination node below it
in B.
53
If Yes,
Then any method that computes a lower bound on
the number of needed recombinations should be
applied separately on sequences defined by the
sites on each connected component, and the
results added together. This may be true even if
the desired result is not. It is worth trying to
prove this.
54
How to arrange the sites on a blob
  • Given a single connected component of the
    conflict graph with k sites, how do we arrange
    those k sites to generate the required sequences,
    using only one
  • (multiple-crossover) recombination,
  • or using a multiple-crossover recombination
  • with the fewest cross-overs?

55
Arranging the sites
  • We will describe an O(n3) time method to arrange
    all of the blobs. O(n2) time is possible with
    a more complex method when only single-crossover
  • recombinations are allowed.

56
A needed fact in words
  • Let Q be a gall for the sites on
    connected-component C of the conflict graph.
  • Let MC be the matrix M restricted to the
    sites on C.
  • Let LQC be the sequences labeling the nodes of
    Q, restricted to the sites on C.
  • Claim The two sets of sequences are identical,
    i.e.,
  • MC LQC.

57
1 2 3 4 5
a b c d e f g
LQC are the node labels on Q restricted to the
sites in C
0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0
0 0 1 1 0 1 0 0 1 0 1
4
C
M
1
3
LQC
4
Q
3
001
1 3 4
a 0 0 1 b 1 0 1 c 0 1 0 d 1 1 0 e 0 1 0 f 0 1 0 g
01 0
010
1
s
101
p
a
Matrix MC is Matrix M restricted to the columns
in C.
2
110
b
c, e, f, g
d
Fact MC LQC
58
The idea for arranging the sites of C on B via a
short movie
59
4
Q
3
001
010
1
s
101
p
a
2
110
b
c, e, f, g
d
60
4
Q
3
001
010
1
101
a
b
c, e, f, g
110
d
61
4
B
3
001
010
1
101
a
c, e, f, g 010
b 101
110
d
Blob B minus the recombination node is a perfect
phylogeny for MC minus the recombinant
sequence all sites are on one or two paths from
the root and the two end sequences of those
paths can recombine at point r to recreate the
recombinant sequence.
62
The point
  • If we remove the recombinant node from B,
  • we have a phylogenetic tree (no cycles) for
  • the remaining sequences and hence
  • a perfect phylogenetic tree for the sequences in
    MC minus the recombinant sequence of B.
  • The sites in this tree are on one or two paths.
  • Moreover, the two end sequences on that perfect
    phylogeny can recombine to create the removed
    recombinant sequence.

63
The algorithm for arranging a blob B for C
1.Form the matrix MC. 2. For each row of MC,
remove the row, see if there is a perfect
phylogeny for the remaining rows. If yes, see if
the sites are in one or two paths, and the end
sequences can generate the removed row by a
recombination. Fact Every row that works gives
a permitted arrangement of the sites on B.
64
Optimality of Galled-Trees
Theorem (G,H,B,B) The minimum number of
recombination nodes in any phylogenetic network
for M is at least the number of non-trivial
connected components of the incompatibility
graph. Hence, when the sequences on each blob
on T(M) can be generated with a single
recombination node, the blobbed-tree
minimizes the number of recombination nodes over
all phylogenetic networks and all choices of
ancestral sequence. This solves the root-unknown
galled-tree problem in polynomial time. Code is
on the web.
65
The number of arrangements on agall (all-0
ancestral sequence)
Following the algorithmic approach above, one
can prove that the number of arrangements of any
gall is at most three, and this happens only if
the gall has two sites. If the gall has more
than two sites, then the number of arrangements
is at most two. If the gall has four or more
sites, with at least two sites on each side of
the recombination point (not the side of the
gall) then the arrangement is forced and unique.
66
What is the most tree-like Network?
  • Assume no node with only one child. Contracting
    each blob to a single node results in a tree. The
    number of edges in that tree is a measure of how
    tree-like the network is. The larger the number
    of edges,
  • the more tree-like is the network.
  • Then T(M) is the most tree-like network.

67
Practical and Accurate Lower Bound Computation
  • Unconstrained recombination.
  • Want efficient computation of lower bounds on the
    minimum number of recombinations needed.
  • Many methods HK, Haplotype, (history),
    Connected Comps, Composite Interval Composite,
    Gall-Interval Composite Method, Subsets, Recmin,
    Composite ILP. Some are specific for
    recombination, and some apply to any reticulation
    event.

68
The grandfather of all lower bounds - HK 1985
  • Arrange the nodes of the incompatibility graph on
    the line in order that the sites appear in the
    sequence.
  • The HK bound is the minimum number of vertical
    lines needed to cut every edge in the
    incompatibility graph. Weak bound, but widely
    used - not only to bound the number of
    recombinations, but also to suggest their
    locations.

69
HK Lower Bound
1 2 3 4 5
70
HK Lower Bound 1
1 2 3 4 5
71
More general view of HK
Given a set of intervals on the line, and for
each interval I, a number N(I), find the minimum
number of vertical lines so that every interval
I intersects at least N(I) of the vertical
lines. In HK, each incompatibility defines an
interval I where N(I) 1. Other methods find
different intervals and larger N(I) values. The
general problem is easy to solve by a
left-to-right myopic placement of vertical lines
sort the intervals by right end-point Process
the intervals left to right in that order when
the right endpoint of an interval I is reached,
place there (if needed) additional vertical so
that N(I) lines intersect I.
72
The general approach is called the Composite
Interval Method (Myers).
73
New bound 2003
  • The number of non-trivial connected components in
    the incompatibility graph (Gusfield, Hickerson
    Bafna, Bansal) - weak bound when used alone, but
    effective when used with the Composite Interval
    Method.

CC Bound 2 in the prior example.
74
Haplotype Bound (Simon Myers)
  • Rh Number of distinct sequences (rows) - Number
    of distinct sites (columns) -1 (folklore)
  • Before computing Rh, remove any site that is
    compatible with all other sites. A valid lower
    bound results - generally increases the bound.
  • Generally really bad bound, often negative, when
    used alone, but Very Good when used in the
    Composite Interval Method, and Amazing when used
    in the Composite Subset Method, and More than
    Amazing when used in the Composite ILP Method.

75
Composite Interval Method (Simon Myers)
  • Compute Rh separately for each of the C(m,2)
    intervals of the m sites Let Rh(I) be the bound
    for interval I. This is called a local bound.
    Compute the Minimum number of vertical lines so
    that each interval I is cut at least Rh(I) times
    (a trivial algorithm).
  • The HK bound is a specialization of this where
    Rh(I) is 0 or 1. Also can use the connected
    component bound for the local bounds, or take the
    best of both in each interval.

76
Composite Subset Method
  • Let S be subset of sites, and Rh(S) be the
    haplotype bound for subset S. If the leftmost
    site in S is L and the rightmost site in S is R,
    then use Rh(S) as a local bound for interval
    S,L.
  • Compute Rh(S) on many subsets, and then use the
    Composite method to find an overall bound.

77
RecMin (Myers)
  • World Champion Lower Bound Program (available)
    Great Program (Myers). Often RecMin gives a bound
    three times as large as HK.
  • Uses Composite Subsets with Rh, but limits the
    size and the span of the subsets. Default
    parameters are s 6, w 15 (s size, w
    span).
  • Generally, impractical to set s w m, so
    generally one doesnt know if increasing the
    parameters would increase the bound.

78
Optimal RecMin Bound (ORB)
  • The Optimal RecMin Bound is the lower bound that
    RecMin would produce if both parameters were set
    to their maximum values (s w m).
  • In general, RecMin cannot compute (in practical
    time) the ORB.
  • Practical computation of the ORB is our first
    contribution.

79
Computing the ORB
  • Gross Idea For each interval I, use ILP to find
    a subset of sites that maximizes Rh over all
    subsets in interval I. Set N(I) to the found Rh
    value, and use these values in the Composite
    Method.
  • The result is guaranteed to be the ORB. i.e, the
    bound one would get by using all 2m subsets in
    RecMin.

80
Now instead of doing an exponential number of
simple computations (computing Rh for each
subset), we have a quadratic number of (possibly
expensive) ILPs to solve. Is this a good
trade-off in practice? Our experience - yes!
81
Two critical tricks
  • Formulate the ILP as a test set problem Find
    the minimum
  • number of sites so that each pair of haplotypes
    differs at one
  • site at least. Compared to other ILP
    formulations, this allows large
  • eliminations of redundant inequalities, and
    greatly speeds up the
  • ILP solution.
  • 2) Reduce the number of intervals examined

I
If the optimal subset in the interval I only
spans interval L, then there is no Need to
examine any interval containing L. Each ILP calls
at most 4 other ILPs.
L
82
ILP Compsite method efficiency
  • Only C(m,2) intervals, but the time for each is
    more than for the Hap bound.
  • With tricks we need to solve the ILP for only
    0.5 of all the C(m,2) intervals (empirical
    result). Shockingly fast in practice.

83
Typical Result
  • With n 40, m 100, r 10 (ms recombination
    parameter)

HK gives 4 Composite Intervals with Hap Bound or
CC bound gives 6 Recmin Composite Subsets with
Hap Bound and subsets of size up to 20 in
windows of size 20 9 (7 secs) Composite
Subsets with Hap Bound and subsets of size up
to 30 in windows of size 30 10 (takes 10 mins
on Powerbook G4) ILP Composite gets 10 and takes
1 second on Linux Box.
84
Papers and Software
  • wwwcsif.cs.ucdavis.edu/gusfield/
Write a Comment
User Comments (0)
About PowerShow.com