Combinatorial Approaches to Haplotype Inference - PowerPoint PPT Presentation

About This Presentation
Title:

Combinatorial Approaches to Haplotype Inference

Description:

Super-charging Clark's method of 1990 - Two ideas: Min haps ... Generic Clark Method. Given a known haplotype H (original ... Method. How? ... – PowerPoint PPT presentation

Number of Views:74
Avg rating:3.0/5.0
Slides: 44
Provided by: DanGus8
Category:

less

Transcript and Presenter's Notes

Title: Combinatorial Approaches to Haplotype Inference


1
Combinatorial Approaches to Haplotype Inference
  • Dan Gusfield
  • CS, UC Davis

2
Three topics
  • Super-charging Clarks method of 1990 - Two
    ideas Min haps selection Consensus
  • Follow-on to Haplotyping By Perfect Phylogeny -
    a more biological view, and phase transitions,
    recombinations (?)
  • Computing True Parsimony Solutions,
    Clark/Parsimony hybrids - Integer Programming

3
Topic I Supercharging Clarks Method
S. Orzack, D. Gusfield, V. Stanton
Genetics, in revision
4
Generic Clark Method
Basic Step
Given a known haplotype H (original homozygote or
single-site heterozygote, or previously
inferred), and an unresolved genotype G, if G can
be explained by H and another vector H, then
call H a known haplotype, available for
additional inferrals. example H 0 1 0 0 1
G 2 1 0 2 2
G is resolved by H and H
------------------ H 1 1
0 1 0
In a single run, repeat the basic step until
stuck - resolve as many genotypes as possible in
the data.
Clark (1990) Randomize choices, and do the
computations many times to find an execution
(run) that explains the most genotypes.
5
Many variations of Clark
  • Variations based on which parts are randomized.
  • We closely examine eight variations on a real
    data set. Variation 1 randomizes every decision -
    probably more than Clark originally intended.
  • Truth in advertising - we implemented our own
    Clark versions - did not actually use Clarks
    software.

6
Super-charging Clarks method
  • Why? The reported accuracy is significantly
  • less than that of PHASE and HAPLOTYPER.
  • Answers 1) Scale
  • 2) Deeper insight

7
Supercharging Clarks Method
  • How? Run many times (10,000) to get (generally) a
    huge range of solutions, often 10,000 different
    solutions. This is a good thing!
  • Select the (few - in the 10s) runs that use the
    smallest number of distinct haplotypes over the
    10,000 runs. Let S be that set of runs.
  • Vote - for each individual find the haplotype
    pair used most often in S. This gives the
    Consensus solution.

8
Results on Lab Determined Data
  • APOE locus
  • 80 individuals, 9 polymorphic sites, 47 ambiguous
    after homozygotes and single-site heterozygotes
    identified - lab determined haplotypes

9
Variation 1
Average correct over all 10,000 runs 29.0 24
runs that use 20 haplotypes (smallest
observed) or 21 distinct haplotypes Average
correct over those 24 runs 36.1 0.9987
percentile of all the 10,000 runs Correct in the
Consensus solution from those 24 runs 39 which
is equal to the best of the 10,000 executions
10
Variation 4
Average over all 10,000 runs 18.3 Consensus
number of correct 42
11
Consensus Thresholds
The calls that are made with high frequency have
high reliability. We observed few or no false
calls among the haplotype pairs that were called
with high frequency (85 up). This identifies a
subset of calls that can be believed with
high confidence, and a subset that has to be
determined by other means. Bottom-up method to
molecularly determine the haplotypes.
12
Comparison with Phase and Haplotyper
  • Phase consistently gets 42 correct. 1 false call
    with confidence value gt 0.9
  • Haplotyper high variation.

13
distribution of high confidence wrong calls
distribution of calls that dont explain the
genotype
distribution of accuracy
14
Topic 2 Haplotyping as Perfect Phylogeny
15
The Perfect Phylogeny Model
  • We assume that the evolution of extant haplotypes
    can be displayed on a rooted, directed tree, with
    the all-0 haplotype at the root, where each site
  • changes from 0 to 1 on exactly one edge, and
    each extant haplotype is created by accumulating
    the changes on a path from the root to a leaf,
    where that haplotype is displayed.
  • In other words, the extant haplotypes evolved
    along a perfect phylogeny with all-0 root.

16
The Perfect Phylogeny Model
sites
12345
00000
Ancestral haplotype
1
4
Site mutations on edges
3
00010
2
10100
5
10000
01010
01011
Extant haplotypes at the leaves
17
Justification for Perfect Phylogeny Model
  • In the absence of recombination each haplotype of
    any individual has a single parent, so tracing
    back the history of the haplotypes in a
    population gives a tree.
  • Recent strong evidence for long regions of DNA
    with no recombination. Key to the NIH haplotype
    mapping project. (See NYT October 30, 2002)
  • Mutations are rare at selected sites, so are
    assumed non-recurrent.
  • Connection with coalescent models.

18
Solving the Haplotype Phylogeny Problem (PPH) in
nearly linear time
Gusfield, RECOMB, April 2002
  • Finds if there is a solution.
  • Counts the number of solutions.
  • Implicitly represents all the solutions

19
Program PPH
  • Program PPH solves the perfect phylogeny
    haplotyping problem using the graph realization
    approach. It solves problems with 50 sites and
    100 individuals in about 1 second.
  • Program PPH can be obtained at
  • www.cs.ucdavis.edu/gusfield

20
An alternative solution
Recently, we developed an algorithm that is not
based on graph realization, and which is much
easier to understand. However, it runs in O(nm2)
time. V. Bafna, D. Gusfield, G. Lancia, S.
Yooseph See Gusfields website.
21
Representing all the solutions
All the solutions to the PPH problem can be
implicitly represented in a data structure
which can be built in linear time, once one
solution is known. Then each solution can be
generated in linear time per solution.
22
The implicit representation A column partition
The algorithm partitions the sites (columns) into
classes, such that within any class, the phase
relationship (in, or out of phase) of any two
columns is INVARIANT over all perfect
phylogenies for the data. Between any two
classes, the phase relationship can be set
ARBITRARILLY. All solutions can be genetated in
this way. That is the representation of the set
of all solutions. The number of solutions is
always a power of two.
23
1 2 3 4 5 6 7
a
An example. Each row starts duplicated
for simplicity.
a
b
b
c
c
d
d
e
e
24
1 2 3 4 6 5 7
Starting from a PPH Solution, if all shaded
cells in a class switch value, then the result
is also a PPH solution, and any PPH solution can
be obtained in this way, i.e. by choosing in
each block whether to switch or not.
a
a
b
b
c
c
d
d
e
e
25
Secondary information and optimization
  • The partition shows explicitly what added
    information is useful and what is redundant.
    Information about the relationship of a pair of
    columns is redundant if and only if they are in
    the same class of the column partition. Apply
    this successively as additional information is
    obtained.
  • Problem Minimize the number of haplotype pairs
    (individuals) that need be laboratory determined
    in order to find the correct tree.
  • Minimize the number of (individual, site1, site2)
    triples whose phase relationship needs to be
    determined, in order to find the correct tree.

26
The implicit representation of all solutions
provides a framework for solving these secondary
problems, as well as other problems involving the
use of additional information, and specific
tree-selection criteria.
27
A Phase-Transition
Problem, as the ratio of sites to genotypes
changes, how does the probability that the PPH
solution is unique change? For greatest utility,
we want genotype data where the PPH solution is
unique. Intuitively, as the ratio of genotypes
to sites increases, the probability of uniqueness
increases.
28
Frequency of a unique solution with 50 and 100
sites, 5 rule and 2500 datasets per entry
geno. Frequency of unique
solution
29
Data with recombination
If the genotype data does not fit a perfect
phylogeny, we decompose the sites (greedily left
to right) into maximal intervals that do fit a
unique perfect phylogeny. Experimental results
number of intervals reduces to about a tenth. The
solutions are highly accurate in each interval.
Then we have a phase problem again between the
intervals. The final haplotype determination can
be made with one site per interval, greatly
reducing any lab effort.
30
Topic 3 The True Parsimony Objective
  • For a set of genotypes, find a Smallest set H of
    haplotypes, such that each genotype can be
    explained by a pair of haplotypes in H.

No Approximations - find the true smallest, and
know for sure you have it !
31
Pure Parsimony
  • Explain the N genotypes using the fewest number
    of distinct haplotypes.
  • Why? Empirically few haplotypes are seen in
    populations Coalescent theory analogy to other
    parsimony criteria common attempts to interpret
    Clarks haplotype inferral method as a parsimony
    method all methods are to a large extent
    parsimony methods - Donnelly yesterday.

32
Example of Parsimony
00100 01110
  • 02120

3 distinct haplotypes set S has size 3
01110 10110
22110
00100 10110
20120
33
Pure Parsimony is NP-hard
Earl Hubbel (Affymetrix) showed that Pure
Parsimony is NP-hard. However, for a range of
parameters of current interest (50 sites and 50
genotypes) a True Parsimony solution can be
computed efficiently, using Integer Linear
Programming, and one speed-up trick. For larger
parameters (100 sites and 50 genotypes) A
near-parsimony solution can be found efficiently.
34
How Fast? How Good?
Depends on the level of recombination in the
underlying data. True Parsimony can be computed
in seconds to minutes for most cases with 50
genotypes and up to 60 sites, faster as the level
of recombination increases.
As the level of recombination increases, the
accuracy of the True Parsimony Solution falls,
but remains within 10 of the quality of PHASE
(for comparison).
35
The APOE Data
There are 17 distinct haplotypes in the real
data. The IP finds a True Parsimony Solution
with 15 distinct haplotypes. PHASE and
HAPLOTYPER each use 15 haplotypes. Over the
10,000 executions, Clark variation 1 used a
minimum of 20 haplotypes.
36
Clark/Parsimony Hybrid
For low recombination, but many sites (say gt
100), the pure IP approach blows up. But by
connecting to Clarks approach, using the
Digraph view of Clark (Gusfield ISMB 2000, JCB
2001), the size of the IP is dramatically
smaller, and can run on a very large number of
sites. On datasets where the pure IP can be
run, the hybrid method is only a bit inferior to
the pure IP method.
37
Clark/Parsimony Hybrid
For low recombination, large (gt60) sites
  • Find an execution of Clarks method that
  • maximizes the number of genotypes resolved
  • minimizes the number of distinct haplotypes used

We can do this by mixing the Digraph View of
Clarks method (Gusfield 2001) with the
parsimony criteria, and truly find an execution
of Clarks method that minimizes the number
of distinct haplotypes used. On datasets where
we can compute True Parsimony, this hybrid does
only a bit worse than True Parsimony.
38
Other uses of IP
On datasets where we know the solution, find the
best that a Clark method can ever do. IP can
find the best possible execution. On the APOE
data, Clarks method can get all get 47 correct!
In fact in a huge number of ways. (But the best
we found by actually running Clarks method was
42 correct). This kind of test is not possible
for Statistical methods.
39
The Conceptual Integer Programming Formulation
For each genotype (individual) j, create one
integer programming variable Yij for each pair of
haplotypes whose merge creates genotype j. If j
has k 2s, then This creates 2(k-1) Y
variables. Create one integer programming
variable Xq for Each distinct haplotype q that
appears in one of the pairs for a Y variable.
40
Conceptual IP
For each genotype, create an equality that says
that exactly one of its Y variables must be set
to 1. For each variable Yij, whose two
haplotypes are given variables Xq and Xq,
include an inequality that says that if variable
Yij is set to 1, then both variables Xq and Xq
must be set to 1. Then the objective function is
to Minimize the sum of the X variables.
41
Example
02120 Creates a Y variable Y1 for pair 00100 X1

01110 X2 and a Y variable Y2 for
pair 01100 X3

00110 X4
Include the following (in)equalities into the IP
The objective function will include the
subexpression X1 X2 X3 X4 But any X
variable is included exactly once no matter how
many Y variables it is associated with.
Y1 Y2 1 Y1 - X1 lt 0 Y1 - X2 lt 0 Y2
- X3 lt 0 Y2 - X4 lt 0
42
Efficiency Tricks
Ignore any Y variable and its two X variables if
those X variables are associated with no other Y
variable. The Resulting IP is much smaller, and
can be used to find the optimal to the conceptual
IP. Also, we need not enumerate all X pairs for
a given genotype, but can efficiently recognize
the pairs we need.
43
Avoiding Enumeration of unneeded haplotypes
For each pair of genotypes, G1, G2 it is easy to
find all the haplotypes that appear in an
explanation for G1 and in an explanation for
G2. Example 0 2 1 1 0 2 0 2 0
1 1 1 2 2 0 2 0 1 1 1 0 V 0 2 V and then
generate all combinations of 0,1s over the V
sites. So the time is O(m x haps in both
explanation sets)
44
Recombination Helps!
As the level of recombination increases, the
number of haps in two explanation sets decreases,
so the size of the IP falls.
45
Full paper on Parsimony
See the technical report Haplotyping by Pure
Parsimony available at wwwcsif.cs.ucdavis.edu/gus
field/
46
Thanks to Andy, Sorin and Mike.
Write a Comment
User Comments (0)
About PowerShow.com