Title: Haplotype inference
1Haplotype inference
2Haplotype inference problem
- Input: unphased genotypes of N individuals at M markers.

  Individual   1     2     3    ...   M
  1           A/C   G/T   A/A   ...  A/A
  2           A/A   G/G   A/T   ...  A/A
  ...
  N           A/C   T/T   A/A   ...  A/T
3Haplotype inference problem
- Output: a pair of haplotypes for each individual.

  Individual   1   2   3   ...  M
  1            A   G   A   ...  A
               C   T   A   ...  A
  2            A   G   T   ...  A
               A   G   A   ...  A
  ...
  N            A   T   A   ...  T
               C   T   A   ...  A
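The step from genotypes to haplotypes can be sketched in a few lines of Python. This is a minimal illustration, not part of any of the programs discussed later; the function name `compatible_pairs` and the tuple-based genotype encoding are assumptions made for the example. It lists every unordered haplotype pair consistent with one unphased genotype, using the first individual (A/C G/T A/A A/A) as input.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per marker
              (single-character alleles are assumed).
    """
    het = [k for k, (a, b) in enumerate(genotype) if a != b]
    if not het:
        hap = "".join(a for a, _ in genotype)
        return [(hap, hap)]
    pairs = []
    # Keep the first heterozygous marker in its given orientation so that
    # each unordered pair is listed only once.
    for flips in product([False, True], repeat=len(het) - 1):
        orient = dict(zip(het, (False,) + flips))
        h1, h2 = [], []
        for k, (a, b) in enumerate(genotype):
            if orient.get(k, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

individual_1 = [("A", "C"), ("G", "T"), ("A", "A"), ("A", "A")]
pairs_1 = compatible_pairs(individual_1)
# pairs_1 holds two candidates: AGAA/CTAA and ATAA/CGAA
```

With two heterozygous markers there are two candidate pairs; the genotype data alone cannot tell them apart, which is the inference problem.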
4Haplotype inference problem
- N: sample size; not a serious concern.
- M: number of markers; the main source of worry.
- Large M causes numerical problems for EM: there are too many haplotypes to keep track of.
- Large M produces many distinct haplotypes, so unambiguous haplotypes are rare and Clark's algorithm has difficulty getting started.
- Solutions: PHASE, SNPHAP, PL, HAP, ...
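A rough count shows why M is the worry. Assuming biallelic markers (an assumption; the slides do not state it), and writing k for the number of heterozygous markers in one individual:

```latex
\[
\#\{\text{possible haplotypes}\} \le 2^{M},
\qquad
\#\{\text{compatible haplotype pairs for one individual}\} = 2^{k-1} \quad (k \ge 1).
\]
```

For example, M = 20 already allows up to 2^20 = 1,048,576 haplotypes, and an individual heterozygous at 10 markers has 2^9 = 512 compatible pairs.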
5PHASE
- Very accurate so far, but also complicated.
- Based on the coalescent model.
- Samples from the conditional distribution $\Pr(H_i \mid G, H_{-i})$ using an approximation based on a general mutation model (Stephens and Donnelly 2000).
6Example
- The next haplotype is likely to look either
exactly the same as or similar to a haplotype
that has been observed
7Algorithm
- Start with some initial haplotype reconstruction $H^{(0)}$. For $t = 0, 1, 2, \ldots$, obtain $H^{(t+1)}$ from $H^{(t)}$ using the following three steps:
- Choose an individual, $i$, uniformly at random from all ambiguous individuals (i.e., individuals with more than one possible haplotype reconstruction).
- Sample $H_i^{(t+1)}$ from $\Pr(H_i \mid G, H_{-i}^{(t)})$, where $H_{-i}$ is the set of haplotypes excluding individual $i$.
- Set $H_j^{(t+1)} = H_j^{(t)}$ for $j = 1, \ldots, n$, $j \neq i$.
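The three steps can be sketched as a single update function. This is a minimal sketch under stated assumptions: the names `gibbs_step`, `candidates` and `pseudo` are hypothetical, and the conditional distribution used in step 2 is a simple count-based stand-in, not the coalescent-based approximation that PHASE uses.

```python
import random
from collections import Counter

def gibbs_step(candidates, current, rng, pseudo=0.5):
    """One update of the three-step scheme.

    candidates: candidates[i] lists the haplotype pairs compatible with
                individual i's genotype.
    current:    current[i] is the pair currently assigned to individual i.
    Returns a new assignment list; `current` is not modified.
    """
    ambiguous = [i for i, c in enumerate(candidates) if len(c) > 1]
    if not ambiguous:
        return list(current)
    # Step 1: pick an ambiguous individual uniformly at random.
    i = rng.choice(ambiguous)
    # Haplotype counts among everyone except individual i (H_{-i}).
    counts = Counter(h for j, pair in enumerate(current) if j != i for h in pair)
    # Step 2: stand-in conditional (an assumption, not PHASE's formula):
    # weight each candidate pair by (count + pseudo) for both haplotypes.
    weights = [(counts[h1] + pseudo) * (counts[h2] + pseudo)
               for h1, h2 in candidates[i]]
    new_pair = rng.choices(candidates[i], weights=weights, k=1)[0]
    # Step 3: everyone else keeps their current haplotypes.
    updated = list(current)
    updated[i] = new_pair
    return updated

candidates = [
    [("AG", "AG")],                 # unambiguous
    [("AG", "CT"), ("AT", "CG")],   # ambiguous: two reconstructions
    [("CT", "CT")],                 # unambiguous
]
state = [("AG", "AG"), ("AT", "CG"), ("CT", "CT")]
rng = random.Random(0)
for _ in range(200):
    state = gibbs_step(candidates, state, rng)
```

Only the ambiguous individual is ever resampled; the unambiguous ones act as fixed evidence that pulls the sampler toward haplotypes already seen.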
8Details
- Informally, the approximation corresponds to the next sampled haplotype, $h$, being obtained by applying a random number of mutations, $s$, to a randomly chosen existing haplotype, $\alpha$, where $s$ is sampled from a geometric distribution. The formula arose from consideration of the distribution of the genealogy relating randomly sampled individuals, as described by the coalescent. In particular, future-sampled chromosomes will tend to be more similar to previously sampled chromosomes as the sample size $r$ increases and as the mutation rate $\theta$ decreases.
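The informal description translates directly into a small generative sketch. Several details here are assumptions made for illustration: the function name `sample_next_haplotype`, the 0/1 allele coding, the uniform choice of mutated site, and the geometric parameterisation with continuation probability theta / (r + theta), chosen because it matches the stated behaviour (fewer mutations as r grows or theta shrinks).

```python
import random

def sample_next_haplotype(existing, theta, rng, alleles=("0", "1")):
    """Draw a haplotype by mutating a randomly chosen existing one.

    existing: list of haplotype strings in the sample, with repeats, so a
              uniform draw picks each distinct type in proportion to its count.
    theta:    mutation rate parameter (assumed given).
    Returns (haplotype, number_of_mutations).
    """
    r = len(existing)
    base = list(rng.choice(existing))
    # Geometric number of mutations: keep adding one with
    # probability theta / (r + theta).
    s = 0
    while rng.random() < theta / (r + theta):
        s += 1
    for _ in range(s):
        # Assumed mutation mechanism: pick a site uniformly, flip its allele.
        site = rng.randrange(len(base))
        base[site] = alleles[1] if base[site] == alleles[0] else alleles[0]
    return "".join(base), s

existing = ["0101", "0101", "1100"]
new_hap, n_mut = sample_next_haplotype(existing, theta=1.0, rng=random.Random(3))
```

With theta = 0 the draw is always an exact copy of an existing haplotype; larger theta allows more mutations and therefore more novel haplotypes.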
9Gibbs Sampler
Geman and Geman 1984; Gelfand and Smith 1990
10Gibbs sampler
- Want to sample from $P(H_1, H_2, \ldots, H_N)$.
- Sample from $P(H_1 \mid G, H_2, \ldots, H_N)$.
- Sample from $P(H_2 \mid G, H_1, H_3, \ldots, H_N)$.
- ...
- Sample from $P(H_N \mid G, H_1, \ldots, H_{N-1})$.
- For large M, update only a subset of the loci of an individual: split $H_i$ into $H_i(S)$ and $H_i(-S)$, where $S$ is a subset of the ambiguous loci for individual $i$.
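The subset update in the last bullet can be illustrated as follows. This is a sketch with a hypothetical helper name, `reconstructions_on_subset`; it only enumerates the reconstructions a subset update would choose among, and leaves the sampling weights to whichever conditional distribution is in use.

```python
from itertools import product

def reconstructions_on_subset(pair, subset):
    """List haplotype pairs obtained by re-phasing only the loci in `subset`.

    pair:   (h1, h2), the individual's current haplotypes as strings.
    subset: indices S of heterozygous loci to update; loci outside S
            keep their current phase.
    """
    h1, h2 = pair
    results = []
    for flips in product([False, True], repeat=len(subset)):
        a, b = list(h1), list(h2)
        for locus, flip in zip(subset, flips):
            if flip:
                a[locus], b[locus] = b[locus], a[locus]
        results.append(("".join(a), "".join(b)))
    return results

# Three heterozygous loci (0, 1, 2), but only loci 0 and 1 are updated.
options = reconstructions_on_subset(("AGTA", "CTAA"), subset=[0, 1])
```

The number of options is 2^|S| regardless of how many ambiguous loci the individual has in total, which is what keeps each update cheap when M is large.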
11Remarks
- This is a pseudo-Gibbs sampler scheme: only the conditional distributions can be written down and the joint distribution is not known, so the sampler runs the risk of divergence.
- This is acceptable for variables taking finitely many discrete values.
12Haplotyper
- Another Bayesian model-based algorithm. Uses the EM model with a Dirichlet prior.
- The joint distribution can be expressed as follows: [equation not recovered]
- Sample a pair of compatible haplotypes for each subject according to: [equation not recovered]
13Partition-Ligation
14Partition-Ligation
15Within Segment
- EM or Gibbs Sampler
- Haplotype and Frequency only
- Retain top ones only
16Partition-Ligation
17Ligation Step
[Figure: ligation step joining the partial haplotypes of the left segment and the right segment]
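The ligation step can be sketched as a Cartesian product of the retained partial haplotypes. This is a simplified illustration, not the published implementation: the function name `ligate`, the dictionary input, and the example frequencies are all made up for the example.

```python
from itertools import product

def ligate(left, right, buffer_size):
    """Combine partial haplotypes from two adjacent segments.

    left, right: dicts mapping partial haplotype string -> estimated frequency.
    Returns candidate haplotypes built from the top `buffer_size`
    partial haplotypes of each segment.
    """
    def top(freqs):
        ranked = sorted(freqs, key=freqs.get, reverse=True)
        return ranked[:buffer_size]
    return [l + r for l, r in product(top(left), top(right))]

# Hypothetical within-segment estimates.
left = {"AG": 0.6, "CT": 0.3, "AT": 0.1}
right = {"AA": 0.7, "TA": 0.2, "AT": 0.1}
candidates = ligate(left, right, buffer_size=2)
```

With a buffer of 2 per segment, only 4 of the 9 possible combinations survive; the frequencies of the ligated candidates would then be re-estimated by EM or the Gibbs sampler.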
18Partition-Ligation
19Progressive Ligation
20Related to SNPHAP
- SNPHAP is an EM-based algorithm developed by David Clayton.
- SNPHAP starts by fitting 2-locus haplotypes and extends the solution by one locus at a time.
- This solves the large-M problem, but not efficiently. Progressive ligation will be more efficient.
21Remarks
- The idea of PL is to reduce the search space when searching for the ML solution: kill partial solutions that have little hope of being part of the ML estimates.
- It has a nice biological interpretation: the haplotype block. It has been shown (Niu et al. 2002) that partitioning at recombination hot spots results in improved accuracy compared with cutting at random.
22Haplotype Block
- Blocks, recombination hot spots, htSNP
- Daly et al., Nat Genet 2001
- Patil et al., Science 2001, Chromosome 21
- Dawson et al., Nature 2002, Chromosome 22
- Zhang et al., PNAS 2002
- Jeffreys et al., Nat Genet 2001
23Factors Influencing the Performance
- Partition Sites
  - Partitioning at recombination hotspots improves the performance marginally.
  - Allow the user to specify desirable partition points.
- Atomistic Unit Size
  - Little difference in performance.
  - The computation time increased sharply when the coarsest partition was used (K = 5-8 appeared to be a good choice).
- Buffer Size
  - Increasing the buffer size improves the performance.
24Remarks
- How to decide on segment boundaries, the optimal size of each segment, and how many haplotypes to keep for the next round are all open questions. They are not yet well addressed and may affect the results.
25Other techniques used
- Predictive updating to integrate out the nuisance parameter $\Theta$.
- Prior annealing: large pseudo-count at the beginning, small pseudo-count near the end, decrescendo in the middle.
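A prior-annealing schedule can be sketched as a decreasing sequence of pseudo-counts. The geometric decay, the function name `pseudo_count_schedule`, and the start and end values are assumptions made for illustration; the slides only state that the pseudo-count goes from large to small.

```python
def pseudo_count_schedule(n_iter, start=10.0, end=0.01):
    """Geometric decay of the Dirichlet pseudo-count from `start` to `end`."""
    if n_iter == 1:
        return [start]
    ratio = (end / start) ** (1.0 / (n_iter - 1))
    return [start * ratio ** t for t in range(n_iter)]

schedule = pseudo_count_schedule(5)
```

A large pseudo-count early on flattens the haplotype frequency estimates so the sampler moves freely; shrinking it lets the data dominate near the end.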
26Lin, Cutler and Chakravarti
- Similar to PHASE.
- Looks for matches only at positions where the individual is heterozygous, ignoring the data at positions where the individual is homozygous.
- The benefit is that the algorithm never reaches the situation where no matching haplotypes exist, and it therefore avoids choosing randomly between all possible reconstructions.
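The effect of restricting the comparison to heterozygous positions can be shown with a small sketch. It illustrates the matching idea only and is not the published algorithm; the helper names `het_positions` and `count_matches` and the example pool are hypothetical.

```python
def het_positions(genotype):
    """Indices of markers where the individual is heterozygous."""
    return [k for k, (a, b) in enumerate(genotype) if a != b]

def count_matches(candidate, pool, positions):
    """Count pool haplotypes equal to `candidate` at the given positions."""
    return sum(all(h[k] == candidate[k] for k in positions) for h in pool)

genotype = [("A", "C"), ("G", "G"), ("A", "T"), ("A", "A")]
pool = ["AGAA", "ATAT", "CGTA", "CTTT"]   # haplotypes of other individuals
het = het_positions(genotype)

all_sites = count_matches("AGAA", pool, range(4))   # compare every position
het_only = count_matches("AGAA", pool, het)         # compare positions 0 and 2 only
```

Here the candidate AGAA matches one pool haplotype when every position is compared and two when only the heterozygous positions are used, so ignoring homozygous positions makes a match easier to find.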
27Challenges
- Haplotype inference for short regions (<10 kb) is quite accurate using EM or Clark's algorithm. Mutation and recombination events are rare. All haplotypes are independent.
- The challenging case is long regions (>100 kb). It is almost impossible to infer the entire haplotypes correctly, so the performance measure is the number of switch errors made.
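Counting switch errors can be sketched as follows, assuming the true haplotypes are known (as in a simulation). The function name `switch_errors` is hypothetical. A switch error is counted each time the inferred phase flips relative to the truth between two consecutive heterozygous sites.

```python
def switch_errors(true_pair, inferred_pair):
    """Count switch errors between a true and an inferred haplotype pair.

    Both pairs must describe the same genotype. Only heterozygous sites
    carry phase information.
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    # For each heterozygous site, record whether inferred haplotype 1
    # carries the allele of true haplotype 1.
    same = [i1[k] == t1[k] for k in het]
    # A switch error occurs whenever that relationship changes between
    # consecutive heterozygous sites.
    return sum(1 for a, b in zip(same, same[1:]) if a != b)

truth = ("AGTA", "CTAT")
inferred = ("AGAT", "CTTA")
n_switch = switch_errors(truth, inferred)
```

In this example the inferred phase agrees with the truth over the first two sites and is flipped over the last two, giving one switch error even though neither inferred haplotype is entirely correct.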
28Haplotype inference problem
- For small M, mutation and recombination events are rare. Haplotypes are inherited from ancestors and can be regarded as distinct.
- This is not true for large M or, strictly speaking, across large genetic distances. Mutations and recombinations cause some haplotypes to be related.
29HAP
- Based on the perfect phylogeny model,
- Halperin and Eskin 2003.
30The algorithm
- Not all haplotypes fit the perfect phylogeny model. There may be conflicts.
- The algorithm infers the different relations between pairs of sites.
- The algorithm produces a set of candidate solutions that roughly fit the perfect phylogeny model. Maximum likelihood (ML) is used to choose the best solution.
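A conflict with the perfect phylogeny model can be detected with the four-gamete test on pairs of sites. The sketch below applies it to already phased binary haplotypes; it is a standard check shown for illustration, not HAP's own procedure (which works from genotypes), and the function name `conflicting_site_pairs` is hypothetical.

```python
from itertools import combinations

def conflicting_site_pairs(haplotypes):
    """Return pairs of sites that show all four gametes (binary sites only)."""
    n_sites = len(haplotypes[0])
    conflicts = []
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) == 4:
            conflicts.append((a, b))
    return conflicts

fits = conflicting_site_pairs(["000", "010", "011", "110"])
clashes = conflicting_site_pairs(["00", "01", "10", "11"])
```

A set of binary haplotypes admits a perfect phylogeny exactly when no pair of sites shows all four gametes, so an empty result means the haplotypes fit the model.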
31The argument about the prior
- Dirichlet prior vs. approximate coalescent prior (ACP).
- Dirichlet prior with the same pseudo-counts.
- Non-informative prior.
- ACP is an informative prior, based on population genetics theory.
- The impact of the priors.
32Challenges
- Haplotype inference for short regions (<10 kb) is quite accurate: the EM model holds, and it is fast and accurate.
- The most challenging case is long regions (>100 kb). It is almost impossible to infer the entire haplotypes correctly, so the performance measure is the number of switch errors made.
- Need to consider mutation and recombination.
33Challenges
- Need to revise the haplotype frequency model, since every haplotype will be distinct, so each has a count of 1.
- Stephens and Donnelly 2003: "Whatever one's view on the accuracy of the coalescent as a model for real data, it is difficult to imagine any actual population sample where guessing the haplotypes at random will be more accurate than choosing haplotypes that are similar to others in the sample."
- Measure the distance between haplotypes, and give haplotypes that are closer to known ones a higher chance of being selected.
34Comparison results
35Comparison results
36Impact of HWE Assumption
Simulated scenarios:
- (1) Neutral
- (2) Moderate heterozygote favoring
- (3) Strong heterozygote favoring
- (4) Moderate homozygote favoring
- (5) Strong homozygote favoring
Results:
- Panel A: (1), (2), (3)
- Panel B: (1), (4), (5)
- Panel C: (1), (2), (3), (4), (5)
- Panel D: (1), (2), (3), (4), (5) (zoom-in view of the left tail of C)
37Extend to Family Data
- Father: A/a B/B c/c
- Mother: A/a b/b c/c
- Child: A/a B/b c/c
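How a trio constrains the child's phase can be sketched locus by locus. This is a minimal illustration using Mendelian transmission only; the function name `phase_child` is hypothetical, and real family-based methods handle linkage across loci and missing data, which this sketch does not.

```python
def phase_child(father, mother, child):
    """Resolve the child's phase locus by locus from a parent-child trio.

    Each argument is a list of (allele, allele) genotypes. Returns a list of
    (paternal_allele, maternal_allele) tuples, with None where the trio
    does not determine the phase.
    """
    phased = []
    for f, m, c in zip(father, mother, child):
        options = set()
        # Try both assignments of the child's two alleles to the parents.
        for pat, mat in (c, c[::-1]):
            if pat in f and mat in m:
                options.add((pat, mat))
        phased.append(options.pop() if len(options) == 1 else None)
    return phased

father = [("A", "a"), ("B", "B"), ("c", "c")]
mother = [("A", "a"), ("b", "b"), ("c", "c")]
child = [("A", "a"), ("B", "b"), ("c", "c")]
result = phase_child(father, mother, child)
```

For the family on this slide, the second locus is resolved (B from the father, b from the mother) and the third is trivially c/c, but the first locus stays ambiguous because all three family members are heterozygous there.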
38Thank You
39Remarks
- A new model that allows for recombination and decay of linkage disequilibrium (LD) with distance has recently been implemented. The program also allows the user to estimate recombination rates and identify recombination hotspots from population genotype data, and to perform a test for haplotype frequency differences between cases and controls.
40Remarks
- The new version of PHASE also considers recombination in addition to mutation in the coalescent model, using the Li and Stephens (2003) model to achieve this.
- New haplotypes are modeled as a mosaic of known haplotypes.
- Another new program, fastPHASE, was recently proposed (Scheet and Stephens 2006).
- It is different from PHASE.
41Ideas to tackle the large M problem
- SNPHAP starts by fitting 2-locus haplotypes and extends the solution by one locus at a time.
42Other issues
- Pooled DNA data.
- Uncertainty in haplotype inference.