Haplotype inference

Provided by: sphU
1
Haplotype inference
  • Biostat 666
  • Winter 2006

2
Haplotype inference problem
  • Marker:        1    2    3    ...  M
  • Individual 1:  A/C  G/T  A/A  ...  A/A
  • Individual 2:  A/A  G/G  A/T  ...  A/A
  • ...
  • Individual N:  A/C  T/T  A/A  ...  A/T

3
Haplotype inference problem
  • Marker:        1  2  3  ...  M
  • Individual 1:  A  G  A  ...  A
  •                C  T  A  ...  A
  • Individual 2:  A  G  T  ...  A
  •                A  G  A  ...  A
  • ...
  • Individual N:  A  T  A  ...  T
  •                C  T  A  ...  A

4
Haplotype inference problem
  • N: sample size; not a serious concern.
  • M: number of markers; the main source of worry.
  • Large M causes numerical problems for EM: too many
    haplotypes need to be tracked.
  • Large M produces many distinct haplotypes, so
    unambiguous individuals are rare and Clark's
    algorithm has difficulty getting started.
  • Solutions: PHASE, SNPHAP, PL, HAP, ...
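
To make the "large M" worry concrete: an individual who is heterozygous at k markers is compatible with 2^(k-1) distinct haplotype pairs. The sketch below enumerates them. The function name and the genotype encoding (a list of allele 2-tuples) are illustrative assumptions, not part of the lecture.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a list of 2-tuples of alleles, one per marker,
    e.g. [("A", "C"), ("G", "T"), ("A", "A")]  (assumed encoding).
    """
    het = [j for j, (a, b) in enumerate(genotype) if a != b]
    if not het:
        # Fully homozygous: a single pair of identical haplotypes.
        h = tuple(a for a, _ in genotype)
        return [(h, h)]
    pairs = []
    # Fix the orientation of the first heterozygous site so that each
    # unordered pair is counted once; the other het sites may flip freely.
    for flips in product([False, True], repeat=len(het) - 1):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for j, (a, b) in enumerate(genotype):
            if flip_at.get(j, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append((tuple(h1), tuple(h2)))
    return pairs
```

For a genotype such as A/C G/T A/A A/A this returns two candidate pairs; with 20 heterozygous markers the list would already have 2^19 entries, which is why full enumeration stops being practical.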

5
PHASE
  • So far, very accurate, but also complicated.
  • Based on the coalescent model.
  • Samples from the conditional distribution
    Pr(H_i | G, H_{-i}) using an approximation to a
    general mutation model (Stephens and Donnelly 2000).

6
Example
  • The next haplotype is likely to look either
    exactly the same as, or similar to, a haplotype
    that has already been observed.

7
Algorithm
  • Start with some initial haplotype reconstruction
    H(0). For t = 0, 1, 2, ..., obtain H(t+1) from H(t)
    using the following three steps:
  • 1. Choose an individual, i, uniformly at random
    from all ambiguous individuals (i.e., individuals
    with more than one possible haplotype
    reconstruction).
  • 2. Sample H_i(t+1) from Pr(H_i | G, H_{-i}(t)),
    where H_{-i} is the set of haplotypes excluding
    individual i.
  • 3. Set H_j(t+1) = H_j(t) for j = 1, ..., n, j ≠ i.
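
The three steps can be sketched as a short loop. This is only an illustration of the update scheme: the conditional used in step 2 is a simple stand-in (weights proportional to the product of haplotype counts plus a pseudo-count), not the coalescent-based approximation that PHASE uses. The function name, the `candidates` input, and the `pseudo` parameter are assumptions.

```python
import random
from collections import Counter

def gibbs_phase(candidates, n_iter=1000, pseudo=0.1, seed=0):
    """Run the three-step update on a list of candidate haplotype pairs.

    candidates[i] is the list of haplotype pairs compatible with
    individual i. Returns the final reconstruction H (one pair each).
    """
    rng = random.Random(seed)
    # H(0): an arbitrary initial reconstruction.
    H = [rng.choice(c) for c in candidates]
    ambiguous = [i for i, c in enumerate(candidates) if len(c) > 1]
    if not ambiguous:
        return H
    counts = Counter(h for pair in H for h in pair)
    for _ in range(n_iter):
        # Step 1: choose an ambiguous individual uniformly at random.
        i = rng.choice(ambiguous)
        # Remove individual i's haplotypes to obtain the counts in H_{-i}.
        for h in H[i]:
            counts[h] -= 1
        # Step 2: sample H_i given G and H_{-i}.
        # ASSUMPTION: stand-in conditional, weight = (n_a + pseudo)(n_b + pseudo).
        weights = [(counts[a] + pseudo) * (counts[b] + pseudo)
                   for a, b in candidates[i]]
        H[i] = rng.choices(candidates[i], weights=weights, k=1)[0]
        # Step 3: all other individuals keep their haplotypes; restore counts.
        for h in H[i]:
            counts[h] += 1
    return H
```

Swapping in a different conditional only changes the line that computes `weights`; the surrounding loop is the part described on this slide.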

8
Details
  • Informally, this corresponds to the next sampled
    haplotype, h, being obtained by applying a random
    number of mutations, s, to a randomly chosen
    existing haplotype, a, where s is sampled from a
    geometric distribution.
  • The approximation formula arose from consideration
    of the distribution of the genealogy relating
    randomly sampled individuals, as described by the
    coalescent.
  • In particular, future-sampled chromosomes will
    tend to be more similar to previously sampled
    chromosomes as the sample size r increases and as
    the mutation rate θ decreases.
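
A minimal sketch of this informal description, for biallelic markers coded 0/1. It assumes the geometric distribution takes the form P(s) = (θ/(r+θ))^s · r/(r+θ), which is the form usually quoted for this approximation, and it assumes each mutation flips the allele at a uniformly chosen site. Both choices, and the function name, are illustrative assumptions.

```python
import random

def sample_next_haplotype(existing, theta, rng=random):
    """Draw one new haplotype given the list of `existing` haplotypes.

    Haplotypes are tuples of 0/1 alleles; `theta` is the scaled mutation
    rate. Returns (haplotype, number_of_mutations).
    """
    r = len(existing)
    # Copy a haplotype chosen uniformly from the sample, so a type that
    # appears r_a times is chosen with probability r_a / r.
    h = list(rng.choice(existing))
    # Geometric number of mutations: continue with prob. theta / (r + theta).
    p_mut = theta / (r + theta)
    s = 0
    while rng.random() < p_mut:
        s += 1
    # ASSUMPTION: each mutation flips the allele at a uniformly chosen site.
    for _ in range(s):
        j = rng.randrange(len(h))
        h[j] = 1 - h[j]
    return tuple(h), s
```

As r grows or θ shrinks, `p_mut` goes to zero and the new haplotype is almost always an exact copy of one already seen, which matches the last point on the slide.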

9
Gibbs Sampler
Geman and Geman 1984; Gelfand and Smith 1990
10
Gibbs sampler
  • Want to sample from P(H_1, H_2, ..., H_N).
  • Sample from P(H_1 | G, H_2, ..., H_N).
  • Sample from P(H_2 | G, H_1, H_3, ..., H_N).
  • ...
  • Sample from P(H_N | G, H_1, ..., H_{N-1}).
  • For large M, update only a subset of the loci of
    an individual: split H_i into H_i(S) and H_i(-S),
    where S is a subset of the ambiguous loci for
    individual i.

11
Remarks
  • It is a pseudo-Gibbs sampling scheme, since only
    the conditional distributions can be written
    down. The joint distribution is not known, so the
    sampler runs the risk of divergence.
  • Okay for variables taking finitely many discrete
    values.

12
Haplotyper
  • Another Bayesian model-based algorithm. Uses the
    EM model with a Dirichlet prior.
  • The joint distribution can be expressed as
    follows:
  • Sample a pair of compatible haplotypes for each
    subject according to
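
For reference, one common way to write a Dirichlet-multinomial haplotype model under Hardy-Weinberg equilibrium is sketched below. The notation is an assumption and may differ from the slide's own: Y_i is the genotype of subject i, Z_i = (z_{i1}, z_{i2}) a haplotype pair compatible with Y_i, Θ = (θ_1, ..., θ_K) the haplotype frequencies, and β_k the Dirichlet pseudo-counts.

```latex
% Assumed notation; (g', h') \sim Y_i means "pair compatible with genotype Y_i".
\begin{align}
P(Y, Z, \Theta) &\propto \prod_{k=1}^{K} \theta_k^{\beta_k - 1}
  \prod_{i=1}^{N} \theta_{z_{i1}} \theta_{z_{i2}},
  \qquad Z_i \text{ compatible with } Y_i, \\
P\bigl(Z_i = (g, h) \mid \Theta, Y_i\bigr) &=
  \frac{\theta_g \theta_h}{\sum_{(g', h') \sim Y_i} \theta_{g'} \theta_{h'}} .
\end{align}
```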

13
Partition-Ligation
14
Partition-Ligation
15
Within Segment
  • EM or Gibbs Sampler
  • Haplotype and Frequency only
  • Retain top ones only
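
A minimal sketch of the within-segment EM option, followed by the "retain top ones" step. It assumes Hardy-Weinberg equilibrium and that the compatible haplotype pairs of each subject have already been enumerated; the function names and the `candidates` input are hypothetical.

```python
from collections import defaultdict

def em_haplotype_freqs(candidates, n_iter=50):
    """Estimate haplotype frequencies within one segment by EM.

    candidates[i]: list of unordered haplotype pairs (h1, h2) compatible
    with subject i. Returns a dict mapping haplotype -> frequency.
    """
    haps = {h for pairs in candidates for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    n_chrom = 2 * len(candidates)
    for _ in range(n_iter):
        expected = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior weight of each pair, proportional to
            # f(h1) * f(h2) under HWE.
            w = [freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                expected[a] += wi / total
                expected[b] += wi / total
        # M-step: frequency = expected haplotype count / chromosomes.
        freq = {h: expected[h] / n_chrom for h in haps}
    return freq

def top_haplotypes(freq, b):
    """Keep the b haplotypes with the highest estimated frequency."""
    return sorted(freq, key=freq.get, reverse=True)[:b]
```

Only the retained haplotypes and their frequencies are passed on to the next stage, which is what keeps the search space small.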

16
Partition-Ligation
17
Ligation Step
[Figure: Left Segment and Right Segment]
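
One way to picture ligation, as a sketch under assumptions: each segment contributes only its retained haplotypes, candidate full-length haplotypes are all left-right concatenations, and each subject's compatible pairs are then looked up among those candidates. The function names and data layout are hypothetical.

```python
from itertools import product

def ligate(left_top, right_top):
    """Join every retained left haplotype to every retained right one.

    Haplotypes are tuples of alleles, so the result has at most
    len(left_top) * len(right_top) candidates.
    """
    return [l + r for l, r in product(left_top, right_top)]

def compatible_ligated_pairs(genotype, candidates):
    """Pairs of candidate haplotypes that reproduce an unphased genotype.

    `genotype` is a list of allele 2-tuples, one per marker.
    """
    target = [tuple(sorted(g)) for g in genotype]
    pairs = []
    for x in range(len(candidates)):
        for y in range(x, len(candidates)):
            h1, h2 = candidates[x], candidates[y]
            if [tuple(sorted(p)) for p in zip(h1, h2)] == target:
                pairs.append((h1, h2))
    return pairs
```

With B haplotypes kept per segment, the ligated segment has at most B^2 candidates instead of every haplotype possible over the combined markers.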
18
Partition-Ligation
19
Progressive Ligation
20
Related to SNPHAP
  • SNPHAP is an EM-based algorithm developed by David
    Clayton.
  • SNPHAP starts by fitting 2-locus haplotypes and
    extends the solution one locus at a time.
  • This solves the large M problem, but not
    efficiently. Progressive ligation will be more
    efficient.

21
Remarks
  • The idea of PL is to reduce the search space when
    searching for the ML solution: discard partial
    solutions that have little hope of being part of
    the ML estimate.
  • Has a nice biological interpretation: haplotype
    blocks. It has been shown (Niu et al. 2002) that
    partitioning at recombination hot spots improves
    accuracy compared with cutting at random.

22
Haplotype Block
  • Blocks, recombination hot spots, htSNP
  • Daly et al., Nat Genet 2001
  • Patil et al., Science 2001, chromosome 21
  • Dawson et al., Nature 2002, chromosome 22
  • Zhang et al., PNAS 2002
  • Jeffreys et al., Nat Genet 2001

23
Factors Influencing the Performance
  • Partition sites
      • Partitioning at recombination hotspots
        improves the performance marginally.
      • Allow the user to specify desirable partition
        points.
  • Atomistic unit size
      • Little difference in performance.
      • The computation time increased sharply when
        the coarsest partition was used (K = 5 to 8
        appeared to be a good choice).
  • Buffer size
      • Increasing the buffer size improves the
        performance.

24
Remarks
  • How to decide on segment boundaries, the optimal
    size of each segment, and how many haplotypes to
    keep for the next round are all open questions.
    They are not yet well addressed and may affect
    the results.

25
Other techniques used
  • Predictive updating, to integrate out the nuisance
    parameter T.
  • Prior annealing: a large pseudo-count at the
    beginning, a small pseudo-count near the end, and
    a decrescendo in between.

26
Lin, Cutler and Chakravarti
  • Similar to PHASE.
  • Looks for matches only at positions where the
    individual is heterozygous, ignoring the data at
    positions where the individual is homozygous.
  • The benefit is that the algorithm never reaches
    the situation where no matching haplotypes
    exist, and it therefore avoids choosing randomly
    between all possible reconstructions.

27
Challenges
  • Haplotype inference for short regions (< 10 kb) is
    quite accurate using EM or Clark's algorithm.
    Mutation and recombination events are rare. All
    haplotypes are independent.
  • The challenging case is long regions (> 100 kb).
    It is almost impossible to infer the entire
    haplotypes correctly, so the performance measure
    is the number of switch errors made.
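
A small sketch of how switch errors can be counted for one individual. It assumes the inferred pair is consistent with the same genotype as the true pair; the function name and the tuple encoding are illustrative.

```python
def switch_errors(true_pair, inferred_pair):
    """Count switch errors between a true and an inferred haplotype pair.

    Only heterozygous sites matter. At each het site we record whether
    the inferred first haplotype agrees with the true first haplotype;
    every change of that indicator between consecutive het sites counts
    as one switch error.
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [j for j in range(len(t1)) if t1[j] != t2[j]]
    match = [i1[j] == t1[j] for j in het]
    return sum(1 for a, b in zip(match, match[1:]) if a != b)
```

A reconstruction that is wrong in a single place along a long region scores one switch error, even though neither full-length haplotype is correct, which is why this measure is preferred over "fraction of individuals phased perfectly" for long regions.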

28
Haplotype inference problem
  • For small M, mutation and recombination events
    are rare. Haplotypes are inherited from
    ancestors, and can be regarded as distinct.
  • Not true for large M, or strictly speaking,
    across large genetic distance. Mutations and
    recombinations cause some haplotypes to be
    related.

29
HAP
  • Based on the perfect phylogeny model.
  • Halperin and Eskin 2003.

30
The algorithm
  • Not all haplotypes fit the perfect phylogeny
    model. There may be conflicts.
  • The algorithm infers the different relations
    between pairs of sites.
  • The algorithm produces a set of candidate
    solutions that roughly fit the perfect phylogeny
    model. Maximum likelihood (ML) is used to choose
    the best solution.

31
The argument about the prior
  • Dirichlet prior vs. approximate coalescent prior
    (ACP).
  • A Dirichlet prior with equal pseudo-counts is a
    non-informative prior.
  • The ACP is an informative prior, based on
    population genetics theory.
  • The impact of the priors.

32
Challenges
  • Haplotype inference for short regions (< 10 kb) is
    quite accurate: the EM model holds, and it is
    fast and accurate.
  • The most challenging case is long regions
    (> 100 kb). It is almost impossible to infer the
    entire haplotypes correctly, so the performance
    measure is the number of switch errors made.
  • Need to consider mutation and recombination.

33
Challenges
  • Need to revise the haplotype frequency model,
    since over long regions every haplotype will be
    distinct and so appears only once in the sample.
  • Stephens and Donnelly 2003: "Whatever one's view
    on the accuracy of the coalescent as a model for
    real data, it is difficult to imagine any actual
    population sample where guessing the haplotypes
    at random will be more accurate than choosing
    haplotypes that are similar to others in the
    sample."
  • Measure the distance between haplotypes, and give
    haplotypes that are closer to known ones a higher
    chance of being selected.

34
Comparison results
35
Comparison results
36
Impact of HWE Assumption
  • Simulated scenarios: (1) neutral; (2) moderate
    heterozygote favoring; (3) strong heterozygote
    favoring; (4) moderate homozygote favoring;
    (5) strong homozygote favoring.
  • Results: Panel A: (1), (2), (3). Panel B: (1),
    (4), (5). Panel C: (1), (2), (3), (4), (5).
    Panel D: (1), (2), (3), (4), (5) (zoom-in view of
    the left tail of C).

37
Extend to Family Data

  • Father: A/a B/B c/c
  • Mother: A/a b/b c/c
  • Child:  A/a B/b c/c

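
The trio on this slide shows both what family data can and cannot resolve. The sketch below checks each locus separately and reports which allele the child received from each parent when that is determined. The function name and the genotype encoding are illustrative assumptions, and linkage between loci is ignored.

```python
def phase_child(father, mother, child):
    """Per-locus phasing of a child from parental genotypes.

    Each genotype is a list of 2-tuples of alleles. Returns one entry per
    locus: (paternal_allele, maternal_allele), or None when the trio does
    not determine the transmission (or is Mendelian-inconsistent).
    """
    phased = []
    for f, m, c in zip(father, mother, child):
        options = set()
        # Try both assignments of the child's two alleles to the parents.
        for pat, mat in (c, c[::-1]):
            if pat in f and mat in m:
                options.add((pat, mat))
        phased.append(options.pop() if len(options) == 1 else None)
    return phased
```

For the trio above, the second locus resolves to B from the father and b from the mother and the third is trivially c/c, while the first locus stays ambiguous because both parents and the child are A/a.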
38
Thank You
39
Remarks
  • A new model that allows for recombination and
    decay of linkage disequilibrium (LD) with
    distance has recently been implemented. The
    program also allows the user to estimate
    recombination rates, to identify recombination
    hotspots from population genotype data, and to
    perform a test for haplotype frequency
    differences between cases and controls.

40
Remarks
  • The new version of PHASE also considers
    recombination, in addition to mutation, in the
    coalescent model. It uses the Li and Stephens
    (2003) model to achieve this.
  • New haplotypes are modeled as mosaics of known
    haplotypes.
  • Another new program, fastPHASE, was recently
    proposed (Scheet and Stephens 2006).
  • Different from PHASE.

41
Ideas to tackle the large M problem
  • SNPHAP starts by fitting 2-locus haplotypes and
    extends the solution one locus at a time.

42
Other issues
  • Pooled DNA data.
  • Uncertainty in haplotype inference.