1
Haplotype inference

Biostat 666
Winter 2006

2
Haplotype inference problem

      1    2    3    ...  M
1.    A/C  G/T  A/A  ...  A/A
2.    A/A  G/G  A/T  ...  A/A
...
N.    A/C  T/T  A/A  ...  A/T

3
Haplotype inference problem

      1  2  3  ...  M
1.    A  G  A  ...  A
      C  T  A  ...  A
2.    A  G  T  ...  A
      C  G  A  ...  A
...
N.    A  T  A  ...  T
      C  T  A  ...  A
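
The step from genotypes to haplotype pairs can be sketched in code. This is a minimal illustration, not part of the original slides: the function name `compatible_pairs` and the tuple-based genotype encoding are assumptions. A genotype with k heterozygous markers has 2^(k-1) compatible unordered haplotype pairs (one pair if k = 0), which is why the problem grows quickly with M.

```python
from itertools import product

def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-tuples of alleles, one per marker, e.g. ("A", "C").
    (Hypothetical encoding chosen for this sketch.)
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    # Fix the phase of the first heterozygous marker so that (h1, h2) and
    # (h2, h1) are not both listed; only the remaining markers are free.
    free = het[1:]
    pairs = []
    for flips in product([False, True], repeat=len(free)):
        h1 = [a for a, _ in genotype]
        h2 = [b for _, b in genotype]
        for site, flip in zip(free, flips):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

# Individual 1 from the genotype slide: A/C G/T A/A A/A
g1 = [("A", "C"), ("G", "T"), ("A", "A"), ("A", "A")]
print(compatible_pairs(g1))
# [('AGAA', 'CTAA'), ('ATAA', 'CGAA')]
```

The first pair printed is the reconstruction shown for individual 1; the second is the alternative phasing that the data alone cannot rule out.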

4
Haplotype inference problem

N: sample size, not a serious concern.
M: number of markers, the source of worry.
Large M causes numerical problems for EM: too many
haplotypes need to be tracked.
Large M produces many distinct haplotypes, so
unambiguous haplotypes are rare and Clark's
algorithm has difficulty getting started.
Solutions: PHASE, SNPHAP, PL, HAP, ...

5
PHASE

So far, very accurate, but also complicated.
Based on the coalescent model.
Samples from the conditional distribution
Pr(H_i | G, H_-i) using an approximation based on
a general mutation model (Stephens and Donnelly 2000).

6
Example

The next haplotype is likely to look either
exactly the same as or similar to a haplotype
that has been observed

7
Algorithm

Start with some initial haplotype reconstruction
H(0). For t = 0, 1, 2, ..., obtain H(t+1) from H(t)
using the following three steps:
Choose an individual, i, uniformly and at random
from all ambiguous individuals (i.e., individuals
with more than one possible haplotype
reconstruction).
Sample H_i(t+1) from Pr(H_i | G, H_-i(t)), where
H_-i is the set of haplotypes excluding individual
i.
Set H_j(t+1) = H_j(t) for j = 1, ..., n, j ≠ i.

8
Details

Informally, this corresponds to the next sampled
haplotype, h, being obtained by applying a random
number of mutations, s, to a randomly chosen
existing haplotype, a, where s is sampled from
a geometric distribution. The approximation
formula arose from consideration of the
distribution of the genealogy relating randomly
sampled individuals, as described by the
coalescent. In particular, future-sampled
chromosomes will tend to be more similar to
previously sampled chromosomes as the sample size
r increases and as the mutation rate θ decreases.
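
The informal description above can be sketched as a small sampler. This is an illustration under stated assumptions, not the PHASE implementation: SNPs are coded 0/1, a mutation flips the allele at a uniformly chosen site (a simplification of the general mutation model), and s is assumed to follow Pr(s) = (θ/(r+θ))^s · r/(r+θ), which matches the remark that new haplotypes look more like existing ones as r grows and θ shrinks. The name `sample_next_haplotype` is hypothetical.

```python
import random

def sample_next_haplotype(haplotypes, theta, rng):
    """Draw one new haplotype given r previously sampled haplotypes.

    haplotypes: list of 0/1 tuples, with repeats, so a uniform choice is
                proportional to each haplotype's count.
    theta:      mutation rate parameter (assumed scaling, see lead-in).
    """
    r = len(haplotypes)
    base = list(rng.choice(haplotypes))   # randomly chosen existing haplotype
    p_stop = r / (r + theta)              # geometric "success" probability
    s = 0
    while rng.random() >= p_stop:         # count failures before the first success
        s += 1
    for _ in range(s):                    # apply s mutations (assumed: flip one site)
        site = rng.randrange(len(base))
        base[site] = 1 - base[site]
    return tuple(base), s

rng = random.Random(1)
observed = [(0, 0, 1), (0, 0, 1), (1, 1, 0)]
new_hap, n_mut = sample_next_haplotype(observed, theta=0.5, rng=rng)
print(new_hap, n_mut)
```

With theta = 0 the sampler always returns an exact copy of an observed haplotype; larger theta produces more mutated copies.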

9
Gibbs Sampler
Geman and Geman 1984; Gelfand and Smith 1990
10
Gibbs sampler

Want to sample from P(H_1, H_2, ..., H_N).
Sample H_1 from P(H_1 | G, H_2, ..., H_N).
Sample H_2 from P(H_2 | G, H_1, H_3, ..., H_N).
...
Sample H_N from P(H_N | G, H_1, ..., H_N-1).
For large M, update only a subset of the loci of
an individual: H(S), H(-S), where S is a subset of
the ambiguous loci for individual i.

11
Remarks

It is a pseudo-Gibbs sampler scheme, since only the
conditional distributions can be written down and
the joint distribution is not known. It runs the
risk of divergence.
Okay for variables taking finitely many discrete values.

12
Haplotyper

Another Bayesian model-based algorithm. Uses the
EM model with a Dirichlet prior.
The joint distribution can be expressed as
follows
Sample a pair of compatible haplotypes for each
subject according to
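
The sampling step can be sketched as follows. This is a hedged illustration rather than the Haplotyper code: it assumes the pair (g, h) for a subject is drawn with weight (n_g + β)(n_h + β), where n_g and n_h are haplotype counts among all other subjects and β is a symmetric Dirichlet pseudo-count (the frequencies themselves are integrated out). The function names and the 0/1 genotype encoding are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype.
    genotype: tuple of per-site allele pairs, e.g. ((0, 1), (0, 1))."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [a for a, _ in genotype]
        h2 = [b for _, b in genotype]
        for site, flip in zip(het[1:], flips):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def gibbs_phase(genotypes, beta=0.1, sweeps=200, seed=0):
    """Gibbs sampler over haplotype pairs with a Dirichlet(beta) prior (assumed form)."""
    rng = random.Random(seed)
    options = [compatible_pairs(g) for g in genotypes]
    state = [rng.choice(opts) for opts in options]        # initial reconstruction
    counts = Counter(h for pair in state for h in pair)   # haplotype counts
    for _ in range(sweeps):
        for i, opts in enumerate(options):
            if len(opts) == 1:                            # unambiguous subject
                continue
            for h in state[i]:                            # remove subject i
                counts[h] -= 1
            weights = [(counts[g] + beta) * (counts[h] + beta) for g, h in opts]
            state[i] = rng.choices(opts, weights=weights, k=1)[0]
            for h in state[i]:                            # add subject i back
                counts[h] += 1
    return state

# Six homozygous subjects plus one double heterozygote.
genotypes = [((0, 0), (0, 0))] * 3 + [((1, 1), (1, 1))] * 3 + [((0, 1), (0, 1))]
print(gibbs_phase(genotypes)[-1])
```

The returned state is a single draw; in practice one would tally the pairs visited over many sweeps and report the most frequent one per subject.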

13
Partition-Ligation
14
Partition-Ligation
15
Within Segment

EM or Gibbs Sampler
Haplotype and Frequency only
Retain top ones only
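
A within-segment EM step can be sketched as below. This is a minimal illustration assuming Hardy-Weinberg equilibrium and a 0/1 allele coding; the names `em_haplotype_frequencies` and `compatible_pairs` are hypothetical, and no claim is made that any of the programs discussed here is implemented this way.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype
    (tuple of per-site allele pairs)."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [a for a, _ in genotype]
        h2 = [b for _, b in genotype]
        for site, flip in zip(het[1:], flips):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def em_haplotype_frequencies(genotypes, iterations=50):
    """EM estimate of haplotype frequencies from unphased genotypes."""
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}             # uniform start
    n_chrom = 2 * len(genotypes)
    for _ in range(iterations):
        expected = defaultdict(float)
        for opts in options:
            # E-step: posterior weight of each compatible pair under HWE
            w = [freq[g] * freq[h] * (1 if g == h else 2) for g, h in opts]
            total = sum(w)
            for (g, h), wi in zip(opts, w):
                expected[g] += wi / total
                expected[h] += wi / total
        # M-step: expected counts divided by the number of chromosomes
        freq = {h: expected[h] / n_chrom for h in haps}
    return freq

genotypes = [((0, 0), (0, 0))] * 3 + [((1, 1), (1, 1))] * 3 + [((0, 1), (0, 1))]
freq = em_haplotype_frequencies(genotypes)
top = sorted(freq, key=freq.get, reverse=True)[:2]        # "retain top ones only"
print(top)
```

Only the haplotypes and their frequencies are carried forward, and the truncation to the top few is what keeps the later ligation step manageable.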

16
Partition-Ligation
17
Ligation Step
[Figure: a left segment and a right segment being ligated]
18
Partition-Ligation
19
Progressive Ligation
20
Related to SNPHAP

SNPHAP is an EM-based algorithm developed by David
Clayton.
SNPHAP starts by fitting 2-locus haplotypes and
extends the solution by one locus at a time.
It solves the large M problem, but not efficiently.
Progressive ligation will be more efficient.

21
Remarks

The idea of PL is to reduce the search space when
searching for the ML solution: kill partial solutions
that have little hope of being part of the ML
estimates.
Has a nice biological interpretation: haplotype
blocks. It has been shown (Niu et al. 2002) that
partitioning at recombination hot spots results
in improved accuracy compared with cutting at random.
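
How partial solutions are pruned before two segments are joined can be sketched as follows. This shows candidate generation only, under assumptions that are not from the slides: partial haplotypes are strings, each segment supplies a frequency table, and the product of segment frequencies is used as a starting weight (that is, linkage equilibrium between segments is assumed for initialization only). The name `ligate_candidates` and the parameter `keep` are hypothetical.

```python
from itertools import product

def ligate_candidates(left, right, keep):
    """Build candidate full haplotypes from the top partial haplotypes.

    left, right: dict mapping partial haplotype (str) -> estimated frequency.
    keep:        number of top partial haplotypes retained per segment.
    Returns a dict of candidate haplotype -> normalized starting weight.
    """
    top_left = sorted(left, key=left.get, reverse=True)[:keep]
    top_right = sorted(right, key=right.get, reverse=True)[:keep]
    # At most keep * keep candidates instead of every left-right combination.
    cand = {l + r: left[l] * right[r] for l, r in product(top_left, top_right)}
    total = sum(cand.values())
    return {h: w / total for h, w in cand.items()}

left = {"AG": 0.6, "CT": 0.3, "AT": 0.1}
right = {"AA": 0.7, "TA": 0.3}
print(sorted(ligate_candidates(left, right, keep=2)))
# ['AGAA', 'AGTA', 'CTAA', 'CTTA']
```

EM or Gibbs sampling would then be rerun on the ligated region, restricted to these candidates, to obtain frequencies for the joined segment.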

22
Haplotype Block

Blocks, recombination hot spots, htSNP
Daly et al., Nat Genet 2001
Jeffreys et al., Nat Genet 2001
Patil et al., Science 2001, Chromosome 21
Dawson et al., Nature 2002, Chromosome 22
Zhang et al., PNAS 2002
23
Factors Influencing the Performance

Partition Sites
  Partitioning at recombination hotspots improves the
  performance marginally
  Allow the user to specify desirable partition points
Atomistic Unit Size
  Little difference in performance
  The computation time increased sharply when the
  coarsest partition was used (K = 5-8 appeared to be
  a good choice)
Buffer Size
  Increasing the buffer size improves the
  performance

24
Remarks

How to decide on segment boundaries, the optimal
size of each segment, and how many haplotypes to
keep for the next round are all open questions.
They are not yet well addressed and may affect the results.

25
Other techniques used

Predictive updating to integrate out the nuisance
parameter Θ.
Prior annealing: large pseudo-counts at the
beginning, small pseudo-counts near the end,
decreasing gradually (decrescendo) in the middle.

26
Lin, Cutler and Chakravarti

Similar to PHASE.
Looks for matches only at positions where the
individual is heterozygous, ignoring the data at
positions where the individual is homozygous.
The benefit is that the algorithm never reaches
the situation where no matching haplotypes
exist, and it therefore avoids choosing randomly
between all possible reconstructions.

27
Challenges

Haplotype inference for short regions (<10 kb) is
quite accurate using EM or Clark's algorithm.
Mutation and recombination events are rare. All
haplotypes are independent.
The challenging case is long regions
(>100 kb). It is almost impossible to infer the
entire haplotypes correctly, so the performance
measure is the number of switch errors made.
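
Counting switch errors can be sketched as below. This is a minimal illustration, assuming the inferred pair carries the same genotype as the true pair and that haplotypes are equal-length strings; the function name `switch_errors` is hypothetical. Only heterozygous sites matter, and a switch error is counted each time the inferred phase changes relative to the truth between consecutive heterozygous sites.

```python
def switch_errors(true_pair, inferred_pair):
    """Number of switch errors between a true and an inferred haplotype pair."""
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    # For each heterozygous site: does inferred haplotype 1 carry the allele
    # of true haplotype 1 (True) or of true haplotype 2 (False)?
    phase = [i1[k] == t1[k] for k in het]
    # A switch occurs wherever this indicator changes between neighbours.
    return sum(1 for a, b in zip(phase, phase[1:]) if a != b)

truth = ("0000", "1111")
print(switch_errors(truth, ("0011", "1100")))   # 1
print(switch_errors(truth, ("0101", "1010")))   # 3
print(switch_errors(truth, ("1111", "0000")))   # 0
```

The second example shows why this measure is more informative than "whole haplotype correct or not": all three inferred pairs except the last are wrong as entire haplotypes, but they differ greatly in how wrong they are.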

28
Haplotype inference problem

For small M, mutation and recombination events
are rare. Haplotypes are inherited from
ancestors, and can be regarded as distinct.
Not true for large M, or strictly speaking,
across large genetic distance. Mutations and
recombinations cause some haplotypes to be
related.

29
HAP

Based on the perfect phylogeny model,
Halperin and Eskin 2003.

30
The algorithm

Not all haplotypes fit the perfect phylogeny
model. There may be conflicts.
This algorithm infers the different relations
between the pairs of sites.
The algorithm produces a set of candidate
solutions that roughly fit the perfect phylogeny
model. Maximum likelihood (ML) is used to choose
the best solution.
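
What a conflict between a pair of sites looks like can be sketched with the four-gamete test. This is not the HAP algorithm, which works on unphased genotypes; it is a simplified check on already-phased, biallelic haplotypes, and the function name `conflicting_site_pairs` is hypothetical. A set of binary haplotypes fits a perfect phylogeny only if no pair of sites shows all four allele combinations.

```python
from itertools import combinations

def conflicting_site_pairs(haplotypes):
    """Return the pairs of sites that violate the four-gamete condition.

    haplotypes: list of equal-length strings over two alleles per site.
    """
    n_sites = len(haplotypes[0])
    conflicts = []
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}   # observed 2-site combinations
        if len(gametes) == 4:                          # all four present: conflict
            conflicts.append((a, b))
    return conflicts

print(conflicting_site_pairs(["000", "010", "110"]))        # []
print(conflicting_site_pairs(["00", "01", "10", "11"]))     # [(0, 1)]
```

An empty result means the haplotypes are compatible with a perfect phylogeny; each reported pair points to a recombination or repeated mutation that the model cannot explain.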

31
The argument about the prior

Dirichlet prior vs. Approximate coalescent prior.
Dirichlet prior with same pseudo counts.
Non-informative prior.
ACP is an informative prior, based on population
genetics theory.
The impact of the priors.

32
Challenges

Haplotype inference for short regions (<10 kb) is
quite accurate: the EM model holds, and it is fast
and accurate.
The most challenging case is long regions
(>100 kb). It is almost impossible to infer the
entire haplotypes correctly, so the performance
measure is the number of switch errors made.
Need to consider mutation and recombination.

33
Challenges

Need to revise the haplotype frequency model,
since every haplotype will be distinct, so
each has frequency 1.
Stephens and Donnelly 2003: "Whatever one's view
on the accuracy of the coalescent as a model for
real data, it is difficult to imagine any actual
population sample where guessing the haplotypes
at random will be more accurate than choosing
haplotypes that are similar to others in the
sample."
Measure the distance between haplotypes, and give
haplotypes that are closer to known ones a higher
chance of being selected.

34
Comparison results
35
Comparison results
36
Impact of HWE Assumption
Simulated scenarios: (1) Neutral; (2) Moderate
Heterozygote Favoring; (3) Strong Heterozygote
Favoring; (4) Moderate Homozygote Favoring; (5)
Strong Homozygote Favoring
Results: Panel A: (1), (2), (3); Panel B: (1), (4),
(5); Panel C: (1), (2), (3), (4), (5); Panel D:
(1), (2), (3), (4), (5) (zoom-in view of the left
tail of C)
37
Extend to Family Data

Father  A/a  B/B  c/c
Mother  A/a  b/b  c/c
Child   A/a  B/b  c/c
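
How family data resolve phase can be sketched per locus. This is a minimal illustration using the trio above; the function name `phase_child_site` and the tuple encoding are hypothetical. At each locus the child's alleles are assigned to a paternal and a maternal origin when only one assignment is possible.

```python
def phase_child_site(father, mother, child):
    """Return (paternal_allele, maternal_allele) for the child at one locus,
    or None if the transmission is ambiguous."""
    options = {
        (p, m)
        for p in father
        for m in mother
        if sorted((p, m)) == sorted(child)
    }
    if not options:
        raise ValueError("genotypes are not Mendelian-consistent")
    return options.pop() if len(options) == 1 else None

trio = [
    (("A", "a"), ("A", "a"), ("A", "a")),   # locus 1: all heterozygous
    (("B", "B"), ("b", "b"), ("B", "b")),   # locus 2
    (("c", "c"), ("c", "c"), ("c", "c")),   # locus 3
]
print([phase_child_site(f, m, c) for f, m, c in trio])
# [None, ('B', 'b'), ('c', 'c')]
```

The second locus is resolved by the parents, while the first stays ambiguous because all three family members are heterozygous there, so population-based inference is still needed for such loci.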
38
Thank You
39
Remarks

A new model that allows for recombination and
decay of linkage disequilibrium (LD) with
distance has recently been implemented. The
program also allows the user to estimate
recombination rates, to identify recombination
hotspots from population genotype data, and to
perform a test for haplotype frequency
differences between cases and controls.

40
Remarks

The new version of PHASE also considers recombination
in addition to mutation in the coalescent model, using
the Li and Stephens (2003) model to achieve this.
New haplotypes are modeled as mosaics of known
haplotypes.
Another new program, fastPHASE, was recently
proposed (Scheet and Stephens 2006).
It is different from PHASE.

41
Ideas to tackle the large M problem