Title: METHODS FOR HAPLOTYPE RECONSTRUCTION
1METHODS FOR HAPLOTYPE RECONSTRUCTION
- Andrew Morris
- Wellcome Trust Centre for Human Genetics
- March 6, 2003
2Outline
- Haplotypes and genotypes.
- Reconstruction in pedigrees.
- Reconstruction in unrelated individuals.
- Interpretation and LD assessment.
- Two stage analyses.
3Haplotypes and genotypes (1)
1 0 0 0 1
1 1 0 0 0
11 01 00 00 01
7Haplotypes and genotypes (2)
- Individuals that are homozygous at every locus, or heterozygous at just one locus, can be resolved.
- Individuals that are heterozygous at $k$ loci are consistent with $2^{k-1}$ configurations of haplotypes.
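The count of configurations can be checked by enumeration. The Python sketch below is illustrative only: the function name `configurations` and the representation of a genotype as a list of two-character strings (for example `"01"` for a heterozygous locus) are assumptions made here, not part of the original slides.

```python
from itertools import product


def configurations(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of two-character strings, one per locus, e.g. ["11", "01"].
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    found = set()
    # Try every assignment of the two alleles at each heterozygous locus.
    for swap in product((False, True), repeat=len(het)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for locus, do_swap in zip(het, swap):
            if do_swap:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        # Sorting the pair makes (a, b) and (b, a) the same configuration.
        found.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(found)


# Genotype from the "Haplotypes and genotypes (1)" slide: two heterozygous loci.
for pair in configurations("11 01 00 00 01".split()):
    print(" / ".join(pair))
```

With two heterozygous loci the sketch lists two configurations, one of which is the pair 10001 / 11000 shown on the earlier slide.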
8Why do we need haplotypes?
- Correlation between alleles at closely linked loci.
- Fine-scale mapping studies.
- Association studies with multiple markers in candidate genes.
- Investigating patterns of LD across genomic regions.
- Inferring population histories.
9Molecular methods
- Single molecule dilution.
- Allele specific long range PCR.
- Prone to errors.
- Expensive and inefficient: low throughput.
12Simplex family data (1)
- Parents: 00 01 00 11 (M) x 01 11 01 01 (F)
- Offspring: 00 01 01 01
- Inferred offspring haplotypes: 0001 / 0110
13Simplex family data (2)
- Parents: 00 01 00 01 (M) x 01 01 00 01 (F)
- Offspring: 00 01 00 01
- Cannot be fully resolved.
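The reasoning behind the two simplex family examples can be sketched in Python. This is a minimal illustration under stated assumptions: the function name `phase_offspring`, the genotype encoding (a list of two-character strings), and the use of `?` to mark loci where the transmitted alleles cannot be determined are all choices made here, not part of the slides.

```python
def phase_offspring(parent_m, parent_f, child):
    """Deduce which allele the offspring received from each parent, locus by locus.

    Returns two strings: the haplotype inherited from parent (M) and the one
    inherited from parent (F). A '?' marks a locus where more than one
    transmission pattern is consistent with the genotypes.
    """
    hap_m, hap_f = [], []
    for gm, gf, gc in zip(parent_m, parent_f, child):
        # All (allele from M, allele from F) pairs that give the child genotype.
        options = {(a, b) for a in set(gm) for b in set(gf)
                   if sorted(a + b) == sorted(gc)}
        if not options:
            raise ValueError("genotypes are not Mendelian-consistent")
        if len(options) == 1:
            a, b = next(iter(options))
        else:
            a = b = "?"
        hap_m.append(a)
        hap_f.append(b)
    return "".join(hap_m), "".join(hap_f)


# First family: every locus is informative.
print(phase_offspring("00 01 00 11".split(),
                      "01 11 01 01".split(),
                      "00 01 01 01".split()))

# Second family: two loci where parents and offspring are all heterozygous.
print(phase_offspring("00 01 00 01".split(),
                      "01 01 00 01".split(),
                      "00 01 00 01".split()))
```

In the second family, two loci remain ambiguous because both parents and the offspring are heterozygous there, which is why the offspring cannot be fully resolved from the family data alone.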
14Pedigree data (1)
- 11 01 11 01 11 x 00 00 11 11 11
- 01 01 11 11 11 x 01 00 00 01 00
- 01 01 01 01 01, 11 01 01 01 01, 00 00 01 11 01
15Pedigree data (1)
- 11111 / 10101 x 00111 / 00111
- 11111 / 00111 x 00010 / 10000
- 11111 / 00000, 11111 / 10000, 00111 / 00010
17Pedigree data (2)
- Many combinations of haplotypes may be consistent with pedigree genotype data.
- Complex computational problem.
- Need to make assumptions about recombination.
- Software: SIMWALK and MERLIN.
18Statistical approaches to reconstruct haplotypes in unrelated individuals
- Parsimony methods: Clark's algorithm.
- Likelihood methods: E-M algorithm.
- Bayesian methods: PHASE algorithm.
- Aims: reconstruct haplotypes and/or estimate population frequencies.
19Clark's algorithm (1)
- Reconstruct haplotypes in unresolved individuals via parsimony.
- Minimise number of haplotypes observed in sample.
- Microsatellite or SNP genotypes.
20Clark's algorithm (2)
- 1. Search for resolved individuals, and record all recovered haplotypes.
- 2. Compare each unresolved individual with list of recovered haplotypes.
- 3. If a recovered haplotype is identified, individual is resolved.
- 4. Complementary haplotype added to list of recovered haplotypes.
- 5. Repeat 2-4 until all individuals are resolved or no more haplotypes can be recovered.
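The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names (`clark`, `complement`, `resolve_unambiguous`) and the data layout (a dict mapping an individual's label to a list of two-character genotype strings) are assumptions introduced here. Individuals and recovered haplotypes are visited in insertion order, which is one arbitrary choice among many.

```python
def resolve_unambiguous(genotype):
    """Return the haplotype pair if at most one locus is heterozygous, else None."""
    if sum(g[0] != g[1] for g in genotype) > 1:
        return None
    return ("".join(g[0] for g in genotype), "".join(g[1] for g in genotype))


def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = []
    for g, allele in zip(genotype, hap):
        if allele == g[0]:
            other.append(g[1])
        elif allele == g[1]:
            other.append(g[0])
        else:
            return None  # `hap` is incompatible with this genotype
    return "".join(other)


def clark(genotypes):
    """Run a simple version of Clark's algorithm.

    Returns (resolved, known): resolved maps label -> haplotype pair,
    known is the list of recovered haplotypes in order of discovery.
    """
    resolved, known = {}, []
    # Step 1: individuals that can be resolved directly.
    for name, g in genotypes.items():
        pair = resolve_unambiguous(g)
        if pair is not None:
            resolved[name] = pair
            for h in pair:
                if h not in known:
                    known.append(h)
    # Steps 2-5: keep resolving until a full pass makes no progress.
    progress = True
    while progress:
        progress = False
        for name, g in genotypes.items():
            if name in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[name] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    return resolved, known


sample = {
    "A": "00 01 01 00", "B": "00 00 00 00", "C": "00 01 00 00",
    "D": "01 11 01 11", "E": "00 11 01 01", "F": "01 11 11 00",
    "G": "00 01 11 01", "H": "00 01 01 11", "I": "00 00 00 00",
    "J": "00 00 00 11",
}
resolved, known = clark({k: v.split() for k, v in sample.items()})
for name in sorted(resolved):
    print(name, " / ".join(resolved[name]))
```

The sample data are the ten genotypes used in the worked example in this deck. Changing the order in which individuals or recovered haplotypes are visited can change the result, which is the ambiguity that the "Example problem" slides point out.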
21Example
- (A) 00 01 01 00
- (B) 00 00 00 00
- (C) 00 01 00 00
- (D) 01 11 01 11
- (E) 00 11 01 01
- (F) 01 11 11 00
- (G) 00 01 11 01
- (H) 00 01 01 11
- (I) 00 00 00 00
- (J) 00 00 00 11
23Example
- (A) 00 01 01 00
- (B) 0000 / 0000
- (C) 0000 / 0100
- (D) 01 11 01 11
- (E) 00 11 01 01
- (F) 0110 / 1110
- (G) 00 01 11 01
- (H) 00 01 01 11
- (I) 0000 / 0000
- (J) 0001 / 0001
- Recovered haplotypes
- 0000
- 0100
- 0110
- 1110
- 0001
25Example
- (A) 0000 / 0110
- (B) 0000 / 0000
- (C) 0000 / 0100
- (D) 01 11 01 11
- (E) 00 11 01 01
- (F) 0110 / 1110
- (G) 00 01 11 01
- (H) 00 01 01 11
- (I) 0000 / 0000
- (J) 0001 / 0001
- Recovered haplotypes
- 0000 0111
- 0100
- 0110
- 1110
- 0001
26Example
- (A) 0000 / 0110
- (B) 0000 / 0000
- (C) 0000 / 0100
- (D) 01 11 01 11
- (E) 0100 / 0111
- (F) 0110 / 1110
- (G) 00 01 11 01
- (H) 00 01 01 11
- (I) 0000 / 0000
- (J) 0001 / 0001
- Recovered haplotypes
- 0000 0111
- 0100 0011
- 0110
- 1110
- 0001
27Example
- (A) 0000 / 0110
- (B) 0000 / 0000
- (C) 0000 / 0100
- (D) 0111 / 1101
- (E) 0100 / 0111
- (F) 0110 / 1110
- (G) 0110 / 0011
- (H) 0001 / 0111
- (I) 0000 / 0000
- (J) 0001 / 0001
- Recovered haplotypes
- 0000 0111
- 0100 0011
- 0110 1101
- 1110
- 0001
28Example problem
- (A) 0000 / 0110
- (B) 0000 / 0000
- (C) 0000 / 0100
- (D) 01 11 01 11
- (E) 0100 / 0111
- (F) 0110 / 1110
- (G) 00 01 11 01
- (H) 00 01 01 11
- (I) 0000 / 0000
- (J) 0001 / 0001
- Recovered haplotypes
- 0000 0111
- 0100 0011
- 0110
- 1110
- 0001
29Example problem
- (A) 0000 / 0110
- (B) 0000 / 0000
- (C) 0000 / 0100
- (D) 01 11 01 11
- (E) 0100 / 0111
- (F) 0110 / 1110
- (G) 00 01 11 01
- (H) 00 01 01 11
- (I) 0000 / 0000
- (J) 0001 / 0001
- Recovered haplotypes
- 0000 0111
- 0100 0010
- 0110
- 1110
- 0001
30Clark's algorithm problems
- Multiple solutions: try many different orderings of individuals.
- No starting point for algorithm if the sample contains no resolved individuals.
- Algorithm may leave many unresolved individuals.
- How to deal with missing data?
31E-M algorithm (1)
- Maximum likelihood method for population haplotype frequency estimation.
- Allows for the fact that unresolved genotypes could be constructed from many different haplotype configurations.
- Microsatellite or SNP genotypes.
32E-M algorithm (2)
- Observed sample of $N$ individuals with genotypes, $G$.
- Unobserved population haplotype frequencies, $h$.
- Unobserved configurations, $H$, each consisting of a complementary haplotype pair $H_i = (H_{i1}, H_{i2})$.
33E-M algorithm (3)
- Likelihood:
- $f(G \mid h) = \prod_k f(G_k \mid h) = \prod_k \sum_i f(G_k \mid H_i)\, f(H_i \mid h)$
- where $f(H_i \mid h) = f(H_{i1} \mid h)\, f(H_{i2} \mid h)$ under Hardy-Weinberg equilibrium.
34E-M algorithm (4)
- Numerical algorithm used to obtain maximum likelihood estimates of $h$.
- Initial set of haplotype frequencies $h^{(0)}$.
- Haplotype frequencies $h^{(t)}$ at iteration $t$ updated from frequencies at iteration $t-1$ using Expectation and Maximisation steps.
- Continue until $h^{(t)}$ has converged.
35Expectation step
- Use haplotype frequencies, $h^{(t)}$, to calculate the probability of resolving each genotype, $G_k$, into each possible haplotype configuration, $H_i$.
- $E(H_i \mid G_k, h^{(t)}) = \dfrac{f(G_k \mid H_i)\, f(H_{i1} \mid h^{(t)})\, f(H_{i2} \mid h^{(t)})}{f(G_k \mid h^{(t)})}$
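As a worked instance of the expectation step (an illustrative case, not taken from the slides), consider two SNPs and an individual $G_k$ with genotype 01 01. Only two configurations are consistent with it: $H_1 = (00, 11)$ and $H_2 = (01, 10)$. Assuming that $f(G_k \mid H_i) = 1$ for a consistent configuration and 0 otherwise, and writing $h_{00}^{(t)}$, $h_{01}^{(t)}$, $h_{10}^{(t)}$, $h_{11}^{(t)}$ for the current haplotype frequencies, so that $f(H_{i1} \mid h^{(t)})$ is simply the frequency of haplotype $H_{i1}$:

```latex
\begin{align*}
f(G_k \mid h^{(t)}) &= h_{00}^{(t)} h_{11}^{(t)} + h_{01}^{(t)} h_{10}^{(t)}, \\
E(H_1 \mid G_k, h^{(t)}) &= \frac{h_{00}^{(t)} h_{11}^{(t)}}
                                 {h_{00}^{(t)} h_{11}^{(t)} + h_{01}^{(t)} h_{10}^{(t)}}, \\
E(H_2 \mid G_k, h^{(t)}) &= 1 - E(H_1 \mid G_k, h^{(t)}).
\end{align*}
```

For example, with $h_{00}^{(t)} = h_{11}^{(t)} = 0.4$ and $h_{01}^{(t)} = h_{10}^{(t)} = 0.1$, configuration $H_1$ receives probability $0.16 / 0.17 \approx 0.94$.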
36Maximisation step
- Compute haplotype frequencies using procedure equivalent to gene counting.
- $h_s^{(t+1)} = \dfrac{1}{2N} \sum_k \sum_i Z_{si}\, E(H_i \mid G_k, h^{(t)})$
- $Z_{si}$ = number of copies (0, 1, 2) of the $s$th haplotype in configuration $H_i$.
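The expectation and maximisation steps can be combined into a short Python sketch. It is an illustration under stated assumptions rather than any published implementation: the function names, the uniform starting frequencies, the convergence tolerance, and the genotype encoding (lists of two-character strings, no missing data) are all choices made here.

```python
from collections import defaultdict
from itertools import product


def configurations(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    found = set()
    for swap in product((False, True), repeat=len(het)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for locus, do_swap in zip(het, swap):
            if do_swap:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        found.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(found)


def em_frequencies(genotypes, max_iter=200, tol=1e-8):
    """Estimate haplotype frequencies by E-M, assuming Hardy-Weinberg equilibrium."""
    configs = [configurations(g) for g in genotypes]
    haps = sorted({h for cs in configs for pair in cs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # h(0): uniform start (assumption)
    for _ in range(max_iter):
        counts = defaultdict(float)
        for cs in configs:
            # E step: weight each configuration by the product of its
            # haplotype frequencies. The factor 2 for heterozygous pairs is
            # common to every configuration of a genotype, so it cancels.
            weights = [freq[a] * freq[b] for a, b in cs]
            total = sum(weights)
            for (a, b), w in zip(cs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M step: gene counting over the 2N haplotypes in the sample.
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        change = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if change < tol:
            break
    return freq


# Toy sample: five 00/00 homozygotes, five 11/11 homozygotes, one double heterozygote.
toy = [["00", "00"]] * 5 + [["11", "11"]] * 5 + [["01", "01"]]
for hap, f in em_frequencies(toy).items():
    print(hap, round(f, 4))
```

In the toy sample, the resolved individuals carry only haplotypes 00 and 11, so the estimates move towards placing the double heterozygote in configuration 00 / 11 and the frequencies of 01 and 10 shrink towards zero.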
37E-M algorithm comments
- Can handle missing data.
- For many loci, the number of possible haplotypes is large, so population frequencies are difficult to estimate; one remedy is re-parameterisation.
- Does not provide reconstructed haplotype configuration for unresolved individuals: can use maximum likelihood configuration.
38PHASE algorithm (1)
- Treats haplotype configuration for each unresolved individual as an unobserved random quantity.
- Evaluate the conditional distribution, given a sample of unresolved genotype data.
- Microsatellite or SNP genotypes.
- Reconstruction and population haplotype frequency estimation.
39PHASE algorithm (2)
- Bayesian framework: goal is to approximate the posterior distribution of haplotype configurations, $f(H \mid G)$.
- Implements Markov chain Monte Carlo (MCMC) methods to sample from $f(H \mid G)$: Gibbs sampling.
- Start at a random configuration.
- Repeatedly select unresolved individuals at random, and sample from their possible haplotype configurations, assuming all other individuals to be correctly resolved.
40PHASE algorithm (3)
- Initial haplotype configuration $H^{(0)}$.
- Subsequent iterations obtain $H^{(t+1)}$ from $H^{(t)}$ using the following steps:
- Select an unresolved individual, $i$, at random.
- Sample $H_i^{(t+1)}$ from $f(H_i \mid G, H_{-i}^{(t)})$.
- Set $H_k^{(t+1)} = H_k^{(t)}$ for all $k \neq i$.
- On convergence, each sampled configuration represents a random draw from $f(H \mid G)$.
41PHASE algorithm (4)
- How to obtain $f(H_i \mid G, H_{-i})$?
- Base directly on sample frequency of observed haplotypes in configuration $H_{-i}$.
- Better to introduce prior model for population haplotype frequencies, $f(h)$.
- Coalescent process used to predict likely patterns of haplotypes occurring in populations.
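The sampling scheme can be sketched in Python using the simpler of the two options above: the conditional distribution is based directly on the haplotype counts among the other individuals. This is explicitly not the coalescent-based prior that PHASE uses. The function names, the small pseudo-count `pseudo` (added so that unseen haplotypes keep non-zero probability, in the spirit of a Dirichlet prior), and the number of sweeps are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product


def configurations(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    found = set()
    for swap in product((False, True), repeat=len(het)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for locus, do_swap in zip(het, swap):
            if do_swap:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        found.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(found)


def gibbs_phase(genotypes, n_sweeps=200, pseudo=0.1, seed=0):
    """Gibbs sampler over haplotype configurations (count-based conditional).

    Returns the final state and, for each individual, a Counter of how often
    each configuration was sampled.
    """
    rng = random.Random(seed)
    options = [configurations(g) for g in genotypes]
    state = [rng.choice(opts) for opts in options]  # random H(0)
    unresolved = [i for i, opts in enumerate(options) if len(opts) > 1]
    tallies = [Counter() for _ in genotypes]
    for _ in range(n_sweeps * len(unresolved)):
        i = rng.choice(unresolved)
        # Haplotype counts among all other individuals, i.e. H_{-i}.
        counts = Counter(h for j, pair in enumerate(state) if j != i for h in pair)
        weights = [(counts[a] + pseudo) * (counts[b] + pseudo)
                   for a, b in options[i]]
        state[i] = rng.choices(options[i], weights=weights)[0]
        tallies[i][state[i]] += 1
    return state, tallies


toy = [["00", "00"]] * 5 + [["11", "11"]] * 5 + [["01", "01"]]
state, tallies = gibbs_phase(toy)
print(tallies[-1].most_common())
```

The tallies approximate posterior probabilities for each unresolved individual under this simplified model. A practical sampler would also discard an initial burn-in period, which this sketch omits.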
46PHASE algorithm (5)
- Key principle
- Configuration $H_i$ is more likely to consist of haplotypes $H_{i1}$ and $H_{i2}$ that are exactly the same as, or similar to, haplotypes in the configuration $H_{-i}$.
50Example (1)
- Resolved haplotypes 22544 and 22334.
- Unresolved individual 22 22 35 34 44.
- Possible configurations
- (1) 22334 / 22544
- (2) 22534 / 22344 (differences of -1 and +1 from the resolved haplotypes)
- Assign high probability to sampling configuration (1).
53Example (2)
- Resolved haplotypes 22544 and 22334.
- Unresolved individual 22 22 46 34 44.
- Possible configurations
- (1) 22434 / 22644 (differences of +1 and +1 from the resolved haplotypes)
- (2) 22634 / 22444 (differences of +3 and -1 from the resolved haplotypes)
- Assign high probability to sampling configuration (1).
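The difference annotations in the two examples can be reproduced with a short Python sketch. The rule used here is an assumption inferred from the annotations, not PHASE's actual calculation: each candidate haplotype is compared with the resolved haplotypes that differ from it at no more than one locus, and the signed allele difference of smallest magnitude is reported. The function name `step_to_nearest` is hypothetical, and alleles are assumed to be single digits.

```python
def step_to_nearest(candidate, resolved):
    """Signed single-locus allele difference to the closest resolved haplotype.

    Returns 0 for an exact match, or None if every resolved haplotype
    differs from the candidate at two or more loci.
    """
    best = None
    for ref in resolved:
        diffs = [int(c) - int(r) for c, r in zip(candidate, ref) if c != r]
        if len(diffs) > 1:
            continue  # differs at several loci: not a single-step neighbour
        step = diffs[0] if diffs else 0
        if best is None or abs(step) < abs(best):
            best = step
    return best


resolved = ["22544", "22334"]
candidates = {
    "Example (1), configuration (1)": ("22334", "22544"),
    "Example (1), configuration (2)": ("22534", "22344"),
    "Example (2), configuration (1)": ("22434", "22644"),
    "Example (2), configuration (2)": ("22634", "22444"),
}
for label, pair in candidates.items():
    steps = [step_to_nearest(h, resolved) for h in pair]
    print(label, pair, steps, "total:", sum(abs(s) for s in steps))
```

In both examples, the configuration with the smaller total difference is configuration (1), the one assigned high sampling probability on the slides.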
54PHASE algorithm comments
- Allows for uncertainty in haplotype reconstruction in Bayesian framework.
- Can handle missing data.
- Coalescent process does not explicitly allow for recombination, but performs well even when cross-over events occur (up to 0.1 cM).
- Up to 50% more efficient than Clark's algorithm or the E-M algorithm.
55PHASE algorithm output
- Best reconstruction output for each individual.
- Uncertainty in reconstruction indicated by system of brackets:
- [ ]: inferred missing genotype uncertain, with posterior probability less than specified threshold.
- ( ): inferred phase assignment uncertain, with posterior probability less than specified threshold.
- 0 (1) 0 0 1 (0)
- 0 (0) 1 0 1 (1)
56PHASE algorithm interpretation
- Best reconstruction not necessarily correct.
- Uncertain haplotype configurations should be investigated further.
- Effective targeting of additional genotyping costs.
57Other Bayesian MCMC algorithms
- HAPLOTYPER
  - Prior model for haplotype frequencies given by Dirichlet distribution.
  - Deals with large number of SNPs by partition ligation.
  - Outputs best reconstruction with uncertainty measured by posterior probability.
- HAPMCMC
  - Log-linear prior model for haplotype frequencies incorporating interactions corresponding to first order LD between SNPs.
  - Designed specifically for investigating LD across small genomic regions.
58Don't ignore uncertainty
- Tempting to treat best configuration as correct in subsequent analyses.
- Can seriously affect inferences.
- Example: LD across small genomic regions.
59False positive error rates
Nominal level   E-M     PHASE (BEST)   HAPMCMC (BEST)   HAPMCMC (POST)
0.01            0.008   0.204          0.060            0.009
0.05            0.045   0.362          0.183            0.056
0.1             0.103   0.446          0.247            0.091
0.2             0.212   0.566          0.392            0.201
0.5             0.524   0.767          0.640            0.527
60Two stage analyses
- For many studies, haplotype reconstruction is just an intermediate step required for a second stage of analysis.
- We don't actually care what the haplotypes are: they are missing data that we must account for in the second stage of analysis.
- Better to develop methods that allow for uncertainty in haplotype configuration in unresolved individuals, rather than haplotype reconstruction methods per se.
61Summary
- Haplotype information required for analysis of high-density marker data.
- Many algorithms available.
- Interpret output with care.
- Reconstructed haplotypes often not of interest, but uncertainty must be accounted for in subsequent analyses.