1
METHODS FOR HAPLOTYPE RECONSTRUCTION
  • Andrew Morris
  • Wellcome Trust Centre for Human Genetics
  • March 6, 2003

2
Outline
  • Haplotypes and genotypes.
  • Reconstruction in pedigrees.
  • Reconstruction in unrelated individuals.
  • Interpretation and LD assessment.
  • Two stage analyses.

3
Haplotypes and genotypes (1)
1 0 0 0 1
1 1 0 0 0

11 01 00 00 01

7
Haplotypes and genotypes (2)
  • Individuals that are homozygous at every locus,
    or heterozygous at just one locus can be
    resolved.
  • Individuals that are heterozygous at k loci are
    consistent with 2^(k-1) configurations of haplotypes.
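
The count of configurations can be checked by enumeration. The following is a minimal sketch, assuming biallelic genotypes written as two-character strings per locus (as on the earlier slides); the function name `configurations` is introduced here for illustration only.

```python
from itertools import product

def configurations(genotype):
    """Return the set of unordered haplotype pairs consistent with a genotype.

    `genotype` is a list of two-character strings such as "01", one per locus.
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = set()
    # Fix the phase of the first heterozygous locus so that each unordered
    # pair is generated once; the remaining k-1 heterozygous loci flip freely.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = {locus for locus, f in zip(het[1:], flips) if f}
        h1 = "".join(g[1] if i in flip_at else g[0] for i, g in enumerate(genotype))
        h2 = "".join(g[0] if i in flip_at else g[1] for i, g in enumerate(genotype))
        pairs.add((h1, h2))
    return pairs

genotype = "11 01 00 00 01".split()   # genotype from the earlier slide
for h1, h2 in sorted(configurations(genotype)):
    print(h1, "/", h2)
```

With k = 2 heterozygous loci the sketch yields 2^(2-1) = 2 configurations, one of which is the pair 10001 / 11000 shown on the earlier slide.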

8
Why do we need haplotypes?
  • Correlation between alleles at closely linked
    loci
  • Fine-scale mapping studies.
  • Association studies with multiple markers in
    candidate genes.
  • Investigating patterns of LD across genomic
    regions.
  • Inferring population histories.

9
Molecular methods
  • Single molecule dilution.
  • Allele-specific long-range PCR.
  • Prone to errors.
  • Expensive and inefficient: low throughput.

10
Simplex family data (1)
  • 00 01 00 11 x 01 11 01 01
  • (M) (F)
  • 00 01 01 01

12
Simplex family data (1)
  • 00 01 00 11 x 01 11 01 01
  • (M) (F)
  • 00 01 01 01
  • Inferred haplotypes 0001 / 0110

13
Simplex family data (2)
  • 00 01 00 01 x 01 01 00 01
  • (M) (F)
  • 00 01 00 01
  • Cannot be fully resolved
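
The trio logic in the two simplex family examples can be sketched in a few lines. This is an illustrative sketch only, assuming complete genotypes with no genotyping errors; the function name `phase_child` and the use of "?" for undetermined alleles are conventions chosen here, not part of the original slides.

```python
def phase_child(mother, father, child):
    """Infer which child allele came from each parent, locus by locus.

    Each argument is a list of two-character genotypes such as "01".
    Returns (maternal, paternal) haplotype strings, with "?" at loci
    where the transmitted alleles cannot be deduced.
    """
    maternal, paternal = [], []
    for m, f, c in zip(mother, father, child):
        a, b = c[0], c[1]
        if a == b:
            # Homozygous child: both parents transmitted the same allele.
            maternal.append(a)
            paternal.append(a)
            continue
        # Heterozygous child (a, b): test both possible assignments.
        a_from_mother = a in m and b in f
        b_from_mother = b in m and a in f
        if a_from_mother and not b_from_mother:
            maternal.append(a)
            paternal.append(b)
        elif b_from_mother and not a_from_mother:
            maternal.append(b)
            paternal.append(a)
        else:
            # Both assignments possible (or neither, i.e. inconsistent data).
            maternal.append("?")
            paternal.append("?")
    return "".join(maternal), "".join(paternal)

# Simplex family data (1): fully resolved.
print(phase_child("00 01 00 11".split(), "01 11 01 01".split(), "00 01 01 01".split()))
# Simplex family data (2): two loci remain ambiguous.
print(phase_child("00 01 00 01".split(), "01 01 00 01".split(), "00 01 00 01".split()))
```

In the second family every member is heterozygous at the second and fourth loci, so the phase between those two loci cannot be recovered from the trio alone.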

14
Pedigree data (1)
  • 11 01 11 01 11 x 00 00 11 11 11
  • 01 01 11 11 11 x 01 00 00 01 00
  • 01 01 01 01 01; 11 01 01 01 01; 00 00 01 11 01

15
Pedigree data (1)
  • 11111 / 10101 x 00111 / 00111
  • 11111 / 00111 x 00010 / 10000
  • 11111 / 00000; 11111 / 10000; 00111 / 00010

17
Pedigree data (2)
  • Many combinations of haplotypes may be consistent
    with pedigree genotype data.
  • Complex computational problem.
  • Need to make assumptions about recombination.
  • SIMWALK and MERLIN.

18
Statistical approaches to reconstruct haplotypes
in unrelated individuals
  • Parsimony methods: Clark's algorithm.
  • Likelihood methods: E-M algorithm.
  • Bayesian methods: PHASE algorithm.
  • Aims: reconstruct haplotypes and/or estimate
    population frequencies.

19
Clark's algorithm (1)
  • Reconstruct haplotypes in unresolved individuals
    via parsimony.
  • Minimise number of haplotypes observed in sample.
  • Microsatellite or SNP genotypes.

20
Clark's algorithm (2)
  1. Search for resolved individuals, and record all
     recovered haplotypes.
  2. Compare each unresolved individual with the list of
     recovered haplotypes.
  3. If a recovered haplotype is identified, the
     individual is resolved.
  4. The complementary haplotype is added to the list of
     recovered haplotypes.
  5. Repeat 2-4 until all individuals are resolved or
     no more haplotypes can be recovered.
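
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes biallelic genotypes with no missing data, processes individuals and recovered haplotypes in a fixed order, and the names `resolve_with` and `clark` are hypothetical. The genotypes are those of individuals (A) to (J) in the example on the next slides.

```python
def resolve_with(genotype, hap):
    """Return the complementary haplotype if `hap` is compatible with
    `genotype` (a list of two-character strings), otherwise None."""
    comp = []
    for g, allele in zip(genotype, hap):
        if allele == g[0]:
            comp.append(g[1])
        elif allele == g[1]:
            comp.append(g[0])
        else:
            return None
    return "".join(comp)

def clark(genotypes):
    """genotypes: dict mapping individual name -> list of genotype strings."""
    resolved, recovered = {}, []
    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for name, g in genotypes.items():
        if sum(a != b for a, b in g) <= 1:
            h1 = "".join(a for a, _ in g)
            h2 = "".join(b for _, b in g)
            resolved[name] = (h1, h2)
            for h in (h1, h2):
                if h not in recovered:
                    recovered.append(h)
    # Steps 2-5: keep sweeping until a full pass resolves nobody new.
    progress = True
    while progress:
        progress = False
        for name, g in genotypes.items():
            if name in resolved:
                continue
            for h in recovered:
                comp = resolve_with(g, h)
                if comp is not None:
                    resolved[name] = (h, comp)
                    if comp not in recovered:
                        recovered.append(comp)
                    progress = True
                    break
    return resolved, recovered

example = {
    "A": "00 01 01 00", "B": "00 00 00 00", "C": "00 01 00 00",
    "D": "01 11 01 11", "E": "00 11 01 01", "F": "01 11 11 00",
    "G": "00 01 11 01", "H": "00 01 01 11", "I": "00 00 00 00",
    "J": "00 00 00 11",
}
genotypes = {name: g.split() for name, g in example.items()}
resolved, recovered = clark(genotypes)
for name in sorted(resolved):
    print(name, " / ".join(resolved[name]))
```

Because the first compatible recovered haplotype is always taken, the result depends on the order of individuals and of the recovered list; a different order can give a different, equally parsimonious-looking answer.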

21
Example
  • (A) 00 01 01 00
  • (B) 00 00 00 00
  • (C) 00 01 00 00
  • (D) 01 11 01 11
  • (E) 00 11 01 01
  • (F) 01 11 11 00
  • (G) 00 01 11 01
  • (H) 00 01 01 11
  • (I) 00 00 00 00
  • (J) 00 00 00 11

23
Example
  • (A) 00 01 01 00
  • (B) 0000 / 0000
  • (C) 0000 / 0100
  • (D) 01 11 01 11
  • (E) 00 11 01 01
  • (F) 0110 / 1110
  • (G) 00 01 11 01
  • (H) 00 01 01 11
  • (I) 0000 / 0000
  • (J) 0001 / 0001
  • Recovered haplotypes
  • 0000
  • 0100
  • 0110
  • 1110
  • 0001

25
Example
  • (A) 0000 / 0110
  • (B) 0000 / 0000
  • (C) 0000 / 0100
  • (D) 01 11 01 11
  • (E) 00 11 01 01
  • (F) 0110 / 1110
  • (G) 00 01 11 01
  • (H) 00 01 01 11
  • (I) 0000 / 0000
  • (J) 0001 / 0001
  • Recovered haplotypes
  • 0000 0111
  • 0100
  • 0110
  • 1110
  • 0001

26
Example
  • (A) 0000 / 0110
  • (B) 0000 / 0000
  • (C) 0000 / 0100
  • (D) 01 11 01 11
  • (E) 0100 / 0111
  • (F) 0110 / 1110
  • (G) 00 01 11 01
  • (H) 00 01 01 11
  • (I) 0000 / 0000
  • (J) 0001 / 0001
  • Recovered haplotypes
  • 0000 0111
  • 0100 0011
  • 0110
  • 1110
  • 0001

27
Example
  • (A) 0000 / 0110
  • (B) 0000 / 0000
  • (C) 0000 / 0100
  • (D) 0111 / 1101
  • (E) 0100 / 0111
  • (F) 0110 / 1110
  • (G) 0110 / 0011
  • (H) 0001 / 0111
  • (I) 0000 / 0000
  • (J) 0001 / 0001
  • Recovered haplotypes
  • 0000 0111
  • 0100 0011
  • 0110 1101
  • 1110
  • 0001

28
Example problem
  • (A) 0000 / 0110
  • (B) 0000 / 0000
  • (C) 0000 / 0100
  • (D) 01 11 01 11
  • (E) 0100 / 0111
  • (F) 0110 / 1110
  • (G) 00 01 11 01
  • (H) 00 01 01 11
  • (I) 0000 / 0000
  • (J) 0001 / 0001
  • Recovered haplotypes
  • 0000 0111
  • 0100 0011
  • 0110
  • 1110
  • 0001

29
Example problem
  • (A) 0000 / 0110
  • (B) 0000 / 0000
  • (C) 0000 / 0100
  • (D) 01 11 01 11
  • (E) 0100 / 0111
  • (F) 0110 / 1110
  • (G) 00 01 11 01
  • (H) 00 01 01 11
  • (I) 0000 / 0000
  • (J) 0001 / 0001
  • Recovered haplotypes
  • 0000 0111
  • 0100 0010
  • 0110
  • 1110
  • 0001

30
Clark's algorithm: problems
  • Multiple solutions: try many different orderings
    of individuals.
  • No starting point for algorithm.
  • Algorithm may leave many unresolved individuals.
  • How to deal with missing data?

31
E-M algorithm (1)
  • Maximum likelihood method for population
    haplotype frequency estimation.
  • Allows for the fact that unresolved genotypes
    could be constructed from many different
    haplotype configurations.
  • Microsatellite or SNP genotypes.

32
E-M algorithm (2)
  • Observed sample of N individuals with genotypes,
    G.
  • Unobserved population haplotype frequencies, h.
  • Unobserved configurations, H, consisting of
    complementary haplotype pairs $H_i = \{H_{i1}, H_{i2}\}$.

33
E-M algorithm (3)
  • Likelihood
  • $f(G \mid h) = \prod_k f(G_k \mid h) = \prod_k \sum_i f(G_k \mid H_i)\, f(H_i \mid h)$
  • where $f(H_i \mid h) = f(H_{i1} \mid h)\, f(H_{i2} \mid h)$ under
    Hardy-Weinberg equilibrium.

34
E-M algorithm (4)
  • Numerical algorithm used to obtain maximum
    likelihood estimates of h.
  • Initial set of haplotype frequencies $h^{(0)}$.
  • Haplotype frequencies $h^{(t)}$ at iteration t updated
    from frequencies at iteration t-1 using
    Expectation and Maximisation steps.
  • Continue until $h^{(t)}$ has converged.

35
Expectation step
  • Use haplotype frequencies, $h^{(t)}$, to calculate the
    probability of resolving each genotype, $G_k$, into
    each possible haplotype configuration, $H_i$.
  • $E(H_i \mid G_k, h^{(t)}) = \dfrac{f(G_k \mid H_i)\, f(H_{i1} \mid h^{(t)})\, f(H_{i2} \mid h^{(t)})}{f(G_k \mid h^{(t)})}$

36
Maximisation step
  • Compute haplotype frequencies using procedure
    equivalent to gene counting.
  • $h_s^{(t+1)} = \dfrac{\sum_k \sum_i Z_{si}\, E(H_i \mid G_k, h^{(t)})}{2N}$
  • $Z_{si}$ = number of copies (0, 1, 2) of the $s$th haplotype
    in configuration $H_i$.
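
The Expectation and Maximisation steps can be sketched together as below. This is a minimal illustration under stated assumptions: biallelic loci, no missing data, a uniform starting point, and configurations enumerated as ordered haplotype pairs so that the Hardy-Weinberg product applies directly. The function names, the convergence tolerance and the toy sample are choices made here, not taken from the slides.

```python
from itertools import product
from collections import defaultdict

def configurations(genotype):
    """All ordered haplotype pairs (h1, h2) consistent with a genotype."""
    options = [[(g[0], g[1])] if g[0] == g[1] else [(g[0], g[1]), (g[1], g[0])]
               for g in genotype]
    return [("".join(a for a, _ in combo), "".join(b for _, b in combo))
            for combo in product(*options)]

def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-9):
    configs = [configurations(g) for g in genotypes]
    haps = sorted({h for cs in configs for pair in cs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}      # h^(0): uniform start
    n = len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for cs in configs:
            # E step: weight of each configuration under Hardy-Weinberg.
            weights = [freq[h1] * freq[h2] for h1, h2 in cs]
            total = sum(weights)                    # f(G_k | h^(t))
            for (h1, h2), w in zip(cs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M step: expected haplotype counts divided by 2N (gene counting).
        new_freq = {h: counts[h] / (2 * n) for h in haps}
        converged = max(abs(new_freq[h] - freq[h]) for h in haps) < tol
        freq = new_freq
        if converged:
            break
    return freq

# Toy sample: three 00/00 individuals, three 11/11, one double heterozygote.
sample = [g.split() for g in ["00 00"] * 3 + ["11 11"] * 3 + ["01 01"]]
freq = em_haplotype_frequencies(sample)
for hap, f in sorted(freq.items()):
    print(hap, round(f, 4))
```

In this toy sample the double heterozygote is consistent with either 00 / 11 or 01 / 10; because 00 and 11 are already common among the resolved individuals, the estimated frequencies of 01 and 10 shrink towards zero over the iterations.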

37
E-M algorithm comments
  • Can handle missing data.
  • For many loci, the number of possible haplotypes
    is large, so population frequencies are difficult
    to estimate: re-parameterisation.
  • Does not provide reconstructed haplotype
    configuration for unresolved individuals: can use
    the maximum likelihood configuration.

38
PHASE algorithm (1)
  • Treats haplotype configuration for each
    unresolved individual as an unobserved random
    quantity.
  • Evaluate the conditional distribution, given a
    sample of unresolved genotype data.
  • Microsatellite or SNP genotypes.
  • Reconstruction and population haplotype frequency
    estimation.

39
PHASE algorithm (2)
  • Bayesian framework: goal is to approximate the
    posterior distribution of haplotype
    configurations, $f(H \mid G)$.
  • Implements Markov chain Monte Carlo (MCMC)
    methods to sample from $f(H \mid G)$: Gibbs sampling.
  • Start at random configuration.
  • Repeatedly select unresolved individuals at
    random, and sample from their possible haplotype
    configurations, assuming all other individuals to
    be correctly resolved.

40
PHASE algorithm (3)
  • Initial haplotype configuration $H^{(0)}$.
  • Subsequent iterations obtain $H^{(t+1)}$ from $H^{(t)}$
    using the following steps:
  • Select an unresolved individual, i, at random.
  • Sample $H_i^{(t+1)}$ from $f(H_i \mid G, H_{-i}^{(t)})$.
  • Set $H_k^{(t+1)} = H_k^{(t)}$ for all $k \neq i$.
  • On convergence, each sampled configuration
    represents a random draw from $f(H \mid G)$.

41
PHASE algorithm (4)
  • How to obtain $f(H_i \mid G, H_{-i})$?
  • Base directly on sample frequency of observed
    haplotypes in configuration $H_{-i}$.
  • Better to introduce prior model for population
    haplotype frequencies, f(h).
  • Coalescent process used to predict likely
    patterns of haplotypes occurring in populations.
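
The sampling scheme can be sketched with the simpler of the two options above: a conditional distribution based directly on haplotype counts in the other individuals. This is explicitly not the coalescent-based prior used by PHASE. The pseudo-count `pseudo`, the function names, the absence of a burn-in period and the toy sample are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

def configurations(genotype):
    """Unordered haplotype pairs consistent with a genotype."""
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    out = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = {locus for locus, f in zip(het[1:], flips) if f}
        h1 = "".join(g[1] if i in flip_at else g[0] for i, g in enumerate(genotype))
        h2 = "".join(g[0] if i in flip_at else g[1] for i, g in enumerate(genotype))
        out.append((h1, h2))
    return out

def gibbs_phase(genotypes, n_sweeps=2000, pseudo=0.1, seed=0):
    rng = random.Random(seed)
    configs = [configurations(g) for g in genotypes]
    unresolved = [i for i, cs in enumerate(configs) if len(cs) > 1]
    state = [rng.choice(cs) for cs in configs]       # H^(0): random start
    counts = Counter(h for pair in state for h in pair)
    tallies = [Counter() for _ in genotypes]
    if not unresolved:
        return state, tallies
    for _ in range(n_sweeps):
        i = rng.choice(unresolved)                   # select individual i
        for h in state[i]:
            counts[h] -= 1                           # counts now describe H_{-i}
        # Naive conditional: product of smoothed haplotype counts in H_{-i}.
        weights = [(counts[h1] + pseudo) * (counts[h2] + pseudo)
                   for h1, h2 in configs[i]]
        state[i] = rng.choices(configs[i], weights=weights)[0]
        for h in state[i]:
            counts[h] += 1
        tallies[i][state[i]] += 1                    # record the sampled pair
    return state, tallies

# Toy sample: five 00/00, five 11/11 and one double heterozygote.
sample = [g.split() for g in ["00 00"] * 5 + ["11 11"] * 5 + ["01 01"]]
state, tallies = gibbs_phase(sample, n_sweeps=2000, seed=1)
print(tallies[10].most_common())
```

The tallies for each unresolved individual give a rough estimate of how often each configuration was sampled, which is the kind of uncertainty summary the Bayesian approach is meant to provide.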

46
PHASE algorithm (5)
  • Key principle
  • Configuration $H_i$ is more likely to consist of
    haplotypes $H_{i1}$ and $H_{i2}$ that are exactly the same
    as, or similar to, haplotypes in the
    configuration $H_{-i}$.

47
Example (1)
  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 35 34 44.

48
Example (1)
  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 35 34 44.
  • Possible configurations:
  • (1) 22334 / 22544
  • (2) 22534 / 22344

49
Example (1)
  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 35 34 44.
  • Possible configurations:
  • (1) 22334 / 22544
  • (2) 22534 / 22344 (single-locus differences from
    resolved haplotypes: -1 and +1)

50
Example (1)
  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 35 34 44.
  • Possible configurations:
  • (1) 22334 / 22544
  • (2) 22534 / 22344 (single-locus differences from
    resolved haplotypes: -1 and +1)
  • Assign high probability to sampling configuration
    (1).

51
Example (2)
  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 46 34 44.

52
Example (2)
  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 46 34 44.
  • Possible configurations:
  • (1) 22434 / 22644
  • (2) 22634 / 22444

53
Example (2)
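  • Resolved haplotypes: 22544 and 22334.
  • Unresolved individual: 22 22 46 34 44.
  • Possible configurations:
  • (1) 22434 / 22644 (single-locus differences from
    resolved haplotypes: +1 and +1)
  • (2) 22634 / 22444 (single-locus differences from
    resolved haplotypes: +3 and -1)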
  • Assign high probability to sampling configuration
    (1).
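
The allele-size differences quoted in the two examples can be reproduced with a small helper. This is only a toy illustration of the "similar haplotypes are more likely" principle, not the probability model PHASE actually uses; it assumes single-digit microsatellite allele sizes, and the function name and tie-breaking rule are choices made here.

```python
def nearest_difference(candidate, resolved):
    """Signed allele-size differences between `candidate` and the resolved
    haplotype that differs from it at the fewest loci (ties broken by the
    smallest total step size). Alleles are single-digit repeat counts."""
    def diffs(ref):
        return [int(a) - int(b) for a, b in zip(candidate, ref) if a != b]
    return min((diffs(r) for r in resolved),
               key=lambda d: (len(d), sum(abs(x) for x in d)))

resolved = ["22544", "22334"]
examples = {
    "Example (1)": [("22334", "22544"), ("22534", "22344")],
    "Example (2)": [("22434", "22644"), ("22634", "22444")],
}
for title, configs in examples.items():
    for h1, h2 in configs:
        print(title, h1, "/", h2,
              nearest_difference(h1, resolved),
              nearest_difference(h2, resolved))
```

An empty list means the candidate haplotype matches a resolved haplotype exactly, as for configuration (1) of the first example; smaller differences correspond to configurations that should be sampled with higher probability.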

54
PHASE algorithm comments
  • Allows for uncertainty in haplotype
    reconstruction in Bayesian framework.
  • Can handle missing data.
  • Coalescent process does not explicitly allow for
    recombination, but performs well even when
    cross-over events occur (up to 0.1 cM).
  • Up to 50% more efficient than Clark's algorithm
    or the E-M algorithm.

55
PHASE algorithm output
  • Best reconstruction output for each individual.
  • Uncertainty in reconstruction indicated by a system
    of brackets:
  • [] inferred missing genotype uncertain, with
    posterior probability less than specified
    threshold;
  • () inferred phase assignment uncertain, with
    posterior probability less than specified
    threshold.
  • 0 (1) 0 0 1 (0)
  • 0 (0) 1 0 1 (1)

56
PHASE algorithm interpretation
  • Best reconstruction not necessarily correct.
  • Uncertain haplotype configurations should be
    investigated further.
  • Effective targeting of additional genotyping
    costs.

57
Other Bayesian MCMC algorithms
  • HAPLOTYPER
  • Prior model for haplotype frequencies given by
    Dirichlet distribution.
  • Deals with large number of SNPs by partition
    ligation.
  • Outputs best reconstruction with uncertainty
    measured by posterior probability.
  • HAPMCMC
  • Log-linear prior model for haplotype frequencies
    incorporating interactions corresponding to first
    order LD between SNPs.
  • Designed specifically for investigating LD across
    small genomic regions.

58
Don't ignore uncertainty
  • Tempting to treat best configuration as correct
    in subsequent analyses.
  • Can seriously affect inferences.
  • Example: LD across small genomic regions.

59
False positive error rates
Level   E-M     PHASE (BEST)   HAPMCMC (BEST)   HAPMCMC (POST)
0.01    0.008   0.204          0.060            0.009
0.05    0.045   0.362          0.183            0.056
0.1     0.103   0.446          0.247            0.091
0.2     0.212   0.566          0.392            0.201
0.5     0.524   0.767          0.640            0.527

60
Two stage analyses
  • For many studies, haplotype reconstruction is
    just an intermediate step required for a second
    stage of analysis.
  • We don't actually care what the haplotypes are:
    they are missing data that we must account for in
    the second stage of analysis.
  • Better to develop methods that allow for
    uncertainty in haplotype configuration in
    unresolved individuals, rather than haplotype
    reconstruction methods per se.

61
Summary
  • Haplotype information required for analysis of
    high-density marker data.
  • Many algorithms available.
  • Interpret output with care.
  • Reconstructed haplotypes often not of interest,
    but uncertainty must be accounted for in
    subsequent analyses.