1
FINE SCALE MAPPING
  • ANDREW MORRIS
  • Wellcome Trust Centre for Human Genetics
  • March 7, 2003

2
Outline
  • Introduction: fine-scale mapping using
    high-density SNP haplotype data.
  • Bayesian framework.
  • Gene trees and the coalescent process.
  • Genetic heterogeneity and shattered gene trees.
  • Markov chain Monte Carlo (MCMC) algorithm.
  • SNP genotype data.
  • Example: cystic fibrosis.

3
Introduction
  • Candidate region of the order of 1Mb in length.
  • Refine location of putative disease locus within
    region.
  • Make use of high-density maps of single
    nucleotide polymorphisms (SNPs).
  • Type sample of affected cases and unaffected
    controls.

4
Once upon a time
  • Disease predisposition determined by single locus
    in candidate region.
  • Each case chromosome carries a copy of a disease
    allele, resulting from a single recent mutation
    event at disease locus.
  • Each control chromosome carries a copy of the
    ancient normal allele at the disease locus.

8
In an ideal world
  • Excess sharing of SNP haplotypes in the vicinity
    of the disease locus, among cases and not among
    controls.
  • Decreased probability of sharing as distance from
    disease locus increases.
  • Approximate location of disease locus inferred.

9
Problems
  • Gene tree and ancestral haplotypes are unknown.
  • Marker mutations lead to mismatch of alleles
    within preserved regions.
  • Multiple disease genes, multiple mutations, and
    dominance.

10
Example: Cystic fibrosis (CF)
  • Fully penetrant recessive disorder, incidence
    1/2500 live births in white populations, less
    common in other populations.
  • Preliminary linkage analysis suggested 1.8Mb
    candidate region for a single CF gene on
    chromosome 7q31.
  • More recently, a 3bp deletion, ΔF508, has been
    identified in the CFTR gene at 0.88Mb into the
    candidate region.
  • Now known that ΔF508 accounts for 66% of all
    chromosomal mutations in individuals with CF.
  • Remainder of CF chromosomes carry copies of many
    other rare mutations in the same gene.
  • 23 RFLPs used to identify haplotypes in 92
    control chromosomes and 94 case chromosomes, 62
    of which have been confirmed to carry ΔF508.

13
Challenges
  • The ΔF508 locus does not lie at the centre of the
    region of high LD.
  • Non-ΔF508 case chromosomes are not expected to
    share the same founder marker haplotype.
  • Useful test data set for fine-scale mapping
    methods.

16
Published methods
17
Bayesian framework (1)
  • Assume disease locus exists in candidate region;
    aim is then to estimate its location.
  • Approximate the posterior distribution of
    location.
  • Allows assignment of probabilities that disease
    locus lies in any particular area of the
    candidate region.

18
Bayesian framework (2)
  • Aim is to approximate the posterior density of
    location, x, of the disease locus, given SNP
    haplotypes in cases, A, and controls, U, denoted
    $f(x \mid A, U)$.
  • Depends on other model parameters, M, including
    gene tree, population haplotype frequencies, etc.
  • Recover marginal posterior density by integration
    over these nuisance parameters:
  • $f(x \mid A, U) = \int f(x, M \mid A, U)\, dM$

19
Bayesian framework (3)
  • By Bayes' theorem:
  • $f(x, M \mid A, U) = C\, f(A, U \mid x, M)\, f(x, M)$
  • $C$: normalising constant.
  • $f(A, U \mid x, M)$: likelihood of haplotype data
    given model parameters M and location x.
  • $f(x, M)$: prior density of M and x.
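
The Bayes' theorem step and the integration over nuisance parameters can be illustrated on a small discrete grid. This is only a sketch under stated assumptions: the toy likelihood, the single nuisance parameter `m`, and the names `grid_posterior`, `toy_likelihood` and `uniform_prior` are hypothetical and are not taken from the presentation.

```python
import math

def grid_posterior(xs, ms, likelihood, prior):
    """Marginal posterior of x on a grid, summing over a nuisance parameter m."""
    # Unnormalised joint posterior: likelihood * prior at every grid point.
    joint = [[likelihood(x, m) * prior(x, m) for m in ms] for x in xs]
    total = sum(sum(row) for row in joint)  # plays the role of 1/C
    # Summing each row over m is the discrete analogue of integrating over M.
    return [sum(row) / total for row in joint]

# Hypothetical toy model (assumption): location x in Mb, one nuisance scale m.
def toy_likelihood(x, m):
    return math.exp(-((x - 0.9) ** 2) / (2 * m ** 2))

def uniform_prior(x, m):
    return 1.0

xs = [i * 0.1 for i in range(19)]  # grid over a 1.8 Mb candidate region
ms = [0.1, 0.2, 0.3]
post = grid_posterior(xs, ms, toy_likelihood, uniform_prior)
best_x = xs[post.index(max(post))]
```

In the real problem M is far too large to enumerate on a grid, which is why the later slides turn to MCMC sampling.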

23
Control chromosomes
  • Assumed to carry an ancient normal allele at the
    disease locus.
  • Effects of recent shared ancestry of less
    importance, so simple model assumed:
  • $f(A, U \mid x, M) = f(A \mid x, M)\, f(U \mid h)$
  • The likelihood, $f(U \mid h)$, depends only on
    population SNP haplotype frequencies, h.
  • For many SNPs, the number of possible haplotypes
    is large, so frequencies are parameterised in
    terms of allele frequencies and first-order LD
    between pairs of adjacent loci.
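
One way to read the last bullet is as a first-order Markov chain along the haplotype, built from allele frequencies and a pairwise LD coefficient for each pair of adjacent loci. The sketch below assumes biallelic markers coded 0/1 and the usual coefficient D = P(1,1) - p*q; the function names and the example values are hypothetical, and the presentation does not state its exact parameterisation.

```python
import math

def allele_freq(allele, p):
    """Frequency of `allele` (0 or 1) when allele 1 has frequency p."""
    return p if allele == 1 else 1.0 - p

def pair_freq(a, b, p, q, d):
    """Joint frequency of alleles (a, b) at two adjacent loci.

    p, q are allele-1 frequencies; d is the LD coefficient D = P(1,1) - p*q.
    """
    sign = 1.0 if a == b else -1.0
    return allele_freq(a, p) * allele_freq(b, q) + sign * d

def haplotype_freq(hap, freqs, lds):
    """First-order approximation: P(a1) * product of P(a_{j+1} | a_j)."""
    prob = allele_freq(hap[0], freqs[0])
    for j in range(len(hap) - 1):
        joint = pair_freq(hap[j], hap[j + 1], freqs[j], freqs[j + 1], lds[j])
        prob *= joint / allele_freq(hap[j], freqs[j])
    return prob

def control_log_likelihood(controls, freqs, lds):
    """log f(U | h), treating control haplotypes as independent draws."""
    return sum(math.log(haplotype_freq(h, freqs, lds)) for h in controls)

# Hypothetical three-marker example (values are assumptions).
freqs = [0.3, 0.5, 0.6]
lds = [0.05, -0.02]
example = haplotype_freq((1, 1, 1), freqs, lds)
```

With L markers this needs L allele frequencies and L - 1 LD coefficients, instead of a frequency for each of the 2^L possible haplotypes.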

24
Gene trees
  • Representation of the recent shared ancestry of
    case chromosomes at the disease locus.
  • Star-shaped tree: each case chromosome descends
    independently from founder. Assumes there is too
    much information in sample about ancestral
    recombination and mutation events.
  • Bifurcating tree: shared ancestral recombination
    and mutation events between chromosomes appear
    only once in their shared ancestry.

30
Tree specification
  • Topology, T: the branching pattern of the tree.
  • Branch lengths, t, determined by the waiting
    times, w, between merging events in the gene
    tree.
  • Scaled in units of 2N generations, where N is
    effective population size.

[Figure: gene tree, with the root and leaf nodes labelled]
31
Prior probability model
  • Uniform prior probability model for population
    haplotype frequencies, the location of disease
    locus, and the effective population size.
  • Each gene tree topology has equal prior
    probability.
  • Prior probability model reduces to
  • $f(x, M) = C\, f(w)$
  • Need prior probability model for waiting times
    between merging events.

32
The coalescent process (1)
  • Time between merging event from k to k-1
    lineages.
  • Scaled in units of 2N generations.
  • Exponential distribution with rate k(k-1)/2.

Worked examples (times in units of 2N generations):
  • k = 8: exponential rate 8 × 7/2 = 28, expected time 1/28 ≈ 0.0357.
  • k = 7: exponential rate 7 × 6/2 = 21, expected time 1/21 ≈ 0.0476.
  • k = 2: exponential rate 2 × 1/2 = 1, expected time 1.
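
The rates and expected times quoted for the coalescent process can be checked with a short sketch. It assumes the standard constant-size coalescent described on the slide; the function names are hypothetical.

```python
import random

def coalescent_rate(k):
    """Rate of the merging event that takes k lineages to k - 1."""
    return k * (k - 1) / 2

def expected_waiting_times(n):
    """Expected waiting time (units of 2N generations) for k = n, ..., 2."""
    return {k: 1 / coalescent_rate(k) for k in range(n, 1, -1)}

def simulate_tree_height(n, rng):
    """One draw of the total tree height: the sum of the n - 1 waiting times."""
    return sum(rng.expovariate(coalescent_rate(k)) for k in range(n, 1, -1))

times = expected_waiting_times(8)
expected_height = sum(times.values())  # algebraically 2 * (1 - 1/n)
```

For a sample of 8 chromosomes the expected tree height is 2 × (1 - 1/8) = 1.75, and the final merge from 2 lineages to 1 contributes an expected time of 1, more than half of that total.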
36
The coalescent process (2)
  • Assumes constant effective population size, N.
  • Flexible: can allow for exponential population
    growth and population sub-structure.
  • Assumes sample is ascertained at random from the
    population. Problem: case chromosomes ascertained
    because they carry a copy of the disease
    mutation.
  • Assumes sample has single common ancestor.
    Problem: genetic heterogeneity.

37
The shattered coalescent model
  • Generalisation of the coalescent process to allow
    branches of the gene tree to be removed.
  • Introduce indicator variable, $z_b$, for each node,
    b, taking the value 1 if b has a parent in the
    gene tree and 0 otherwise.
  • Allows for singleton leaf nodes, corresponding to
    sporadic case chromosomes, and disconnected
    sub-trees, corresponding to independent mutation
    events at the same disease locus.
  • Assume number of branches of gene tree not
    removed in the shattered coalescent process is
    given by a binomial distribution, with shattering
    parameter $\rho$.
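
A minimal sketch of shattering a gene tree. It assumes each branch is retained independently with the same probability (called `rho` here), which gives a binomial number of retained branches. The tree, its node names and the function names are hypothetical.

```python
import random

def shatter(parent, rho, rng):
    """Draw the indicator z_b for every node b with a parent in the full tree.

    z_b = 1 keeps the branch above b (probability rho); z_b = 0 removes it.
    """
    return {b: 1 if rng.random() < rho else 0 for b in parent}

def subtree_root(node, parent, z):
    """Climb retained branches until reaching a node with no retained parent."""
    while node in parent and z[node] == 1:
        node = parent[node]
    return node

def leaf_clusters(leaves, parent, z):
    """Group leaves by the sub-tree of the shattered tree they fall in."""
    clusters = {}
    for leaf in leaves:
        clusters.setdefault(subtree_root(leaf, parent, z), []).append(leaf)
    return sorted(clusters.values())

# Hypothetical tree: ((a, b), (c, d)) with internal nodes n1, n2 and a root.
parent = {"a": "n1", "b": "n1", "c": "n2", "d": "n2", "n1": "root", "n2": "root"}
leaves = ["a", "b", "c", "d"]
z = shatter(parent, 0.8, random.Random(0))
clusters = leaf_clusters(leaves, parent, z)
```

A cluster with one leaf corresponds to a sporadic case chromosome. Two or more multi-leaf clusters correspond to independent mutation events at the same locus.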

41
Ancestral haplotypes
  • Haplotypes, I, carried by internal nodes of the
    gene tree are unknown.
  • To calculate posterior probability, need to
    integrate over distribution of possible ancestral
    haplotypes, which depends on gene tree and other
    model parameters.
  • Treated as augmented data in Bayesian framework:
    enters posterior probability through likelihood
  • $f(x \mid A, U) = \int\!\int f(x, M, I \mid A, U)\, dM\, dI$
  • and
  • $f(x, M, I \mid A, U) = C\, f(A, U, I \mid x, M)\, f(x, M)$

42
Likelihood calculations
  • If node has no parent in shattered gene tree,
    treat as a random chromosome from the population
    (sporadic or founder for mutation).
  • If node has parent in genealogy, depends on
    marker haplotype carried by the parental node,
    and the occurrence of recombination and mutation
    events along the connecting branch.
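
The two cases can be sketched for a single biallelic marker. This is a deliberate simplification and not the presentation's likelihood: it assumes the marker is lost to recombination with probability 1 - exp(-d·g) over a branch of g generations at genetic distance d Morgans from the disease locus, that a recombinant allele is drawn from the population frequency, and that a marker mutation flips the allele. A full haplotype model would treat the markers jointly. All names are hypothetical.

```python
import math

def founder_prob(allele, freq1):
    """Node with no parent: treated as a random chromosome from the population."""
    return freq1 if allele == 1 else 1.0 - freq1

def allele_prob(child, parent_allele, freq1, dist_morgans, mu, generations):
    """P(child allele | parent allele) at one marker along a connecting branch."""
    p_recomb = 1.0 - math.exp(-dist_morgans * generations)
    p_mut = 1.0 - math.exp(-mu * generations)
    # Without recombination the parental allele is inherited unless it mutates.
    inherit = (1.0 - p_mut) if child == parent_allele else p_mut
    # With recombination the allele comes from a random population chromosome.
    return p_recomb * founder_prob(child, freq1) + (1.0 - p_recomb) * inherit

# Marker 0.1 Mb from the locus at 0.5 cM per Mb is 0.0005 Morgans away.
p_same = allele_prob(1, 1, 0.4, 0.0005, 2.5e-5, 100)
p_diff = allele_prob(0, 1, 0.4, 0.0005, 2.5e-5, 100)
```

The probability of keeping the parental allele falls as the branch lengthens or the marker moves away from the disease locus. That decay is the signal used to locate the locus.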

44
MCMC algorithm (1)
  • Need to calculate joint posterior distribution
    $f(x, h, T, w, z, N, \rho, I \mid A, U)$.
  • Parameter space extremely complex, so cannot be
    calculated analytically.
  • Markov chain Monte Carlo (MCMC) algorithm
    approximates the posterior distribution by
    sampling from $f(x, h, T, w, z, N, \rho, I \mid A, U)$.
  • Computationally intensive, but becoming more
    practical with improvements in computing power.
  • Can handle missing SNP data: treat as augmented
    data in the same way as ancestral haplotypes.

45
MCMC algorithm (2)
  • Let S denote current set of model parameters
    $\{x, h, T, w, z, N, \rho, I\}$.
  • Propose small change to model parameters, S'.
  • Accept S' in place of S with probability
    $\min\{1, f(S' \mid A, U) / f(S \mid A, U)\}$
    (for a symmetric proposal).
  • If S' is not accepted, the current parameter set S
    is retained.
  • Initial burn-in to allow convergence to
    $f(S \mid A, U)$ from random starting parameter set.
  • Subsequent sampling period: parameter set
    recorded every r-th step of the algorithm; each
    recorded output represents a random draw from
    $f(S \mid A, U)$.
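
The accept/reject loop with burn-in and thinning can be sketched as a generic Metropolis sampler. This is not the presentation's sampler: the one-dimensional toy target (a standard normal log density, up to a constant), the symmetric uniform proposal and all names are assumptions made for illustration.

```python
import math
import random

def metropolis(log_post, start, propose, n_steps, burn_in, thin, rng):
    """Metropolis sampler with a symmetric proposal, burn-in and thinning."""
    current, current_lp = start, log_post(start)
    recorded = []
    for step in range(1, n_steps + 1):
        candidate = propose(current, rng)
        candidate_lp = log_post(candidate)
        # Accept with probability min(1, posterior ratio), computed on log scale.
        accept_prob = math.exp(min(0.0, candidate_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = candidate, candidate_lp
        # After burn-in, record the current state every `thin` steps.
        if step > burn_in and (step - burn_in) % thin == 0:
            recorded.append(current)
    return recorded

def toy_log_post(s):
    return -0.5 * s * s  # toy target (assumption), not the mapping posterior

def toy_propose(s, rng):
    return s + rng.uniform(-1.0, 1.0)  # symmetric small change

draws = metropolis(toy_log_post, 5.0, toy_propose, 22_000, 2_000, 10,
                   random.Random(0))
mean_draw = sum(draws) / len(draws)
```

Only the ratio of posterior densities is needed, so the normalising constant C never has to be computed.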

46
MCMC algorithm (3)
Excerpt of MCMC output, rows 101 to 110 (one recorded output per row).
Column labels shown on the slide: tree height, location, ρ, N,
log posterior probability.

101  0.47374  2557.62766  4.24189612  10849.19083  0.78104  -1769.51173
102  0.40629  2112.19993  4.16846454   8804.63049  0.79777  -1788.66623
103  0.46534  1679.71719  4.30423786   7229.90233  0.75364  -1854.19049
104  0.48211  2229.24788  4.33740414   9669.14899  0.78009  -1763.70173
105  0.43808  2402.10599  4.29011844  10305.31919  0.82178  -1760.56671
106  0.44607  2275.33453  4.03331587   9177.14285  0.82601  -1775.90300
107  0.41822  3016.70273  4.39000994  13243.35496  0.77768  -1844.20629
108  0.40934  2534.50113  4.07270615  10322.27832  0.81590  -1861.97411
109  0.41032  3122.91416  4.25386813  13284.46504  0.82479  -1814.27448
110  0.45020  3209.14218  4.34316471  13937.83307  0.78422  -1801.44160
48
Cystic fibrosis revisited
  • Assume a fixed recombination rate of 0.5cM per Mb
    and a marker mutation rate of 2.5 × 10^-5 per
    locus, per generation.
  • Each run of MCMC algorithm begins with a 20,000
    step burn-in period, which is thrown away.
  • Subsequent 200,000 step sampling period, output
    recorded every 50th step of the algorithm, giving
    4000 outputs.
  • Two analyses of CF data performed: control
    chromosomes (92) and (i) ΔF508 case chromosomes
    (62) only; (ii) all case chromosomes (94).
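
The run schedule can be checked with a few lines of arithmetic. The variable names are hypothetical, and the sketch assumes steps are counted from 1 and that the first output is recorded 50 steps after the end of burn-in.

```python
burn_in = 20_000    # steps discarded at the start of each run
sampling = 200_000  # steps in the sampling period
thin = 50           # record every 50th step

# Step numbers at which an output is recorded.
recorded_steps = list(range(burn_in + thin, burn_in + sampling + 1, thin))
n_outputs = len(recorded_steps)
```

This gives 200,000 / 50 = 4000 recorded outputs per run, matching the figure on the slide.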

50
Cystic fibrosis: summary statistics
51
Cystic fibrosis: genetic heterogeneity
  • Structure of shattered gene tree provides
    information about genetic heterogeneity at
    disease locus.
  • For each output of MCMC algorithm, record
    shattered gene tree.
  • For each pair of chromosomes, record whether they
    appear in the same sub-tree.
  • Over all outputs, estimate probability that each
    pair of chromosomes carry the same allele at the
    disease locus.
  • Cluster chromosomes according to these
    probabilities: cladogram to represent genetic
    heterogeneity.
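
The pairwise step can be sketched as follows, assuming each recorded MCMC output has been reduced to a mapping from chromosome to sub-tree label. That data layout, the function name and the toy outputs are assumptions.

```python
from itertools import combinations

def coclustering_probabilities(outputs):
    """Fraction of outputs in which each pair of chromosomes shares a sub-tree.

    `outputs` is a list of dicts mapping chromosome name -> sub-tree label.
    """
    chroms = sorted(outputs[0])
    counts = {pair: 0 for pair in combinations(chroms, 2)}
    for labels in outputs:
        for a, b in counts:
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: c / len(outputs) for pair, c in counts.items()}

# Four hypothetical outputs for three chromosomes.
outputs = [
    {"c1": 0, "c2": 0, "c3": 1},
    {"c1": 0, "c2": 0, "c3": 0},
    {"c1": 0, "c2": 1, "c3": 1},
    {"c1": 2, "c2": 2, "c3": 5},
]
probs = coclustering_probabilities(outputs)
```

One minus each probability can be used as a dissimilarity for a standard hierarchical clustering routine to draw the cladogram. Sub-tree labels only need to be comparable within a single output, not across outputs.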

54
SNP genotype data
  • SNP haplotypes rarely available.
  • Could infer haplotypes from SNP genotype data,
    G: PHASE, SNPHAP, HAPLOTYPER algorithms.
  • Better to treat haplotypes as augmented data in
    Bayesian framework:
  • $f(x \mid G) = \int\!\int\!\int\!\int f(x, M, I, A, U \mid G)\, dM\, dI\, dA\, dU$
  • and, for haplotypes A and U consistent with G,
  • $f(x, M, I, A, U \mid G) = C\, f(A, U, I \mid x, M)\, f(x, M)$
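
Treating haplotypes as augmented data means the sampler moves among haplotype pairs that are consistent with each observed genotype. The sketch below enumerates those pairs for one individual. It assumes biallelic SNPs with genotypes coded as the number of copies of allele 1 (0, 1 or 2); the coding and the function name are assumptions.

```python
from itertools import product

def consistent_pairs(genotype):
    """All unordered haplotype pairs compatible with an unphased genotype."""
    het = [j for j, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 copies -> allele 0, 2 copies -> allele 1.
    base = [g // 2 for g in genotype]
    pairs = set()
    for phase in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for j, allele in zip(het, phase):
            h1[j], h2[j] = allele, 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

pairs = consistent_pairs([1, 2, 1, 0])
```

An individual heterozygous at k > 0 SNPs has 2^(k-1) possible pairs, so the phase uncertainty grows quickly with marker density.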

55
Cystic fibrosis revisited again!
  • Create genotype data from original CF haplotype
    data.
  • Pair together case chromosomes at random.
  • Pair together control chromosomes at random.
  • Total sample: 46 controls and 47 cases.
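
The construction of genotype data can be sketched as a random pairing of haplotypes, with each genotype recorded as the per-marker allele count so that phase is discarded. The 0/1 coding, the function name and the dummy haplotypes are assumptions. The real data are the 23-marker CF haplotypes.

```python
import random

def pair_chromosomes(haplotypes, rng):
    """Shuffle haplotypes, pair them off, and return unphased genotypes."""
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    return [
        tuple(a + b for a, b in zip(h1, h2))
        for h1, h2 in zip(shuffled[0::2], shuffled[1::2])
    ]

# Dummy data (assumption): 94 case haplotypes over 3 markers.
rng = random.Random(0)
case_haps = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(94)]
case_genotypes = pair_chromosomes(case_haps, rng)
```

The 94 case chromosomes give 47 case genotypes, and the 92 control chromosomes would give 46 control genotypes in the same way.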

57
Cystic fibrosis: genotypes v haplotypes
58
Limitations
  • Computationally intensive: limited to sample
    sizes of about 100 cases and controls with up to
    20 SNPs.
  • Alternative approach: do not model gene tree
    explicitly; estimate shattered gene tree using
    standard clustering methods.

59
Summary
  • High density SNP map of the human genome now
    available.
  • Fine scale mapping of disease loci requires
    effective modelling of shared ancestry of sample
    of case and control chromosomes.
  • Methods exist for haplotype and genotype data;
    MCMC algorithms are very computationally
    intensive and are currently limited to relatively
    small sample sizes.
  • Further development is necessary.