Title: FINE SCALE MAPPING
1FINE SCALE MAPPING
- ANDREW MORRIS
- Wellcome Trust Centre for Human Genetics
- March 7, 2003
2Outline
- Introduction: fine scale mapping using high-density SNP haplotype data.
- Bayesian framework.
- Gene trees and the coalescent process.
- Genetic heterogeneity and shattered gene trees.
- Markov chain Monte Carlo (MCMC) algorithm.
- SNP genotype data.
- Example: cystic fibrosis.
3Introduction
- Candidate region of the order of 1Mb in length.
- Refine location of putative disease locus within region.
- Make use of high-density maps of single nucleotide polymorphisms (SNPs).
- Type sample of affected cases and unaffected controls.
4Once upon a time
- Disease predisposition determined by single locus in candidate region.
- Each case chromosome carries a copy of a disease allele, resulting from a single recent mutation event at disease locus.
- Each control chromosome carries a copy of the ancient normal allele at the disease locus.
8In an ideal world
- Excess sharing of SNP haplotypes in the vicinity of the disease locus, among cases and not among controls.
- Decreased probability of sharing as distance from disease locus increases.
- Approximate location of disease locus inferred.
9Problems
- Gene tree and ancestral haplotypes are unknown.
- Marker mutations lead to mismatch of alleles within preserved regions.
- Multiple disease genes, multiple mutations, and dominance.
10Example: Cystic fibrosis (CF)
- Fully penetrant recessive disorder, incidence 1/2500 live births in white populations, less common in other populations.
- Preliminary linkage analysis suggested 1.8Mb candidate region for a single CF gene on chromosome 7q31.
- More recently, a 3bp deletion, ΔF508, has been identified in the CFTR gene at 0.88Mb into the candidate region.
- Now known that ΔF508 accounts for 66% of all chromosomal mutations in individuals with CF.
- Remainder of CF chromosomes carry copies of many other rare mutations in the same gene.
- 23 RFLPs used to identify haplotypes in 92 control chromosomes and 94 case chromosomes, 62 of which have been confirmed to carry ΔF508.
13Challenges
- The ΔF508 locus does not lie at the centre of the region of high LD.
- Non-ΔF508 case chromosomes are not expected to share the same founder marker haplotype.
- Useful test-data set for fine-scale mapping methods.
16Published methods
17Bayesian framework (1)
- Assume disease locus exists in candidate region; aim is then to estimate its location.
- Approximate the posterior distribution of location.
- Allows assignment of probabilities that disease locus lies in any particular area of the candidate region.
18Bayesian framework (2)
- Aim is to approximate the posterior density of location, x, of the disease locus, given SNP haplotypes in cases A and controls U, denoted $f(x \mid A,U)$.
- Depends on other model parameters M, including gene tree, population haplotype frequencies, etc.
- Recover marginal posterior density by integration over these nuisance parameters:
- $f(x \mid A,U) = \int f(x,M \mid A,U)\,dM$
19Bayesian framework (3)
- By Bayes' theorem:
- $f(x,M \mid A,U) = C\, f(A,U \mid x,M)\, f(x,M)$
- $C$: normalising constant.
- $f(A,U \mid x,M)$: likelihood of haplotype data given model parameters M and location x.
- $f(x,M)$: prior density of M and x.
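As a rough illustration of the role of the normalising constant, the sketch below applies Bayes' theorem on a small grid of candidate locations. The likelihood values and the five-point grid are hypothetical numbers chosen for the example, not taken from the talk, and the nuisance parameters M are ignored.

```python
def grid_posterior(likelihood, prior):
    """Posterior on a grid: C * likelihood * prior, with C chosen so it sums to 1."""
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalised)
    if total <= 0:
        raise ValueError("likelihood * prior must have positive total mass")
    c = 1.0 / total  # the normalising constant C
    return [c * u for u in unnormalised]


# Hypothetical likelihood of the data at five candidate locations.
likelihood = [0.01, 0.04, 0.10, 0.04, 0.01]
# Uniform prior on location, as assumed later in the talk.
prior = [0.2] * 5
posterior = grid_posterior(likelihood, prior)
```

With a uniform prior the posterior is simply the likelihood rescaled to sum to one, so the most probable grid point is the one with the largest likelihood.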
23Control chromosomes
- Assumed to carry an ancient normal allele at the disease locus.
- Effects of recent shared ancestry of less importance, so simple model assumed:
- $f(A,U \mid x,M) = f(A \mid x,M)\, f(U \mid h)$
- The likelihood, $f(U \mid h)$, depends only on population SNP haplotype frequencies, h.
- For many SNPs, the number of possible haplotypes is large, so frequencies are parameterised in terms of allele frequencies and first-order LD between pairs of adjacent loci.
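One way such a parameterisation could work is a first-order Markov chain along the chromosome, sketched below for biallelic SNPs coded 0/1. The use of the pairwise LD coefficient D and the function names are assumptions made for illustration; the talk does not state the exact form used.

```python
from itertools import product


def pair_freq(a, b, p, q, d):
    """Two-locus haplotype frequency for alleles a, b (0 or 1).

    p and q are the allele-1 frequencies at the two loci; d is the LD coefficient D.
    """
    fa = p if a == 1 else 1 - p
    fb = q if b == 1 else 1 - q
    sign = 1 if a == b else -1
    return fa * fb + sign * d


def haplotype_prob(hap, freqs, lds):
    """Haplotype frequency under a first-order Markov model along the chromosome."""
    prob = freqs[0] if hap[0] == 1 else 1 - freqs[0]
    for j in range(len(hap) - 1):
        marginal = freqs[j] if hap[j] == 1 else 1 - freqs[j]
        # Multiply by P(allele at locus j+1 | allele at locus j).
        prob *= pair_freq(hap[j], hap[j + 1], freqs[j], freqs[j + 1], lds[j]) / marginal
    return prob


# Hypothetical allele frequencies and adjacent-pair LD for three SNPs.
freqs = [0.3, 0.5, 0.2]
lds = [0.05, 0.02]
total = sum(haplotype_prob(h, freqs, lds) for h in product((0, 1), repeat=3))
```

For L loci this needs L allele frequencies and L-1 LD terms rather than one frequency per possible haplotype, and the resulting frequencies still sum to one over all haplotypes.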
24Gene trees
- Representation of the recent shared ancestry of case chromosomes at the disease locus.
- Star shaped tree: each case chromosome descends independently from founder. Assumes there is too much information in sample about ancestral recombination and mutation events.
- Bifurcating tree: shared ancestral recombination and mutation events between chromosomes appear only once in their shared ancestry.
30Tree specification
- Topology, T: the branching pattern of the tree.
- Branch lengths, t, determined by the waiting times, w, between merging events in the gene tree.
- Scaled in units of 2N generations, where N is effective population size.
(Figure: gene tree with root and leaf nodes labelled.)
31Prior probability model
- Uniform prior probability model for population haplotype frequencies, the location of disease locus, and the effective population size.
- Each gene tree topology has equal prior probability.
- Prior probability model reduces to
- $f(x,M) = C\, f(w)$
- Need prior probability model for waiting times between merging events.
32The coalescent process (1)
- Time between merging event from k to k-1 lineages.
- Scaled in units of 2N generations.
- Exponential distribution with rate k(k-1)/2.
- Example, k = 8: exponential rate 8 × 7/2 = 28; expected time 1/28 ≈ 0.0357.
- Example, k = 7: exponential rate 7 × 6/2 = 21; expected time 1/21 ≈ 0.0476.
- Example, k = 2: exponential rate 2 × 1/2 = 1; expected time 1.
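The waiting-time distribution above can be simulated directly. The following is a minimal sketch under the constant-population coalescent, with times in units of 2N generations; the function names and the random seed are illustrative choices.

```python
import random


def coalescent_rate(k):
    """Rate of the exponential waiting time while k lineages remain."""
    return k * (k - 1) / 2


def expected_waiting_time(k):
    """Mean of the exponential waiting time from k to k-1 lineages."""
    return 1 / coalescent_rate(k)


def sample_waiting_times(n, rng):
    """Draw waiting times for k = n, n-1, ..., 2 lineages (units of 2N generations)."""
    return [rng.expovariate(coalescent_rate(k)) for k in range(n, 1, -1)]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
times = sample_waiting_times(8, rng)
tree_height = sum(times)  # time back to the common ancestor of all 8 lineages
expected_height = sum(expected_waiting_time(k) for k in range(2, 9))
```

Summing the expected waiting times for a sample of 8 gives 2(1 - 1/8) = 1.75, and most of that total comes from the final merger from 2 lineages to 1.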
36The coalescent process (2)
- Assumes constant effective population size, N.
- Flexible: can allow for exponential population growth and population sub-structure.
- Assumes sample is ascertained at random from the population. Problem: case chromosomes ascertained because they carry a copy of the disease mutation.
- Assumes sample has single common ancestor. Problem: genetic heterogeneity.
37The shattered coalescent model
- Generalisation of the coalescent process to allow branches of the gene tree to be removed.
- Introduce indicator variable, $z_b$, for each node, b, taking the value 1 if b has a parent in the gene tree and 0 otherwise.
- Allows for singleton leaf nodes, corresponding to sporadic case chromosomes, and disconnected sub-trees, corresponding to independent mutation events at the same disease locus.
- Assume number of branches of gene tree not removed in the shattered coalescent process given by binomial distribution, with shattering parameter ρ.
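The shattering step can be sketched as independent draws of the indicator variables. This is an illustrative reading, assuming each branch is kept independently with probability equal to the shattering parameter (called `rho` here), so that the number of retained branches is binomial; the function name is hypothetical.

```python
import random


def shatter(n_branches, rho, rng):
    """Draw indicators z_b: 1 if the branch above node b is kept, 0 if removed.

    Assumes each branch is kept independently with probability rho, so the
    number of kept branches follows a Binomial(n_branches, rho) distribution.
    """
    return [1 if rng.random() < rho else 0 for _ in range(n_branches)]


rng = random.Random(7)
z = shatter(20, 0.8, rng)   # hypothetical tree with 20 branches
n_kept = sum(z)             # branches retained in the shattered gene tree
```

Nodes with indicator 0 become roots of their own sub-tree, or singletons if they are leaves.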
41Ancestral haplotypes
- Haplotypes, I, carried by internal nodes of the gene tree are unknown.
- To calculate posterior probability, need to integrate over distribution of possible ancestral haplotypes, which depends on gene tree and other model parameters.
- Treated as augmented data in Bayesian framework: enters posterior probability through likelihood
- $f(x \mid A,U) = \int\!\int f(x,M,I \mid A,U)\,dM\,dI$
- and
- $f(x,M,I \mid A,U) = C\, f(A,U,I \mid x,M)\, f(x,M)$
42Likelihood calculations
- If node has no parent in shattered gene tree, treat as a random chromosome from the population (sporadic or founder for mutation).
- If node has parent in genealogy, likelihood depends on marker haplotype carried by the parental node, and the occurrence of recombination and mutation events along the connecting branch.
44MCMC algorithm (1)
- Need to calculate joint posterior distribution $f(x,h,T,w,z,N,\rho,I \mid A,U)$.
- Parameter space extremely complex, so cannot be calculated analytically.
- Markov chain Monte Carlo (MCMC) algorithm approximates the posterior distribution by sampling from $f(x,h,T,w,z,N,\rho,I \mid A,U)$.
- Computationally intensive, but becoming more practical with improvements in computing power.
- Can handle missing SNP data: treat as augmented data in the same way as ancestral haplotypes.
45MCMC algorithm (2)
- Let S denote current set of model parameters $(x,h,T,w,z,N,\rho,I)$.
- Propose small change to model parameters, S'.
- Accept S' in place of S with probability $\min\{1, f(S' \mid A,U)/f(S \mid A,U)\}$.
- If S' is not accepted, the current parameter set S is retained.
- Initial burn-in to allow convergence to $f(S \mid A,U)$ from random starting parameter set.
- Subsequent sampling period: parameter set recorded every rth step of the algorithm; each recorded output represents a random draw from $f(S \mid A,U)$.
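The accept/reject loop can be sketched on a toy problem. The example below is a generic Metropolis sampler for a single real-valued parameter with a symmetric proposal, not the talk's sampler over gene trees; the standard-normal target, step size, and run lengths are all assumptions chosen to keep the example small.

```python
import math
import random


def metropolis(log_post, start, n_burn, n_sample, thin, step, rng):
    """Metropolis sampler with a symmetric uniform proposal.

    Discards the first n_burn steps, then records every thin-th state.
    """
    s = start
    lp = log_post(s)
    recorded = []
    for i in range(n_burn + n_sample):
        s_new = s + rng.uniform(-step, step)  # propose a small change S'
        lp_new = log_post(s_new)
        # Accept S' with probability min(1, f(S'|data) / f(S|data)).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            s, lp = s_new, lp_new
        # Otherwise the current state S is retained.
        if i >= n_burn and (i - n_burn + 1) % thin == 0:
            recorded.append(s)
    return recorded


def toy_log_post(s):
    """Toy target: log density of a standard normal, up to an additive constant."""
    return -0.5 * s * s


rng = random.Random(42)
draws = metropolis(toy_log_post, start=5.0, n_burn=200, n_sample=2000,
                   thin=50, step=1.0, rng=rng)
```

Working with log posterior values avoids numerical underflow, which matters when the log posterior is of the order of -1800 as in the recorded output. Only the ratio of posterior densities is needed, so the normalising constant never has to be computed.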
46MCMC algorithm (3)
Slide labels: Tree height, Location, ρ, N, Log posterior probability
101  0.47374  2557.62766  4.24189612  10849.19083  0.78104  -1769.51173
102  0.40629  2112.19993  4.16846454   8804.63049  0.79777  -1788.66623
103  0.46534  1679.71719  4.30423786   7229.90233  0.75364  -1854.19049
104  0.48211  2229.24788  4.33740414   9669.14899  0.78009  -1763.70173
105  0.43808  2402.10599  4.29011844  10305.31919  0.82178  -1760.56671
106  0.44607  2275.33453  4.03331587   9177.14285  0.82601  -1775.90300
107  0.41822  3016.70273  4.39000994  13243.35496  0.77768  -1844.20629
108  0.40934  2534.50113  4.07270615  10322.27832  0.81590  -1861.97411
109  0.41032  3122.91416  4.25386813  13284.46504  0.82479  -1814.27448
110  0.45020  3209.14218  4.34316471  13937.83307  0.78422  -1801.44160
48Cystic fibrosis revisited
- Assume a fixed recombination rate of 0.5cM per Mb and a marker mutation rate of $2.5 \times 10^{-5}$ per locus, per generation.
- Each run of MCMC algorithm begins with a 20,000-step burn-in period, which is discarded.
- Subsequent 200,000-step sampling period, output recorded every 50th step of the algorithm, giving 4000 outputs.
- Two analyses of CF data performed: control chromosomes (92) with (i) ΔF508 case chromosomes (62) only; (ii) all case chromosomes (94).
50Cystic fibrosis summary statistics
51Cystic fibrosis genetic heterogeneity
- Structure of shattered gene tree provides information about genetic heterogeneity at disease locus.
- For each output of MCMC algorithm, record shattered gene tree.
- For each pair of chromosomes, record whether they appear in the same sub-tree.
- Over all outputs, estimate probability that each pair of chromosomes carry the same allele at the disease locus.
- Cluster chromosomes according to these probabilities: cladogram to represent genetic heterogeneity.
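The pairwise summary described above can be sketched as follows, assuming each recorded MCMC output has been reduced to a list of sub-tree labels, one per case chromosome. That input format and the function name are hypothetical; the four sampled outputs are made-up data for illustration.

```python
from itertools import combinations


def same_subtree_probabilities(samples):
    """Estimate, for each pair of chromosomes, how often they share a sub-tree.

    samples[s][i] is the sub-tree label of chromosome i in recorded output s.
    """
    n = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in samples:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(samples) for pair, c in counts.items()}


# Made-up sub-tree labels for 3 chromosomes over 4 recorded outputs.
samples = [
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
probs = same_subtree_probabilities(samples)
```

One minus these probabilities can serve as a pairwise distance for a standard hierarchical clustering routine, which would yield a cladogram of the kind described.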
54SNP genotype data
- SNP haplotypes rarely available.
- Could infer haplotypes from SNP genotype data: PHASE, SNPHAP, HAPLOTYPER algorithms.
- Better to treat haplotypes as augmented data in Bayesian framework:
- $f(x \mid G) = \int\!\int\!\int\!\int f(x,M,I,A,U \mid G)\,dM\,dI\,dA\,dU$
- and
- $f(x,M,I,A,U \mid G) = C\, f(A,U,I \mid x,M)\, f(x,M)$
55Cystic fibrosis revisited again!
- Create genotype data from original CF haplotype data.
- Pair together case chromosomes at random.
- Pair together control chromosomes at random.
- Total sample: 46 controls and 47 cases.
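The construction of unphased genotypes from haplotypes can be sketched as below. Haplotypes are coded as tuples of 0/1 alleles and a genotype as the per-locus allele count, which discards phase; this coding, the function name, and the randomly generated stand-in haplotypes are assumptions, since the real CF data are not reproduced here.

```python
import random


def make_genotypes(haplotypes, rng):
    """Pair chromosomes at random and return unphased genotypes.

    Each genotype is a tuple of per-locus counts (0, 1 or 2) of allele 1.
    Assumes an even number of haplotypes.
    """
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    pairs = zip(shuffled[0::2], shuffled[1::2])
    return [tuple(a + b for a, b in zip(h1, h2)) for h1, h2 in pairs]


rng = random.Random(3)
# Stand-in data: 94 case and 92 control chromosomes typed at 23 markers.
cases = [tuple(rng.randint(0, 1) for _ in range(23)) for _ in range(94)]
controls = [tuple(rng.randint(0, 1) for _ in range(23)) for _ in range(92)]
case_genotypes = make_genotypes(cases, rng)
control_genotypes = make_genotypes(controls, rng)
```

A heterozygous count of 1 at two or more loci leaves the phase ambiguous, which is the information the genotype-based analysis has to integrate over.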
57Cystic fibrosis genotypes v haplotypes
58Limitations
- Computationally intensive: limited to sample sizes of around 100 cases and controls with up to 20 SNPs.
- Alternative approach: do not model gene tree explicitly; estimate shattered gene tree using standard clustering methods.
59Summary
- High density SNP map of the human genome now available.
- Fine scale mapping of disease loci requires effective modelling of shared ancestry of sample of case and control chromosomes.
- Methods exist for haplotype and genotype data; MCMC algorithms are very computationally intensive and are currently limited to relatively small sample sizes.
- Further development is necessary.