Title: Advances in Population-based Studies of Complex Genetic Disorders
1. Advances in Population-based Studies of Complex Genetic Disorders
- NIHES Programme
- March 31 to April 4, 2003
- Erasmus MC
- Rotterdam
Faculty: Yurri Aulchenko, David Clayton, Cornelia van Duijn, Susan Service
2. Programme Overview
- Day 1: Basic Principles of Association
- Day 2: Population-based Studies
- Day 3: Family-based Studies
- Day 4: Linkage Disequilibrium and Haplotyping
- Day 5: Isolated Populations
3. Day 1: Basic Principles of Association
- Hardy-Weinberg Equilibrium
- Measures of Association
- Study Designs: Case-Control
4. Hardy-Weinberg Equilibrium
- In a large, randomly mating population with no selection, mutation or migration, allele and genotype frequencies will remain constant over generations
5. HWE
- Allele frequencies can be compared between cases and controls only under HWE
- If the population is not in HWE, then parental allele transmissions have not been independent / uncorrelated
- Therefore, classical statistics that assume independent alleles cannot be applied
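A common way to check the HWE assumption before comparing allele frequencies is a 1-df chi-square goodness-of-fit test on genotype counts. The sketch below is illustrative only: the function name and the counts are hypothetical, and it assumes a biallelic marker with both alleles observed.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg proportions.

    Assumes a biallelic marker with both alleles present in the sample.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # sample frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts with an excess of homozygotes
chi2, p_value = hwe_chi_square(400, 400, 200)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")
```

A small p value suggests departure from HWE, in which case a genotype-based comparison is the safer choice.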
6. Measures of Association
- In epidemiology, associations between disease and aetiological factors are usually expressed in terms of relative risk measures
- In the simplest case:
  relative risk = (measure of disease risk in exposed subjects) / (same measure of risk in unexposed subjects)
- Relative risks may be defined for genotypes, alleles or haplotypes
7. Genotype Relative Risks
- For a biallelic locus with alleles A, a, there are three genotypes: AA, Aa, aa
- We usually take one of these, e.g. aa, as reference:
  GRR_AA = Risk for AA / Risk for aa
  GRR_Aa = Risk for Aa / Risk for aa
- There is no standard for which genotype should be taken as reference, but usually it is the commonest
- CIs are easier to interpret when the reference is common
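A minimal sketch of the genotype relative risk definitions, assuming cohort-style counts from which risks can be computed directly (in a case-control sample the analogous quantities are odds ratios). The function name and all counts are hypothetical.

```python
def genotype_relative_risks(cases, totals, reference="aa"):
    """Risk in each genotype group divided by risk in the reference group."""
    risks = {g: cases[g] / totals[g] for g in totals}
    return {g: risks[g] / risks[reference] for g in risks}

# Hypothetical cohort: affected subjects and all subjects per genotype
cases = {"AA": 40, "Aa": 60, "aa": 50}
totals = {"AA": 200, "Aa": 600, "aa": 1000}

for genotype, grr in genotype_relative_risks(cases, totals).items():
    print(f"GRR_{genotype} = {grr:.2f}")
```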
8. Allelic Relative Risks
- Allelic relative risks, f_A and f_a, are defined by the multiplicative model
- One allele, e.g. a, is taken as reference so that f_a = 1
- GRR_AA = (f_A)^2 and GRR_Aa = f_A
- Assume HWE: each subject's 2 chromosomes are sampled independently from the population
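One consequence of the multiplicative model is worth working through: if the population is in HWE, the genotype distribution among cases is again in Hardy-Weinberg proportions, with a shifted allele frequency. The symbols p and q = 1 - p (population frequencies of A and a) and p' (frequency of A among cases) are introduced here for illustration and do not appear on the slide.

```latex
\begin{align*}
P(AA \mid \text{case}) &\propto p^2 f_A^2, \qquad
P(Aa \mid \text{case}) \propto 2pq\, f_A, \qquad
P(aa \mid \text{case}) \propto q^2 \\
p^2 f_A^2 + 2pq\, f_A + q^2 &= (p f_A + q)^2 \\
p' &= \frac{p f_A}{p f_A + q} \\
P(AA \mid \text{case}) &= p'^2, \qquad
P(Aa \mid \text{case}) = 2p'(1 - p'), \qquad
P(aa \mid \text{case}) = (1 - p')^2
\end{align*}
```

This is why, under this model, the two alleles of a case can be treated as independent draws and compared allele-wise with controls.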
9. Study Designs: Case-Control
- Healthy controls vs. population controls
- Healthy controls: matched for age, gender etc.
  - more power
- Population controls: randomly drawn
  - cost-effective
- The usefulness of additional controls drops beyond 4 controls per case (max)
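The 4-to-1 rule of thumb can be illustrated with a commonly quoted approximation, which is not stated on the slide and should be read as an assumption: with m controls per case, efficiency relative to an unlimited number of controls is roughly m / (m + 1).

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency of m controls per case relative to unlimited controls."""
    m = controls_per_case
    return m / (m + 1)

for m in range(1, 7):
    print(f"{m} controls per case: {relative_efficiency(m):.0%} of maximum efficiency")
```

Under this approximation the gain from each extra control shrinks quickly once m passes 4.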
10. Testing H0 against Alternative Models
- Multiplicative model: allele-wise comparison
- Dominant model: carriers vs. non-carriers
- Recessive model: homozygotes vs. rest
- Must have reason to think the effect is dominant / recessive
- If we don't know, the best compromise is the multiplicative model
- If all 3 tests are carried out, must correct for multiple, non-independent testing: randomise case-control status and repeat the 3 tests on each permutation
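The permutation correction for the three non-independent tests can be sketched as follows. This is an illustrative implementation under stated assumptions, not the course's own code: genotypes are coded as copies of allele A (0, 1, 2), each test is a Pearson chi-square on a 2x2 table, and the function names and data are hypothetical.

```python
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; 0.0 if any margin is empty."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def three_tests(genotypes, status):
    """Allelic, dominant and recessive chi-squares (status: 1 = case, 0 = control)."""
    cases = [g for g, s in zip(genotypes, status) if s]
    ctrls = [g for g, s in zip(genotypes, status) if not s]
    stats = []
    # (per-person score, maximum score per person)
    for score, top in ((lambda g: g, 2),             # allele-wise comparison
                       (lambda g: int(g >= 1), 1),   # carriers vs. non-carriers
                       (lambda g: int(g == 2), 1)):  # homozygotes vs. rest
        a = sum(score(g) for g in cases)
        c = sum(score(g) for g in ctrls)
        stats.append(chi2_2x2(a, top * len(cases) - a, c, top * len(ctrls) - c))
    return stats

def permutation_p(genotypes, status, n_perm=1000, seed=0):
    """P value for the largest of the 3 statistics, corrected by permuting status."""
    rng = random.Random(seed)
    observed = max(three_tests(genotypes, status))
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max(three_tests(genotypes, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical sample: 50 cases followed by 50 controls
genotypes = [2] * 30 + [1] * 10 + [0] * 10 + [0] * 30 + [1] * 10 + [2] * 10
status = [1] * 50 + [0] * 50
print(permutation_p(genotypes, status))
```

Taking the maximum statistic within each permutation keeps the correlation between the three tests intact, which a Bonferroni correction would ignore.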
11. Day 2: Population-based Studies
- Multiple Comparison Problems
- Confounding and Stratification
- Study Design: Matching
12. Possible Outcomes of a Statistical Test
- α = probability of a false +ve (type I error)
- β = probability of a false -ve (type II error)
- 1 - β = power of the test
13. Correcting for Multiple Testing
- Traditional solution: the Bonferroni method
- For k tests, reject H0 at the α/k level for each test
- Controls the probability of falsely rejecting at least 1 H0
- Overly conservative for large k and / or dependent tests: diminished power
- Appropriate when expecting 0 or 1 H0 to be false
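The Bonferroni rule is short enough to write out directly. The function name and p values below are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p value is at most alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]

p_values = [0.001, 0.012, 0.03, 0.2]
print(bonferroni_reject(p_values))  # per-test threshold is 0.05 / 4 = 0.0125
```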
14. New Paradigm: the False Discovery Rate
- Controls the expected proportion of false positives among the rejected H0
- More appropriate when expecting several H0 to be false / when tests are correlated
- More liberal than traditional methods
- Greatly increases power
- Order the p values from n tests from smallest to largest
- The FDR threshold is stringent for the 1st (smallest) p value, but gets less stringent for each later p value, as the number of remaining tests is reduced
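The ordered-threshold idea can be sketched with the Benjamini-Hochberg step-up procedure. The slide does not name a specific procedure, so treating it as Benjamini-Hochberg is an assumption; the function name, the level q and the p values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.

    The i-th smallest p value is compared with i * q / n; every hypothesis
    up to the largest rank that passes is rejected.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            cutoff = rank
    rejected = [False] * n
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.012, 0.03, 0.2]
print(benjamini_hochberg(p_values))
```

With these four p values the thresholds are 0.0125, 0.025, 0.0375 and 0.05, so the third test is rejected here although it would fail a Bonferroni threshold of 0.0125.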
15. Confounding and Stratification
- Spurious associations can be due to confounding by population stratification
- Can avoid difficulties by analysing within strata
- Stratified analysis:
  - loss of power due to little data in each stratum
  - useful when different strata are associated with different alleles / inverse effect of same allele
16. Different Approach
- Assume that the same effect exists across strata
- Sum contributions from each stratum
- Can use logistic regression
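One classical way to sum contributions across strata under a common effect is the Mantel-Haenszel odds ratio. It is used here as a simple stand-in for the logistic regression mentioned above, not as the method taught in the course; the function names and counts are hypothetical.

```python
def mantel_haenszel_or(strata):
    """Common odds ratio across strata.

    Each stratum is (a, b, c, d) = exposed cases, unexposed cases,
    exposed controls, unexposed controls.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def pooled_or(strata):
    """Odds ratio after collapsing all strata into one table (ignores stratification)."""
    a, b, c, d = (sum(col) for col in zip(*strata))
    return (a * d) / (b * c)

# Two hypothetical subpopulations with different allele and disease frequencies,
# and no association within either one
strata = [(80, 20, 40, 10), (10, 40, 20, 80)]
print("pooled OR:", pooled_or(strata))
print("Mantel-Haenszel OR:", mantel_haenszel_or(strata))
```

The pooled table shows an association that disappears once the contributions are summed within strata.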
17. Study Design: Matching
- Matching: maintain the same ratio of controls to cases in every stratum
- Also better sampling of controls
- Individually matched studies: each case has his/her own set of controls, defining a stratum; conditional logistic regression must be used
- Overmatching: matching for a variable which, while not a confounder, is related to the factor of interest; this reduces the effective sample size
18. Unobserved Stratification
- Random differences in allele frequencies between strata
- Two ways of tackling this have been proposed:
  - Estimate unobserved stratification empirically: Devlin and Roeder's genome-wide control
  - Unobserved stratification generates deviation from HWE and apparent LD between distant markers. Thus latent stratification can be modelled
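The empirical approach of Devlin and Roeder (commonly called genomic control) can be sketched as follows: an inflation factor is estimated from test statistics at unlinked null markers and used to deflate the statistic at the candidate marker. The function name and all values are hypothetical, and flooring the factor at 1 is a common convention assumed here.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def genomic_control(test_stat, null_stats):
    """Deflate a 1-df chi-square statistic by the inflation factor lambda.

    lambda is the median statistic at null markers divided by its expected
    value under no stratification, floored at 1.
    """
    lam = max(1.0, statistics.median(null_stats) / CHI2_1DF_MEDIAN)
    return test_stat / lam, lam

# Hypothetical chi-square statistics at unlinked null markers
null_stats = [0.2, 0.5, 0.9, 1.4, 2.0]
adjusted, lam = genomic_control(10.0, null_stats)
print(f"lambda = {lam:.2f}, adjusted chi2 = {adjusted:.2f}")
```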
19. Day 3: Family-based Studies
- The Transmission Disequilibrium Test
- Parental Origin
- Quantitative Traits
20. Transmission Disequilibrium Test
- The use of family-based controls overcomes the effects of unmeasured population stratification
- Case-parent trios: conditioning on parental genotypes gives the TDT
[Figure: the offspring genotype i/j is one of 1/3, 1/4, 2/3 or 2/4, each with probability 0.25]
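For a biallelic marker, the TDT reduces to a McNemar-type statistic on transmissions from heterozygous parents. The sketch below assumes alleles coded 1 and 2, complete and Mendelian-consistent trios, and hypothetical function names and data.

```python
import math

def tdt_counts(trios):
    """Count transmissions (b) and non-transmissions (c) of allele 1
    from heterozygous parents. Each trio is (father, mother, child),
    with genotypes given as tuples of two alleles coded 1 or 2."""
    b = c = 0
    for father, mother, child in trios:
        parents = (father, mother)
        het = sum(1 for g in parents if sorted(g) == [1, 2])
        hom1 = sum(1 for g in parents if sorted(g) == [1, 1])
        # Copies of allele 1 in the child not explained by 1/1 parents
        transmitted = child.count(1) - hom1
        b += transmitted
        c += het - transmitted
    return b, c

def tdt_statistic(b, c):
    """(b - c)^2 / (b + c), compared with chi-square on 1 df."""
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical trios
trios = ([((1, 2), (2, 2), (1, 2))] * 30
         + [((1, 2), (2, 2), (2, 2))] * 10
         + [((1, 2), (1, 2), (1, 1))] * 5)
b, c = tdt_counts(trios)
chi2, p_value = tdt_statistic(b, c)
print(b, c, chi2, p_value)
```

Only heterozygous parents contribute, which is what makes the test conditional on parental genotypes.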
21. Reconstructing Missing Parental Genotypes
[Figure: pedigree diagrams showing missing genotypes (?/?) alongside observed genotypes 1/2, 1/3 and 1/1]
22. Two Unlinked Loci
- For 2 unlinked loci, there are 16 transmission patterns, which are equiprobable in the population
- Can compare each case with 15 pseudocontrols
- Can look for GxG interactions using conditional logistic regression
- Less efficient than the case-only method, but resistant to population stratification
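The 16 transmission patterns can be enumerated directly: each parent passes on one of two alleles at each of the two loci. The sketch below builds the 15 pseudocontrols for one case; the function name and genotypes are hypothetical, and it assumes complete, Mendelian-consistent trios.

```python
from itertools import product

def pseudocontrols(father, mother, child):
    """Return the 15 two-locus genotypes the child could have received but did not.

    father, mother, child: (locus1_genotype, locus2_genotype), each genotype
    a tuple of two alleles.
    """
    patterns = []
    # One choice of allele (index 0 or 1) per parent per locus: 2**4 = 16 patterns
    for f1, m1, f2, m2 in product(range(2), repeat=4):
        g1 = tuple(sorted((father[0][f1], mother[0][m1])))
        g2 = tuple(sorted((father[1][f2], mother[1][m2])))
        patterns.append((g1, g2))
    case = (tuple(sorted(child[0])), tuple(sorted(child[1])))
    patterns.remove(case)  # raises ValueError if the trio is not Mendelian-consistent
    return patterns

father = ((1, 2), (1, 2))
mother = ((3, 4), (3, 4))
child = ((1, 3), (2, 4))
controls = pseudocontrols(father, mother, child)
print(len(controls))
```

The case and its pseudocontrols then form one matched set for conditional logistic regression.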
23. Parental Origin
- An important aspect of the TDT is the ability to differentiate allelic effects based on parental origin
- Intercross triads in which both parents and the offspring are 1/2 are excluded from analysis, since parental origin is ambiguous
- Several methods: TAT, PAT, CPG, CEPG
24. Quantitative Traits
- The weighted TDT (implemented in FBAT): a conditional logistic model
- Genotype effects appear as interactions with (y - μ), where y = trait value of offspring and μ = population mean for trait
- QTDT: robust to stratification / admixture
- Regression of trait value y on genotype score g
- Parent-of-origin effect also implemented in software
25. Day 4: LD and Haplotyping
- Measures of LD
- LD Problems: the Bias
- Estimating Haplotype Frequencies
- Haplotype Blocks
26. Measures of LD
- p11 = frequency of the 11 haplotype, etc.
- Most LD measures are based on the covariance
  D = p11 p22 - p21 p12
27. Lewontin's D'
- D' = D / Dmax, where
- Dmax = min(p1. p.1, p2. p.2) if D < 0
- Dmax = min(p1. p.2, p2. p.1) if D > 0
28. LD Correlation Coefficient r²
- r = D / √(p1. p2. p.1 p.2) is a measure of correlation
- r² = D² / (p1. p2. p.1 p.2)
  - the square of the correlation between marker alleles
- χ² for the 2x2 table is χ² = N D² / (p1. p2. p.1 p.2)
  - with 1 df, where N = sample size
  - p value for the significance of LD between the markers
- r² is closely related to the p value from χ²
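The LD measures defined above (D, Lewontin's D', r² and the associated χ²) can be computed together from the four haplotype frequencies. This is a minimal sketch with hypothetical names and frequencies; it assumes both markers are polymorphic.

```python
def ld_measures(p11, p12, p21, p22, n):
    """Return D, D', r^2 and chi-square for two biallelic markers.

    p11..p22 are haplotype frequencies summing to 1; n is the sample size
    used in the chi-square. Assumes both markers are polymorphic.
    """
    p1_, p2_ = p11 + p12, p21 + p22  # allele frequencies at the first marker
    p_1, p_2 = p11 + p21, p12 + p22  # allele frequencies at the second marker
    d = p11 * p22 - p21 * p12
    if d < 0:
        d_max = min(p1_ * p_1, p2_ * p_2)
    else:
        d_max = min(p1_ * p_2, p2_ * p_1)
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p1_ * p2_ * p_1 * p_2)
    return d, d_prime, r2, n * r2

# One haplotype is absent: D' reaches 1 while r^2 stays well below 1
d, d_prime, r2, chi2 = ld_measures(0.2, 0.0, 0.3, 0.5, n=100)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r2 = {r2:.2f}, chi2 = {chi2:.1f}")
```

The example illustrates why the two measures answer different questions: D' is maximal whenever one of the four haplotypes is missing, whereas r² also depends on how well the allele frequencies match.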
29. LD Problems: the Bias
- D' is biased upwards with smaller sample size
- Correct the problem by:
  - using different measures of LD, e.g. r²
  - using the bootstrap (Dboo)
  - using permutation (Dadj)
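The upward bias with small samples can be seen in a short simulation: haplotypes are drawn at two independent markers, so the true LD is zero, yet the average |D'| is far from zero when few haplotypes are sampled. Function names, the allele frequency and the sample sizes are hypothetical choices for illustration.

```python
import random

def abs_d_prime(haps):
    """|D'| from a list of haplotypes, each a pair of alleles coded 1 or 2."""
    n = len(haps)
    p11 = haps.count((1, 1)) / n
    p1_ = sum(1 for a, _ in haps if a == 1) / n
    p_1 = sum(1 for _, b in haps if b == 1) / n
    d = p11 - p1_ * p_1  # equal to p11 p22 - p21 p12
    if d < 0:
        d_max = min(p1_ * p_1, (1 - p1_) * (1 - p_1))
    else:
        d_max = min(p1_ * (1 - p_1), (1 - p1_) * p_1)
    return abs(d) / d_max if d_max > 0 else 0.0

def mean_abs_d_prime(n_haps, reps=500, freq=0.3, seed=1):
    """Average |D'| over simulated samples with no true LD."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        haps = [(1 if rng.random() < freq else 2,
                 1 if rng.random() < freq else 2) for _ in range(n_haps)]
        total += abs_d_prime(haps)
    return total / reps

for n_haps in (20, 100, 500):
    print(n_haps, round(mean_abs_d_prime(n_haps), 3))
```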
30. Estimating Haplotype Frequencies
- In the absence of family data:
  - Expectation-Maximization algorithm
  - Markov chain Monte Carlo algorithm
31. EM Algorithm
- Maximum likelihood technique
- Goal: find the haplotype frequencies that maximise the probability of the observed genotypes
- Assumption: HWE at all loci
- Limitation: the EM estimate may not be the global optimum
  - use different starting conditions to avoid convergence to a local maximum
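For two biallelic SNPs the EM algorithm is compact, because only double heterozygotes have ambiguous phase. The sketch below is a simplified illustration under HWE, with a fixed number of iterations and a uniform starting point; the function name, data layout and counts are hypothetical.

```python
def em_haplotypes(n, n_iter=100):
    """Estimate two-SNP haplotype frequencies from unphased genotype counts.

    n[i][j] = number of people with i copies of allele A at the first SNP
    and j copies of allele B at the second SNP (i, j in 0..2).
    """
    total = 2 * sum(sum(row) for row in n)  # number of chromosomes
    # Haplotype counts that can be read off without phase ambiguity
    base = {
        "AB": 2 * n[2][2] + n[2][1] + n[1][2],
        "Ab": 2 * n[2][0] + n[2][1] + n[1][0],
        "aB": 2 * n[0][2] + n[1][2] + n[0][1],
        "ab": 2 * n[0][0] + n[1][0] + n[0][1],
    }
    double_het = n[1][1]
    freq = {h: 0.25 for h in base}  # uniform starting condition
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries AB/ab
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts
        freq = {
            "AB": (base["AB"] + w * double_het) / total,
            "ab": (base["ab"] + w * double_het) / total,
            "Ab": (base["Ab"] + (1 - w) * double_het) / total,
            "aB": (base["aB"] + (1 - w) * double_het) / total,
        }
    return freq

# Hypothetical counts: 40 aa/bb, 40 double heterozygotes, 20 AA/BB
counts = [[40, 0, 0], [0, 40, 0], [0, 0, 20]]
print(em_haplotypes(counts))
```

Trying several starting frequencies instead of the uniform one is a simple way to check for convergence to a local maximum.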
32. Markov Chain Monte Carlo Algorithm
- Another approximation method
- Uses sampling to estimate expectations
- Operates on one person's haplotype resolution at a time
- MCMC can handle larger problems than EM
- MCMC provides estimates of uncertainty on phase-unknown calls
- Monitoring convergence of MCMC can be hard
33. Haplotype Blocks
- n SNPs: 2^n possible haplotypes
- Regions of extended haplotype conservation
- Definition:
  - by minimising haplotype diversity
  - by identifying regions with low recombination rate (D')
  - using common SNPs?
  - excluding low frequency haplotypes?
- Different methods / thresholds result in different block structure
34. HB Use in Association Mapping: Possible Strategy
- Genotype a subset of samples for all SNPs
- Define HBs and htSNPs
- Genotype entire sample for htSNPs
- Investigate association
- Will reduce cost
- Will facilitate haplotype approaches
- May not be common to different populations
- Is information lost? Simulation studies
35. Day 5: Isolated Populations
- Advantages of Population Isolates
- Disadvantages of Population Isolates
- Ancestral Haplotype Reconstruction
36. Advantages of Population Isolates
- Higher prevalence of some diseases
- More inbreeding
- More uniform genetic, environmental and cultural background
- Good genealogical records
- Easier to standardise phenotype definitions
- Wider intervals of LD
- Closer to HWE
37. Disadvantages of Population Isolates
- Possibly fewer affected individuals
- Difficult to replicate studies
- Markers not polymorphic
- Genes mapped may be less relevant to the rest of humanity
38. Ancestral Haplotype Reconstruction
- An LD mapping method for samples from population isolates
- Quantifies chromosome sharing among individuals affected with a common phenotype
- Equipped to deal with aetiologic heterogeneity