Title: Detecting and correcting for population structure in case-control studies
1 Detecting and correcting for population structure in case-control studies
2 Population structure in genetic association studies
- Population consists of underlying subpopulations.
- Disease prevalence differs between subpopulations.
- Cases are preferentially ascertained from specific subpopulations.
- False positive evidence of association will occur at genetic markers that differ in genotype frequencies between the subpopulations.
- Traditionally, human geneticists have been skeptical of case-control studies for this reason.
[Figure: panels labelled CASES and CONTROLS.]
3 Example
- Population consists of two equally frequent isolated subpopulations.
- In the population overall, $\Pr(\text{Disease} \cap MM) < \Pr(\text{Disease})\,\Pr(MM)$.
- If we ascertain individuals without regard to subpopulation, cases tend to be selected from subpopulation 1, which has a low frequency of the MM marker genotype.
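To make the example concrete, here is a minimal sketch using hypothetical numbers. The prevalences and MM genotype frequencies below are illustrative assumptions, not values from the slide. Disease and genotype are independent within each subpopulation, yet the pooled population shows an apparent negative association.

```python
# Hypothetical values chosen only for illustration (not taken from the slide).
weights = [0.5, 0.5]        # two equally frequent subpopulations
prevalence = [0.10, 0.02]   # Pr(Disease) in subpopulations 1 and 2
freq_mm = [0.10, 0.50]      # Pr(MM) in subpopulations 1 and 2


def pooled_probabilities(weights, prevalence, freq_mm):
    """Return (Pr(Disease), Pr(MM), Pr(Disease and MM)) in the pooled population,
    assuming disease and genotype are independent within each subpopulation."""
    pr_d = sum(w * d for w, d in zip(weights, prevalence))
    pr_mm = sum(w * m for w, m in zip(weights, freq_mm))
    pr_joint = sum(w * d * m for w, d, m in zip(weights, prevalence, freq_mm))
    return pr_d, pr_mm, pr_joint


pr_d, pr_mm, pr_joint = pooled_probabilities(weights, prevalence, freq_mm)
print(pr_joint, pr_d * pr_mm)  # joint probability vs product of the marginals
```

With these inputs the joint probability (about 0.01) falls below the product of the marginals (about 0.018), so MM looks protective in the pooled sample even though it has no effect in either subpopulation.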
4 Disease prevalence and marker allele frequencies vary across populations
5 Matching
- One solution to the problem is to allow for structure at the design stage, by matching cases and controls for ethnic group, for example.
- When a case is selected from a given ethnic group, a matched control is selected from the same group.
- Matched case-control studies require a matched analysis.
- However, there may be fine-scale structure within ethnic groups, or population admixture, that cannot be accounted for by matching.
- Apparent association between SNPs and type 2
diabetes in Pima Indians.
- Type 2 diabetes occurs with lower prevalence in Caucasian individuals than in Pima Indians.
- Association due to population admixture: cases tended to have a smaller proportion of Caucasian ancestry, and allele frequencies vary between the ancestral populations.
6 Solutions to the problem
- We can eliminate the problem of population structure by collecting family data.
  - Family-based association designs ascertain affected cases and their parents.
  - Form internal controls from alleles not transmitted from the parents to the child, effectively matching for ancestry.
  - Less powerful, since two parents are required to form a single matched control.
  - Parental data may not always be available, e.g. for late-onset diseases.
- For unrelated samples of cases and controls, we can make use of genotype data across the genome to make inferences about and/or adjust for population ancestry.
  - In the presence of structure, there will be many more (false) positive signals of association than we would expect by chance.
7 Test for mis-matching of case-control samples
- If an apparent association is due to population structure, there should be many such associations across the genome (because allele frequencies vary between populations).
- Pritchard and Rosenberg (1999) suggest a strategy of typing additional random markers scattered throughout the genome to test for the presence of structure.
- Type the sample at L unlinked markers and test for disease-marker association at each marker.
- Test statistic for structure: $X^2_{SUB} = \sum_{i=1}^{L} X_i^2$, where $X_i^2$ is the usual Cochran-Armitage trend test statistic for the ith marker.
- Under the null hypothesis of no population substructure:
  - We expect no association between disease and random markers.
  - $X^2_{SUB}$ has an approximate chi-squared distribution with L degrees of freedom.
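The test can be sketched in a few lines of Python. This is a minimal illustration, assuming genotype counts are already tabulated per marker; the function names and the example counts are hypothetical, and the trend statistic uses the standard additive scores (0, 1, 2).

```python
def trend_statistic(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 df) for one marker.

    cases, controls: genotype counts ordered by number of copies of one allele.
    """
    r_total = sum(cases)
    s_total = sum(controls)
    n_total = r_total + s_total
    col = [r + s for r, s in zip(cases, controls)]
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * n for x, n in zip(scores, col))
    sum_x2n = sum(x * x * n for x, n in zip(scores, col))
    denom = r_total * s_total * (n_total * sum_x2n - sum_xn ** 2)
    if denom == 0:
        return 0.0  # monomorphic marker or empty group: no information
    return n_total * (n_total * sum_xr - r_total * sum_xn) ** 2 / denom


def structure_statistic(tables):
    """Sum the trend statistics over L unlinked markers; returns (X2_SUB, L)."""
    stats = [trend_statistic(cases, controls) for cases, controls in tables]
    return sum(stats), len(stats)


# Hypothetical (cases, controls) genotype counts at three unlinked markers.
tables = [((10, 20, 30), (30, 20, 10)),
          ((20, 20, 20), (20, 20, 20)),
          ((15, 30, 15), (15, 30, 15))]
x2_sub, n_markers = structure_statistic(tables)
print(x2_sub, n_markers)
```

The summed statistic would then be compared with a chi-squared distribution on `n_markers` degrees of freedom, for example with `scipy.stats.chi2.sf(x2_sub, n_markers)` if SciPy is available.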
8 Comments
- This test looks for an average difference in ancestry between the case and control samples, not for population structure per se.
- It does not assign individuals to specific populations, but provides an overall test of substructure.
- If you have a large number of candidate loci, you may choose to use the candidate data together to test for mismatching.
- How many markers are required to detect structure?
  - Depends on the extent of structure in the population.
  - Pritchard and Rosenberg (1999) suggest 20 microsatellite markers or 30 SNPs to detect moderate stratification.
  - As the price of genotyping decreases, it becomes more feasible to use a large number of markers to test for structure, and hence identify fine-scale stratification.
9 Genomic control
- Devlin and Roeder (1999) used theoretical arguments to propose that, with population structure, the distribution of Cochran-Armitage trend tests, genome-wide, is inflated by a constant multiplicative factor $\lambda$.
- We can estimate the multiplicative inflation factor using the statistic $\hat{\lambda} = \mathrm{median}(X_i^2)/0.456$, where 0.456 is the median of a chi-squared distribution with 1 degree of freedom.
- Inflation factor $\lambda > 1$ indicates population structure and/or genotyping error.
- We can carry out an adjusted test of association that takes account of any mismatching of cases/controls at any SNP using the statistic $X_i^2/\lambda$.
[Figure: association test statistics, with regions annotated "True hits?" and "Structure?"; inflation factor $\lambda$ = 1.11.]
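As a minimal sketch of the genomic control adjustment, assuming a list of 1 df trend statistics has already been computed across the genome (the function name and the toy input are hypothetical):

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.456  # median of a chi-squared distribution with 1 df


def genomic_control(stats):
    """Estimate the inflation factor and return (lambda_hat, adjusted statistics)."""
    lam = median(stats) / CHI2_1DF_MEDIAN
    adjusted = [x / lam for x in stats]
    return lam, adjusted


# Toy input: three trend statistics whose median is 1.5 times the null median.
lam, adjusted = genomic_control([0.1, 0.456 * 1.5, 3.0])
print(lam, adjusted)
```

In practice many implementations also floor the estimate at 1 so that statistics are never inflated by the correction; that convention is not part of the description above and is left out of this sketch.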
10 Comments
- Advantages.
  - Easy to implement genomic control in whole genome association studies.
  - Requires relatively small numbers of markers (minimum of around 50 SNPs).
  - Can be extended to the analysis of quantitative traits and to genotype frequencies obtained from pooled DNA.
- Disadvantages.
  - Limited to relatively simple tests of association, and is less well suited to haplotype tests, for example.
  - There will be a loss in power if there are different genetic effects acting in the different subpopulations.
11 Structured association
- A two-phase approach.
- Use genotype data from unlinked genetic markers to learn about population structure, and to infer the ancestry of individuals in the sample (Pritchard et al. 2000a).
  - Makes use of Bayesian MCMC technology to approximate the posterior probability that any individual comes from a specific subpopulation, or the proportion of ancestry of any individual from each subpopulation.
  - Implemented in the STRUCTURE software package.
- Test for association at candidate loci taking account of the ancestry of cases and controls (Pritchard et al. 2000b).
  - Implemented in the STRAT software package.
12 STRUCTURE
- Consider a sample of unrelated individuals typed at many unlinked markers (i.e. not in LD), yielding genotypes G.
- Assume that the sampled individuals have been ascertained from K discrete subpopulations.
- Goal is to estimate allele frequencies, $p_K$, within each subpopulation (under the assumption of Hardy-Weinberg equilibrium) and the assignments, Z, of each individual to each subpopulation.
- Conditional on allele frequencies, $p_K$, we can write down an expression for the posterior distribution of subpopulation assignments for the ith individual using Bayes' theorem:

$$\Pr(z_i = J \mid G, p_K) = \frac{\Pr(z_i = J)\,\Pr(G_i \mid z_i = J, p_K)}{\sum_{j=1}^{K} \Pr(z_i = j)\,\Pr(G_i \mid z_i = j, p_K)}$$

  where $G_i$ denotes the genotypes of the ith individual and $\Pr(z_i = J)$ is the prior probability of assignment to subpopulation J, typically taken to be 1/K.
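The assignment step can be illustrated with a short sketch. It assumes biallelic SNPs coded as minor-allele counts, known allele frequencies strictly between 0 and 1, and Hardy-Weinberg genotype probabilities; the function name and inputs are hypothetical and this is not the STRUCTURE implementation itself.

```python
import math


def assignment_posterior(genotype, freqs, prior=None):
    """Posterior probability that one individual belongs to each subpopulation.

    genotype: minor-allele counts (0, 1 or 2), one per unlinked SNP.
    freqs:    freqs[k][l] is the minor-allele frequency of SNP l in subpopulation k.
    prior:    prior assignment probabilities (defaults to 1/K each).
    """
    n_pops = len(freqs)
    if prior is None:
        prior = [1.0 / n_pops] * n_pops
    log_post = []
    for k in range(n_pops):
        total = math.log(prior[k])
        for g, p in zip(genotype, freqs[k]):
            # Hardy-Weinberg genotype probability: Binomial(2, p)
            total += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1 - p)
        log_post.append(total)
    # Normalise on the log scale to avoid underflow with many markers.
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


post = assignment_posterior([2, 2, 0], [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]])
print(post)
```

In this toy call the genotypes match the first subpopulation's allele frequencies far better, so almost all posterior mass is placed on it.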
13
- Similarly, conditional on the subpopulation assignments of each individual, Z, we can write down an expression for the posterior distribution of allele frequencies, $\Pr(p_K \mid Z, G)$, given a pre-specified prior density $\Pr(p_K)$.
- STRUCTURE uses Bayesian Markov chain Monte Carlo (MCMC) techniques to sample, in turn, from the densities $\Pr(p_K \mid Z, G)$ and $\Pr(Z \mid p_K, G)$. The algorithm is run for an initial burn-in period to allow for convergence. In the subsequent sampling period, sampled values of the population allele frequencies and subpopulation assignments for each individual are recorded to approximate the marginal posterior distributions of $p_K$ and Z.
- One drawback is the choice of K, the number of subpopulations, which will not generally be known in advance. Ad hoc assessment of the best choice of K is made by comparison of model fit (likelihoods).
- The basic structure model assumes that each individual belongs to just one discrete subpopulation. STRUCTURE also allows for admixed populations, where the subpopulation assignments, Z, are replaced by vectors of ancestry, Q, where $q_{ik}$ denotes the proportion of the ancestry of the ith individual that comes from subpopulation k.
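The alternating sampling scheme can be sketched as a small Gibbs sampler for the no-admixture model. This is an illustrative toy, not the STRUCTURE program: it assumes biallelic SNPs coded as minor-allele counts, a uniform Beta(1, 1) prior on allele frequencies, a uniform 1/K prior on assignments, and a hypothetical function name.

```python
import numpy as np


def gibbs_no_admixture(G, K, n_burn=200, n_samp=300, seed=0):
    """Approximate Pr(z_i = k | G) for each individual (rows) and cluster (columns).

    G: (n_individuals, n_snps) array of minor-allele counts (0, 1 or 2).
    """
    G = np.asarray(G)
    rng = np.random.default_rng(seed)
    n, n_snps = G.shape
    z = rng.integers(K, size=n)              # random initial assignments
    visits = np.zeros((n, K))
    for it in range(n_burn + n_samp):
        # Step 1: sample allele frequencies from Pr(p | Z, G), a Beta posterior.
        p = np.empty((K, n_snps))
        for k in range(K):
            members = G[z == k]
            minor = members.sum(axis=0)
            total = 2 * members.shape[0]
            p[k] = rng.beta(1 + minor, 1 + total - minor)
        p = np.clip(p, 1e-6, 1 - 1e-6)       # guard against log(0)
        # Step 2: sample assignments from Pr(Z | p, G) under Hardy-Weinberg.
        # The binomial coefficient is constant across clusters and is omitted.
        loglik = G @ np.log(p).T + (2 - G) @ np.log(1 - p).T
        loglik -= loglik.max(axis=1, keepdims=True)
        prob = np.exp(loglik)
        prob /= prob.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=row) for row in prob])
        if it >= n_burn:                     # record only after burn-in
            visits[np.arange(n), z] += 1
    return visits / n_samp
```

Cluster labels are arbitrary, so the columns of the returned matrix can only be interpreted up to a permutation. A real analysis would also run several chains and check convergence.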
14 Example
Study of 1056 individuals from 52 populations using 377 autosomal microsatellite markers. Rosenberg et al. (2002).
15 Example (continued)
Each individual is a thin vertical line that is partitioned into K coloured segments according to its membership coefficients in K clusters.
16 Example (continued)
Inferred ancestry within geographic regions. 93% of variability in allele frequencies occurs within populations.
17 STRAT
- In a structured population, the null hypothesis of interest is of no association of disease phenotype with marker genotypes within each subpopulation. The alternative hypothesis is that genotype frequencies vary with subpopulation and disease phenotype.
- Allows for the possibility of different genetic effects in each subpopulation (allelic or genetic heterogeneity).
- Conditional on the ancestries (Q) estimated by STRUCTURE, we can construct a test statistic, $\Lambda$, by computing the likelihood ratio under the two hypotheses.
- Allele frequencies ($p_0$ and $p_1$) in each subpopulation under the two hypotheses are estimated by maximum likelihood, using the expectation-maximisation algorithm.
- Simulation is used to assess the significance of $\Lambda$.
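In symbols, the likelihood ratio can be written roughly as follows. This is a sketch of the general form only: the symbol C for the candidate-locus genotypes is introduced here for illustration, and $\hat{p}_0$, $\hat{p}_1$ denote the maximum likelihood estimates of the allele frequencies under the null and alternative hypotheses.

```latex
\Lambda = \frac{\Pr(C \mid \hat{p}_1, Q)}{\Pr(C \mid \hat{p}_0, Q)}
```

Under the null hypothesis $p_0$ depends on subpopulation only, whereas under the alternative $p_1$ depends on both subpopulation and disease phenotype, so large values of $\Lambda$ indicate association within subpopulations.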
18 Comments
- Advantages.
  - Versatility: can allow for discrete or admixed subpopulations, and can utilise SNPs and microsatellites.
  - Provides a general framework for allowing for structure in association tests; could be extended to multi-locus or haplotype tests.
  - Provides detailed information about population structure.
- Disadvantages.
  - Use and interpretation require some care (for example, choice of number of subpopulations).
  - Requires more markers than genomic control.
  - Too computationally intensive for whole genome data; requires selection of a subset of unlinked markers.
19 Multivariate techniques
- Principal components analysis (PCA) has become a standard tool in genetics to study geographic variation in allele frequencies.
- PCA is used to infer continuous axes of genetic variation (eigenvectors) that reduce the data to a small number of dimensions, whilst describing as much of the variability between individuals as possible.
- Patterson et al. (2006) demonstrate that PCA can be used to identify population structure in large scale data sets with hundreds of thousands of genetic markers, and can allow for LD between loci.
- Observed genotypes and phenotypes can be continuously adjusted by amounts attributable to ancestry along each axis (Price et al. 2006), effectively matching cases and controls.
- Computing association statistics using the ancestry adjusted genotypes and phenotypes will take account of population structure.
- Implemented in the EIGENSTRAT software package.
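The adjustment described above can be sketched with NumPy. This is a simplified illustration in the spirit of the approach, not the EIGENSTRAT program: the function name is hypothetical, genotypes are assumed complete (no missing data), and the statistic is taken to be (N - K - 1) times the squared correlation between adjusted genotype and adjusted phenotype, with N individuals and K axes.

```python
import numpy as np


def pca_adjusted_statistics(G, y, n_axes=2):
    """Return (per-SNP association statistics, axes of variation).

    G: (n_individuals, n_snps) minor-allele counts; y: phenotype (e.g. 0/1).
    """
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    n = G.shape[0]
    # Normalise each polymorphic SNP: centre and scale by sqrt(p(1 - p)).
    p = G.mean(axis=0) / 2
    keep = (p > 0) & (p < 1)
    X = (G[:, keep] - 2 * p[keep]) / np.sqrt(p[keep] * (1 - p[keep]))
    # Leading left singular vectors give the top axes of variation.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    axes = U[:, :n_axes]
    # Remove the component of genotypes and phenotype along each axis.
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    G_adj = Gc - axes @ (axes.T @ Gc)
    y_adj = yc - axes @ (axes.T @ yc)
    # Squared correlation between adjusted genotype and adjusted phenotype.
    num = (G_adj * y_adj[:, None]).sum(axis=0) ** 2
    den = (G_adj ** 2).sum(axis=0) * (y_adj ** 2).sum()
    r2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return (n - n_axes - 1) * r2, axes
```

Each statistic would be referred to a chi-squared distribution with 1 degree of freedom. How many axes to retain is a modelling choice that this sketch leaves to the caller.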
20 Example 1. Three African populations.
Eigenvector 1 separates out the San population. Eigenvector 2 separates the Bantu and Mandenka populations, although the structure is less obvious.
21 Example 2. Three East Asian populations.
The first two eigenvectors (together) separate the Japanese from the Chinese and Northern Thai. The large dispersal of the Thai population along a line where the Chinese are at an extreme suggests some gene flow of a Chinese-related population into Thailand.
22 Simulation study
- Price et al. (2006) generated data at 100,000 random SNPs for 500 cases and 500 controls from two different subpopulations: 60% of cases and 40% of controls came from subpopulation 1.
- Three types of SNPs simulated:
  - Random SNPs with no association to disease, but allele frequency differences between the two subpopulations at the level we would expect between European populations.
  - Differentiated SNPs with no association to disease, but allele frequencies of 0.8 in subpopulation 1 and 0.2 in subpopulation 2.
  - Causal SNPs associated with disease, with allele frequencies generated in the same way as random SNPs, and a multiplicative relative risk of 1.5 for the causal allele.
- Cochran-Armitage trend test performed; correction for structure made using genomic control and EIGENSTRAT.
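A quick calculation shows why the differentiated SNPs are so problematic without correction. The sketch below uses the sampling proportions and allele frequencies stated above; as a simplifying assumption it applies an allelic Pearson chi-square to expected allele counts rather than the trend test, and the function names are hypothetical.

```python
def mixture_frequency(prop_from_pop1, freq_pop1, freq_pop2):
    """Expected allele frequency in a sample drawn from two subpopulations."""
    return prop_from_pop1 * freq_pop1 + (1 - prop_from_pop1) * freq_pop2


def allelic_chi_square(freq_cases, freq_controls, n_cases, n_controls):
    """Pearson chi-square (1 df) for a 2 x 2 table of expected allele counts."""
    a = 2 * n_cases * freq_cases
    b = 2 * n_cases * (1 - freq_cases)
    c = 2 * n_controls * freq_controls
    d = 2 * n_controls * (1 - freq_controls)
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


f_cases = mixture_frequency(0.6, 0.8, 0.2)      # 60% of cases from subpopulation 1
f_controls = mixture_frequency(0.4, 0.8, 0.2)   # 40% of controls from subpopulation 1
stat = allelic_chi_square(f_cases, f_controls, 500, 500)
print(f_cases, f_controls, stat)
```

The expected allele frequencies are about 0.56 in cases and 0.44 in controls, giving a statistic of about 28.8. That is well above the 1 df chi-squared threshold of roughly 15.1 for p = 10^-4, even though the SNP has no effect on disease.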
23 Simulation study results
- Proportion of SNPs in each category yielding significant evidence of association with $p < 10^{-4}$.
- Type I error rate of random SNPs is correct for genomic control and EIGENSTRAT, but slightly inflated without correction due to population structure.
- Type I error rate of differentiated SNPs is correct only for EIGENSTRAT.
- Minimal reduction in power to detect causal SNPs for EIGENSTRAT compared to the uncorrected test.
24 Multi-dimensional scaling
- We can measure the similarity between pairs of individuals by means of their identity by state (IBS) across the genome. Over M markers, the IBS between the ith and jth individuals is computed from $G_{ik}$ and $G_{jk}$, where $G_{ik}$ denotes the number of minor alleles (0, 1 or 2) carried by the ith individual at SNP k.
- Multi-dimensional scaling aims to detect axes of variation between individuals that maximise the dissimilarities between them.
- In this context, it is very closely related to principal components analysis, and yields similar results to EIGENSTRAT.
- Incorporate eigenvectors as covariates in a logistic regression model to eliminate the effects of population structure.
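A minimal sketch of these two steps follows. It assumes one common definition of IBS sharing, namely 1 - (1/2M) Σ_k |G_ik - G_jk|, which may differ from the exact formula on the original slide, and uses classical (metric) multi-dimensional scaling on the distances 1 - IBS. Function names are hypothetical.

```python
import numpy as np


def ibs_matrix(G):
    """Pairwise IBS sharing, assuming IBS_ij = 1 - sum_k |G_ik - G_jk| / (2M)."""
    G = np.asarray(G, dtype=float)
    n_markers = G.shape[1]
    diff = np.abs(G[:, None, :] - G[None, :, :]).sum(axis=2)
    return 1.0 - diff / (2.0 * n_markers)


def classical_mds(D, n_axes=2):
    """Classical multi-dimensional scaling of a symmetric distance matrix D."""
    n = D.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * centring @ (D ** 2) @ centring      # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_axes]        # largest eigenvalues first
    top = np.clip(vals[order], 0.0, None)          # drop small negative eigenvalues
    return vecs[:, order] * np.sqrt(top)


# Toy genotypes: three individuals typed at two SNPs.
G = np.array([[0, 0], [2, 2], [0, 2]])
S = ibs_matrix(G)
coords = classical_mds(1.0 - S, n_axes=2)
print(S)
```

The columns of `coords` play the role of the axes of variation. They could then be entered as covariates alongside the SNP genotype in a logistic regression fitted with a statistics package of choice.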
25 References
- Devlin B, Roeder K (1999). Genomic control for association studies. Biometrics 55: 997-1004.
- Patterson N, Price AL, Reich D (2006). Population structure and eigenanalysis. PLoS Genetics 2: 2074-2093.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38: 904-909.
- Pritchard JK, Rosenberg N (1999). Use of unlinked genetic markers to detect population stratification in association studies. American Journal of Human Genetics 65: 220-228.
- Pritchard JK, Stephens M, Donnelly P (2000a). Inference of population structure using multilocus genotype data. Genetics 155: 945-959.
- Pritchard JK, Stephens M, Rosenberg N, Donnelly P (2000b). Association mapping in structured populations. American Journal of Human Genetics 67: 170-181.
- Rosenberg N, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002). Genetic structure of human populations. Science 298: 2381-2385.
26 Comments
- Advantages.
  - Multivariate techniques are computationally efficient and can be applied in the context of whole genome association studies.
  - The axes of variation can be interpreted in terms of population structure, and with large numbers of SNPs can clearly differentiate between even relatively similar subpopulations and admixed groups.
- Disadvantages.
  - Some care is needed in interpretation of the eigenvectors (for example, they may indicate extended regions of LD, rather than population structure).
27 Summary
- Population structure can lead to spurious associations if disease prevalence and allele frequencies vary between subpopulations.
- We can use information from markers scattered throughout the genome to test for the presence of structure, to identify groups of individuals with similar ancestry, and to correct association tests for mismatching of cases and controls.
- STRUCTURE provides a very detailed view of population structure, but is limited to a few thousand markers because of computational constraints.
- Multivariate statistical techniques are less computationally intensive, and provide a simple way to correct association tests for structure in a whole genome association study.