Detecting and correcting for population structure in case-control studies

Provided by: amo131
1
Detecting and correcting for population structure
in case-control studies
2
Population structure in genetic association
studies
  • Population consists of underlying subpopulations.
  • Disease prevalence different between
    subpopulations.
  • Cases preferentially ascertained from specific
    subpopulations.
  • False positive evidence of association will occur
    at genetic markers that differ in genotype
    frequencies between the subpopulations.
  • Traditionally, human geneticists have been
    skeptical of case-control studies for this reason.

[Figure: CASES and CONTROLS panels]
3
Example
  • Population consists of two equally frequent
    isolated sub-populations.
  • In the population overall, Pr(Disease, MM) <
    Pr(Disease) Pr(MM).
  • If we ascertain individuals without regard to
    subpopulation, cases tend to be selected from
    subpopulation 1, which has a low frequency of the
    MM marker genotype.
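
The arithmetic behind this example can be checked with a few lines of Python. The prevalences and genotype frequencies below are hypothetical values chosen only for illustration (they are not taken from the slides); within each subpopulation, disease and genotype are assumed to be independent.

```python
# Hypothetical numbers: two equally frequent, isolated subpopulations.
subpops = [
    {"weight": 0.5, "prevalence": 0.10, "freq_MM": 0.1},  # subpopulation 1
    {"weight": 0.5, "prevalence": 0.02, "freq_MM": 0.5},  # subpopulation 2
]

# Marginal probabilities in the population overall.
pr_disease = sum(s["weight"] * s["prevalence"] for s in subpops)
pr_mm = sum(s["weight"] * s["freq_MM"] for s in subpops)

# Joint probability: disease and genotype are independent WITHIN each
# subpopulation, so the joint is a weighted sum of within-group products.
pr_disease_and_mm = sum(
    s["weight"] * s["prevalence"] * s["freq_MM"] for s in subpops
)

print(f"Pr(Disease, MM)       = {pr_disease_and_mm:.4f}")
print(f"Pr(Disease) * Pr(MM)  = {pr_disease * pr_mm:.4f}")
```

Although there is no association inside either subpopulation, the joint probability is smaller than the product of the marginals, so MM appears protective in a pooled analysis.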

4
Disease prevalence and marker allele frequencies
vary across populations
5
Matching
  • One solution to the problem is to allow for
    structure at the design stage, by matching cases
    and controls for ethnic group, for example.
  • When a case is selected from a given ethnic
    group, a matched control is selected from the
    same group.
  • Matched case-control studies require a matched
    analysis.
  • However, there may be fine-scale structure
    within ethnic groups, or population admixture,
    that cannot be accounted for by matching.
  • Example: apparent association between SNPs and
    type 2 diabetes in Pima Indians.
  • Type 2 diabetes occurs with lower prevalence in
    Caucasian individuals than in Pima Indians.
  • Association due to population admixture: controls
    tended to have a greater proportion of Caucasian
    ancestry, and allele frequencies vary between the
    ancestral populations.

6
Solutions to the problem
  • We can eliminate the problem of population
    structure by collecting family data.
  • Family-based association designs ascertain
    affected cases and their parents.
  • Form internal controls from alleles not
    transmitted from the parents to the child,
    effectively matching for ancestry.
  • Less powerful since two parents are required to
    form a single matched control.
  • Parental data may not always be available, e.g.
    for late-age onset diseases.
  • For unrelated samples of cases and controls, we
    can make use of genotype data across the genome
    to make inferences about and/or adjust for
    population ancestry.
  • In the presence of structure, there will be many
    more (false) positive signals of association than
    we would expect by chance.

7
Test for mis-matching of case-control samples
  • If an apparent association is due to population
    structure, there should be many such associations
    across the genome (because allele frequencies
    vary between populations).
  • Pritchard and Rosenberg (1999) suggest a strategy
    of typing additional random markers scattered
    throughout the genome to test for the presence of
    structure.
  • Type the sample at L unlinked markers and test
    for disease-marker association at each marker.
  • Test statistic for structure:
    $X^2_{SUB} = \sum_{i=1}^{L} X_i^2$,
    where $X_i^2$ is the usual Cochran-Armitage trend
    test statistic for the ith marker.
  • Under the null hypothesis of no population
    substructure:
  • We expect no association between disease and
    random markers.
  • $X^2_{SUB}$ has an approximate chi-squared
    distribution with L degrees of freedom.
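
A minimal sketch of this test, assuming genotype counts (0, 1 or 2 copies of an allele) are available for cases and controls at each marker. The function names and the counts are hypothetical; the trend statistic uses the standard additive weights (0, 1, 2), and the structure statistic is taken to be the sum of the L single-marker statistics.

```python
def trend_statistic(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 df) from genotype counts."""
    R, S = sum(cases), sum(controls)          # numbers of cases and controls
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    # Weighted difference between case and control genotype counts.
    t = sum(w * (r * S - s * R) for w, r, s in zip(weights, cases, controls))
    sum_w = sum(w * n for w, n in zip(weights, totals))
    sum_w2 = sum(w * w * n for w, n in zip(weights, totals))
    variance = (R * S / N) * (N * sum_w2 - sum_w ** 2)
    return t * t / variance if variance > 0 else 0.0


def structure_statistic(markers):
    """Sum of trend statistics over L unlinked markers."""
    return sum(trend_statistic(cases, controls) for cases, controls in markers)


# Hypothetical genotype counts at L = 3 unlinked markers:
# each entry is (case counts, control counts) for genotypes 0, 1, 2.
markers = [
    ((30, 50, 20), (40, 45, 15)),
    ((10, 40, 50), (12, 38, 50)),
    ((60, 30, 10), (45, 40, 15)),
]
x2_sub = structure_statistic(markers)
print(f"X2_SUB = {x2_sub:.2f} on {len(markers)} degrees of freedom")
```

The resulting value would be compared with a chi-squared distribution on L degrees of freedom; the p-value calculation is left out here to keep the sketch free of external libraries.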

8
Comments
  • This test looks for an average difference in
    ancestry between the case and control samples,
    not for population structure per se.
  • Does not assign individuals to specific
    populations, but provides an overall test of
    substructure.
  • If you have a large number of candidate loci, you
    may choose to use the candidate data together to
    test for mismatching.
  • How many markers are required to detect
    structure?
  • Depends on the extent of structure in the
    population.
  • Pritchard and Rosenberg (1999) suggest 20
    microsatellite markers or 30 SNPs to detect
    moderate stratification.
  • As the price of genotyping decreases it becomes
    more feasible to use a large number of markers to
    test for structure, and hence identify fine-scale
    stratification.

9
Genomic control
  • Devlin and Roeder (1999) used theoretical
    arguments to propose that with population
    structure, the distribution of Cochran-Armitage
    trend tests, genome-wide, is inflated by a
    constant multiplicative factor λ.
  • We can estimate the multiplicative inflation
    factor using the statistic
    $\hat{\lambda} = \mathrm{median}(X_i^2)/0.456$,
    where 0.456 is the median of a chi-squared
    distribution with 1 degree of freedom.
  • Inflation factor λ > 1 indicates population
    structure and/or genotyping error.
  • We can carry out an adjusted test of association
    that takes account of any mismatching of
    cases/controls at any SNP using the statistic
    $X_i^2/\hat{\lambda}$.
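
Genomic control is short enough to sketch directly. The example below assumes a list of single-marker trend statistics is already available; the function name and the input values are hypothetical. The divisor is approximately 0.456, the median of a chi-squared distribution with 1 degree of freedom.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.456  # approximate median of chi-squared with 1 df


def genomic_control(trend_stats):
    """Return the inflation factor and the adjusted statistics."""
    lam = median(trend_stats) / CHI2_1DF_MEDIAN
    adjusted = [x / lam for x in trend_stats]
    return lam, adjusted


# Hypothetical trend statistics from seven markers.
trend_stats = [0.05, 0.30, 0.50, 0.912, 1.40, 3.20, 12.0]
lam, adjusted = genomic_control(trend_stats)
print(f"inflation factor = {lam:.2f}")
print("adjusted statistics:", [round(x, 2) for x in adjusted])
```

Every statistic is divided by the same factor, so the ranking of markers is unchanged; only the significance of each marker is reduced.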

[Figure annotations: "True hits?" and "Structure?"; inflation factor λ = 1.11]
10
Comments
  • Advantages.
  • Easy to implement genomic control in whole genome
    association studies.
  • Requires relatively small numbers of markers
    (minimum of around 50 SNPs).
  • Can be extended to the analysis of quantitative
    traits and to genotype frequencies obtained from
    pooled DNA.
  • Disadvantages.
  • Limited to relatively simple tests of
    association, and is less well suited to haplotype
    tests, for example.
  • There will be a loss in power if there are
    different genetic effects acting in the different
    subpopulations.

11
Structured association
  • A two-phase approach
  • Use genotype data from unlinked genetic markers
    to learn about population structure, and to infer
    the ancestry of individuals in the sample
    (Pritchard et al. 2000a).
  • Makes use of Bayesian MCMC technology to
    approximate the posterior probability that any
    individual comes from a specific subpopulation,
    or the proportion of ancestry of any individual
    from each subpopulation.
  • Implemented in the STRUCTURE software package.
  • Test for association at candidate loci taking
    account of the ancestry of cases and controls
    (Pritchard et al. 2000b)
  • Implemented in the STRAT software package.

12
STRUCTURE
  • Consider a sample of unrelated individuals typed
    at many unlinked markers (i.e. not in LD),
    yielding genotypes G.
  • Assume that the sampled individuals have been
    ascertained from K discrete subpopulations.
  • Goal is to estimate allele frequencies, pK,
    within each subpopulation (under the assumption
    of Hardy-Weinberg equilibrium) and the
    assignments, Z, of each individual to each
    subpopulation.
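  • Conditional on allele frequencies, pK, we can
    write down an expression for the posterior
    distribution of subpopulation assignments for the
    ith individual using Bayes' theorem:
    $\Pr(z_i = J \mid G_i, p_K) = \frac{\Pr(z_i = J)\,\Pr(G_i \mid z_i = J, p_K)}{\sum_{k=1}^{K} \Pr(z_i = k)\,\Pr(G_i \mid z_i = k, p_K)}$,
    where $G_i$ denotes the genotypes of the ith
    individual and $\Pr(z_i = J)$ is the prior
    probability of assignment to subpopulation J,
    typically taken to be 1/K.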
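
The posterior assignment probabilities can be illustrated for the simplest case. This sketch assumes biallelic markers, Hardy-Weinberg equilibrium within each subpopulation, independence across unlinked markers and a uniform 1/K prior; the function names and allele frequencies are hypothetical and this is not the STRUCTURE code itself.

```python
from math import comb


def genotype_likelihood(genotypes, freqs):
    """Pr(G_i | subpopulation) under HWE, multiplying over unlinked markers."""
    likelihood = 1.0
    for g, p in zip(genotypes, freqs):
        # g copies of the allele whose frequency is p (binomial with n = 2).
        likelihood *= comb(2, g) * p ** g * (1 - p) ** (2 - g)
    return likelihood


def posterior_assignment(genotypes, freqs_by_subpop):
    """Posterior probability of each subpopulation, with a 1/K prior."""
    K = len(freqs_by_subpop)
    prior = 1.0 / K
    joint = [prior * genotype_likelihood(genotypes, f) for f in freqs_by_subpop]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical allele frequencies at three markers in K = 2 subpopulations.
freqs_by_subpop = [
    [0.9, 0.8, 0.7],  # subpopulation 1
    [0.2, 0.3, 0.4],  # subpopulation 2
]
individual = [2, 2, 1]  # allele counts at the three markers
posterior = posterior_assignment(individual, freqs_by_subpop)
print([round(x, 3) for x in posterior])
```

With only three markers the assignment is already strongly in favour of subpopulation 1, because the allele frequencies differ so much between the two groups.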

13
  • Similarly, conditional on the subpopulation
    assignments of each individual, Z, we can write
    down an expression for the posterior distribution
    of allele frequencies, Pr(pK | Z, G), given a
    pre-specified prior density Pr(pK).
  • STRUCTURE uses Bayesian Markov chain Monte Carlo
    (MCMC) techniques to sample, in turn, from the
    densities Pr(pK | Z, G) and Pr(Z | pK, G). The
    algorithm is run for an initial burn-in period to
    allow for convergence. In the subsequent sampling
    period, sampled values of the population allele
    frequencies and subpopulation assignments for
    each individual are recorded to approximate the
    marginal posterior distributions of pK and Z.
  • One drawback is the choice of K, the number of
    subpopulations, which will not generally be known
    in advance. The best choice of K is assessed ad
    hoc by comparison of model fit (likelihoods).
  • The basic structure model assumes that each
    individual belongs to just one discrete
    subpopulation. STRUCTURE also allows for admixed
    populations, where the subpopulation assignments,
    Z, are replaced by vectors of ancestry, Q, where
    $q_{ik}$ denotes the proportion of the ancestry of
    the ith individual that comes from subpopulation k.
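
The alternating sampling scheme can be sketched as a toy Gibbs sampler for the basic (no admixture) model. This is an illustration under simplifying assumptions, not the STRUCTURE implementation: markers are biallelic, the prior on each allele frequency is taken to be Beta(1, 1), and the function names, sweep counts and simulated data are all hypothetical.

```python
import random


def gibbs_structure(G, K, burn_in=200, samples=300, seed=1):
    """Toy Gibbs sampler; G[i][l] is the allele count (0, 1 or 2).

    Returns, for each individual, the fraction of recorded sweeps in which
    it was assigned to each of the K subpopulations.
    """
    rng = random.Random(seed)
    n, L = len(G), len(G[0])
    Z = [rng.randrange(K) for _ in range(n)]      # random initial assignments
    counts = [[0] * K for _ in range(n)]
    for sweep in range(burn_in + samples):
        # Step 1: sample allele frequencies given assignments, p | Z, G.
        p = []
        for k in range(K):
            members = [G[i] for i in range(n) if Z[i] == k]
            p.append([
                rng.betavariate(1 + sum(g[l] for g in members),
                                1 + sum(2 - g[l] for g in members))
                for l in range(L)
            ])
        # Step 2: sample assignments given frequencies, Z | p, G.
        # The uniform 1/K prior and binomial coefficients cancel.
        for i in range(n):
            weights = []
            for k in range(K):
                w = 1.0
                for l in range(L):
                    w *= p[k][l] ** G[i][l] * (1 - p[k][l]) ** (2 - G[i][l])
                weights.append(w)
            Z[i] = rng.choices(range(K), weights=weights)[0]
            if sweep >= burn_in:                  # record after burn-in only
                counts[i][Z[i]] += 1
    return [[c / samples for c in row] for row in counts]


# Hypothetical data: 10 individuals from each of two diverged groups.
sim_rng = random.Random(0)


def simulate(freqs, n):
    return [[sum(sim_rng.random() < p for _ in range(2)) for p in freqs]
            for _ in range(n)]


G = simulate([0.9] * 10, 10) + simulate([0.1] * 10, 10)
membership = gibbs_structure(G, K=2)
for row in membership:
    print([round(x, 2) for x in row])
```

Cluster labels are arbitrary in this kind of sampler, so it is the grouping of individuals that is meaningful, not which cluster is called 0 or 1.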

14
Example
Study of 1056 individuals from 52 populations
using 377 autosomal microsatellite markers.
Rosenberg et al. (2002).
15
Example (continued)
Each individual is a thin vertical line that is
partitioned into K coloured segments according to
its membership coefficients in K clusters.
16
Example (continued)
Inferred ancestry within geographic regions. 93%
of variability in allele frequencies occurs
within populations.
17
STRAT
  • In a structured population, the null hypothesis
    of interest is of no association of disease
    phenotype with marker genotypes within each
    subpopulation. The alternative hypothesis is
    that genotype frequencies vary with subpopulation
    and disease phenotype.
  • Allows for the possibility of different genetic
    effects in each subpopulation (allelic or genetic
    heterogeneity).
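  • Conditional on the ancestries (Q) estimated by
    STRUCTURE, we can construct a test statistic by
    computing the likelihood ratio under the two
    hypotheses:
    $\Lambda = \frac{\Pr(G \mid \hat{p}_1, Q)}{\Pr(G \mid \hat{p}_0, Q)}$,
    where the denominator and numerator are the
    likelihoods of the candidate-locus genotypes
    under the null and alternative hypotheses,
    respectively.
  • Allele frequencies (p0 and p1) in each
    subpopulation under the two hypotheses are
    estimated by maximum likelihood, using the
    expectation-maximisation algorithm.
  • Simulation is used to assess the significance of
    Λ.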
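
The shape of the likelihood ratio can be seen in a deliberately simplified setting. The sketch below assumes every individual has a known, discrete subpopulation assignment, so the maximum likelihood allele frequencies are simple proportions and no EM step is needed; STRAT itself works with the estimated ancestries Q. Function names and allele counts are hypothetical, and the simulation step for assessing significance is not shown.

```python
from math import log


def binomial_loglik(a, b):
    """Maximised log-likelihood for a copies of one allele and b of the other."""
    n = a + b
    # Terms with a zero count contribute nothing (0 * log 0 is taken as 0).
    return sum(c * log(c / n) for c in (a, b) if c > 0)


def log_likelihood_ratio(allele_counts):
    """log(Lambda) summed over subpopulations.

    allele_counts[k] = ((a_case, b_case), (a_control, b_control)).
    Null: one allele frequency per subpopulation.
    Alternative: separate frequencies for cases and controls.
    """
    log_lambda = 0.0
    for (a1, b1), (a0, b0) in allele_counts:
        alt = binomial_loglik(a1, b1) + binomial_loglik(a0, b0)
        null = binomial_loglik(a1 + a0, b1 + b0)
        log_lambda += alt - null
    return log_lambda


# Hypothetical allele counts in two subpopulations.
allele_counts = [
    ((45, 55), (60, 140)),   # subpopulation 1: cases carry the allele more often
    ((80, 120), (35, 65)),   # subpopulation 2: little difference
]
print(f"log(Lambda) = {log_likelihood_ratio(allele_counts):.3f}")
```

Because the comparison is made within each subpopulation, a difference in allele frequency between subpopulations alone does not inflate the statistic.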

18
Comments
  • Advantages.
  • Versatility: can allow for discrete or admixed
    subpopulations, and can utilise SNPs and
    microsatellites.
  • Provides a general framework for allowing for
    structure in association tests: could be extended
    to multi-locus or haplotype tests.
  • Provides detailed information about population
    structure.
  • Disadvantages.
  • Use and interpretation requires some care (for
    example choice of number of subpopulations).
  • Requires more markers than genomic control.
  • Too computationally intensive for whole genome
    data: requires selection of a subset of unlinked
    markers.

19
Multivariate techniques
  • Principal components analysis (PCA) has become a
    standard tool in genetics to study geographic
    variation in allele frequencies.
  • PCA is used to infer continuous axes of genetic
    variation (eigenvectors) that reduce the data to
    a small number of dimensions, whilst describing
    as much of the variability between individuals as
    possible.
  • Patterson et al. (2006) demonstrate that PCA can
    be used to identify population structure in large
    scale data sets with hundreds of thousands of
    genetic markers, and can allow for LD between
    loci.
  • Observed genotypes and phenotypes can be
    continuously adjusted by amounts attributable to
    ancestry along each axis (Price et al. 2006),
    effectively matching cases and controls.
  • Computing association statistics using the
    ancestry adjusted genotypes and phenotypes will
    take account of population structure.
  • Implemented in the EIGENSTRAT software package.
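
The adjustment along an axis of variation can be sketched without external libraries. This is an illustration only, not the EIGENSTRAT implementation: a single axis is extracted, power iteration is swapped in for a full eigendecomposition, and the function names, genotype matrix and phenotypes are hypothetical.

```python
import random
from math import sqrt


def normalise(G):
    """Centre each SNP and scale by sqrt(p(1-p)), p = allele frequency."""
    n, m = len(G), len(G[0])
    X = [[0.0] * m for _ in range(n)]
    for j in range(m):
        mean = sum(row[j] for row in G) / n
        p = mean / 2
        scale = sqrt(p * (1 - p))
        if scale == 0:
            continue  # monomorphic SNP: leave the column at zero
        for i in range(n):
            X[i][j] = (G[i][j] - mean) / scale
    return X


def top_axis(X, iterations=200, seed=0):
    """Leading eigenvector of X X^T (one entry per individual)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        u = [sum(X[i][j] * v[i] for i in range(n)) for j in range(m)]  # X^T v
        v = [sum(X[i][j] * u[j] for j in range(m)) for i in range(n)]  # X u
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def adjust(values, axis):
    """Remove the component of `values` along a unit-length ancestry axis."""
    gamma = sum(a * x for a, x in zip(axis, values))
    return [x - gamma * a for a, x in zip(axis, values)]


# Hypothetical data: 6 individuals (rows) typed at 5 SNPs (columns).
G = [
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 1],
    [2, 2, 2, 1, 0],
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [1, 0, 0, 2, 1],
]
phenotype = [1, 1, 0, 0, 0, 1]                 # 1 = case, 0 = control

axis = top_axis(normalise(G))
adjusted_phenotype = adjust(phenotype, axis)
adjusted_snp = adjust([row[0] for row in G], axis)
print("axis:", [round(a, 2) for a in axis])
print("adjusted phenotype:", [round(x, 2) for x in adjusted_phenotype])
```

Association statistics would then be computed from the adjusted genotypes and phenotypes; with more than one axis, the same adjustment would be repeated for each axis in turn.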

20
Example 1. Three African populations.
Eigenvector 1 separates out the San population.
Eigenvector 2 separates the Bantu and Mandenka
populations, although the structure is less
obvious.
21
Example 2. Three East Asian populations. The
first two eigenvectors (together) separate the
Japanese from the Chinese and Northern Thai. The
large dispersal of the Thai population along a
line where the Chinese are at an extreme suggests
some gene flow of a Chinese related population
into Thailand
22
Simulation study
  • Price et al. (2006) generated data at 100,000
    random SNPs for 500 cases and 500 controls from
    two different subpopulations: 60% of cases and
    40% of controls came from subpopulation 1.
  • Three types of SNPs simulated:
  • Random SNPs with no association to disease, but
    allele frequency differences between the two
    subpopulations at the level we would expect
    between European populations.
  • Differentiated SNPs with no association to
    disease, but allele frequencies of 0.8 in
    subpopulation 1 and 0.2 in subpopulation 2.
  • Causal SNPs associated with disease, with allele
    frequencies generated in the same way as random
    SNPs, and a multiplicative relative risk of 1.5
    for the causal allele.
  • Cochran-Armitage trend test performed;
    correction for structure made using genomic
    control and EIGENSTRAT.

23
Simulation study results
  • Proportion of SNPs in each category yielding
    significant evidence of association with
    p < 10^-4.
  • Type I error rate of random SNPs correct for
    genomic control and EIGENSTRAT, but slightly
    inflated without correction due to population
    structure.
  • Type I error rate of differentiated SNPs correct
    only for EIGENSTRAT.
  • Minimal reduction in power to detect causal SNPs
    for EIGENSTRAT compared to uncorrected test.

24
Multi-dimensional scaling
  • We can measure the similarity between pairs of
    individuals by means of their identity by state
    (IBS) across the genome. Over M markers, the IBS
    between the ith and jth individuals is given by
    $\mathrm{IBS}_{ij} = 1 - \frac{1}{2M} \sum_{k=1}^{M} |G_{ik} - G_{jk}|$,
    where $G_{ik}$ denotes the number of minor alleles
    (0, 1 or 2) carried by the ith individual at SNP
    k.
  • Multi-dimensional scaling aims to detect axes of
    variation between individuals that maximise the
    dissimilarities between them.
  • In this context, very closely related to
    principal components analysis, and yields similar
    results to EIGENSTRAT.
  • Incorporate eigenvectors as covariates in a
    logistic regression model to eliminate the
    effects of population structure.
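
Pairwise IBS can be computed directly from minor allele counts. The sketch below assumes one common definition, the proportion of alleles shared identical by state averaged over markers, that is 1 minus the summed absolute difference in allele counts divided by 2M; the function names and genotypes are hypothetical.

```python
def ibs(g_i, g_j):
    """Proportion of alleles shared IBS between two individuals."""
    M = len(g_i)
    return 1 - sum(abs(a - b) for a, b in zip(g_i, g_j)) / (2 * M)


def ibs_matrix(G):
    """Symmetric matrix of pairwise IBS values for all individuals."""
    n = len(G)
    return [[ibs(G[i], G[j]) for j in range(n)] for i in range(n)]


# Hypothetical minor allele counts for three individuals at four SNPs.
G = [
    [0, 1, 2, 2],
    [1, 1, 2, 0],
    [0, 1, 2, 2],
]
for row in ibs_matrix(G):
    print([round(x, 3) for x in row])
```

Multi-dimensional scaling would then be applied to the dissimilarities, 1 minus IBS, to obtain the axes of variation used as covariates.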

25
References
  • Devlin B, Roeder K (1999). Genomic control for
    association studies. Biometrics 55: 997-1004.
  • Patterson N, Price AL, Reich D (2006). Population
    structure and eigenanalysis. PLoS Genetics 2:
    2074-2093.
  • Price AL, Patterson NJ, Plenge RM, Weinblatt ME,
    Shadick NA, Reich D (2006). Principal components
    analysis corrects for stratification in
    genome-wide association studies. Nature Genetics
    38: 904-909.
  • Pritchard JK, Rosenberg N (1999). Use of unlinked
    genetic markers to detect population
    stratification in association studies. American
    Journal of Human Genetics 65: 220-228.
  • Pritchard JK, Stephens M, Donnelly P (2000a).
    Inference of population structure using
    multilocus genotype data. Genetics 155: 945-959.
  • Pritchard JK, Stephens M, Rosenberg N, Donnelly
    P (2000b). Association mapping in structured
    populations. American Journal of Human Genetics
    67: 170-181.
  • Rosenberg N, Pritchard JK, Weber JL, Cann HM,
    Kidd KK, Zhivotovsky LA, Feldman MW (2002).
    Genetic structure of human populations. Science
    298: 2381-2385.

26
Comments
  • Advantages.
  • Multivariate techniques are computationally
    efficient and can be applied in the context of
    whole genome association studies.
  • The axes of variation can be interpreted in terms
    of population structure, and with large numbers
    of SNPs can clearly differentiate between even
    relatively similar subpopulations and admixed
    groups.
  • Disadvantages.
  • Some care is needed in interpretation of the
    eigenvectors (for example, they may indicate
    extended regions of LD rather than population
    structure).

27
Summary
  • Population structure can lead to spurious
    associations if disease prevalence and allele
    frequencies vary between subpopulations.
  • We can use information from markers scattered
    throughout the genome to test for the presence of
    structure, identify groups of individuals with
    similar ancestry, and to correct association
    tests for mismatching of cases and controls.
  • STRUCTURE provides a very detailed view of
    population structure, but is limited to a few
    thousand markers because of computational
    constraints.
  • Multivariate statistical techniques are less
    computationally intensive, and can be used to
    simply correct association tests for structure in
    a whole genome association study.