False discoveries and models for gene discovery

1
False discoveries and models for gene discovery
  • Van den Oord EJCG, Sullivan PF (2003) Trends in
    Genetics 19(10):537-542

2
Conclusions of Statistical Tests
  • Conclusion   True state of relationship (unknown)
                 True                              False
    Accept H1    No error (sensitivity)            Type I error (false positive)
    Accept H0    Type II error (false negative)    No error (specificity)
  • H0: null hypothesis (no association)
  • H1: alternative hypothesis (association)

3
Problems with Multiple Testing
  • A p-value tells you how often you would see an
    association at least that strong by chance alone.
  • You choose your significance threshold in the
    context of the test that you are performing
    (e.g. P < 0.05).
  • If you test multiple loci (e.g. 20 SNPs) for
    association then you increase the chance of
    making a Type I error.
  • With 20 tests at P < 0.05 you therefore expect
    to see one significant association by chance
    alone.
  • Traditionally a Bonferroni correction is applied
    to account for this (e.g. P < 0.0025).
  • This is overly conservative because it does not
    account for correlation between p-values.
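The arithmetic behind the 20-SNP example can be checked with a short sketch. It assumes the 20 tests are independent and that no SNP has a true effect; the function names are illustrative, not from the paper.

```python
def expected_false_positives(m, alpha):
    """Expected number of significant results when all m null tests use threshold alpha."""
    return m * alpha


def familywise_error_rate(m, alpha):
    """Probability of at least one false positive across m independent null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m, alpha):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


m, alpha = 20, 0.05  # values taken from the slide
print("expected false positives:", expected_false_positives(m, alpha))
print("chance of at least one:", round(familywise_error_rate(m, alpha), 4))
print("Bonferroni threshold:", bonferroni_threshold(m, alpha))
print("FWER after Bonferroni:",
      round(familywise_error_rate(m, bonferroni_threshold(m, alpha)), 4))
```

Under independence the corrected family-wise rate comes out just below 0.05. With correlated SNPs the true rate is lower still, which is why the correction is described as overly conservative.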

4
Other lines of Evidence
  • Biological plausibility.
  • Large sample sizes (increase the accuracy of the
    test and avoid problems of sampling bias).
  • Independent replication.
  • Animal models.
  • Sequencing to identify causal variants.
  • Functional tests.

5
Problems with this approach
  • Miss true effects.
  • Loci involved in complex diseases will have small
    effects (i.e. limited power to detect
    association).
  • Number of tests required increases if
    interactions are tested (further compounds the
    multiple testing problem).
  • Lack of knowledge about biology of disease.
  • Limited resources.

6
Effect of Sample Size on Power

7
Effect of Significance Level on Power

8
Bonferroni Corrections
  • Select an a priori probability for declaring
    significance (arbitrarily, α = 0.05).
  • Count the number of tests being performed (m).
  • Calculate the corrected critical p-value Pk:
  • Pk = α / m
  • Results in very low power; reasonable for
    monogenic disorders, where there is only one true
    causative locus.
  • Overly conservative for complex diseases, where
    each locus will have a small effect.
  • Greater implication of finding one locus
    affecting disease in monogenic disorders than in
    complex ones.
  • I'm not sure I agree with this statement: it
    could potentially be more expensive to follow up
    a false positive in a complex disease.
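A rough sketch of how the correction Pk = α / m erodes power. It assumes a two-sided z-test and a hypothetical standardised effect `delta = 2.8`, chosen only because it gives about 80% power at α = 0.05. None of these numbers come from the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def z_test_power(delta, alpha):
    """Power of a two-sided z-test whose statistic has mean delta under H1."""
    crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(delta - crit) + STD_NORMAL.cdf(-delta - crit)


alpha = 0.05
delta = 2.8  # hypothetical effect size, for illustration only
for m in (1, 20, 200):
    p_k = alpha / m  # Bonferroni-corrected critical p-value
    print(f"m={m:4d}  Pk={p_k:.6f}  power={z_test_power(delta, p_k):.3f}")
```

Power falls steadily as m grows while the effect size stays fixed, which is the sense in which the correction gives very low power for small-effect loci.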

9
The Global Community
  • We do not work in isolation; there is a global
    community of scientists looking at human
    genetics.
  • Testing disease loci against different
    phenotypes also adds to the multiple testing.
  • 1000 different groups testing the same locus for
    association at α = 0.05, each assuming m = 1,
    would produce 50 false positives by chance
    alone.
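The figure of 50 is an expectation, not a fixed count. A minimal sketch, assuming the locus has no true effect and the 1000 studies are independent, treats the number of false positives as binomial:

```python
import math


def false_positive_summary(n_groups, alpha):
    """Mean and standard deviation of the false-positive count, Binomial(n_groups, alpha)."""
    mean = n_groups * alpha
    sd = math.sqrt(n_groups * alpha * (1.0 - alpha))
    return mean, sd


mean, sd = false_positive_summary(1000, 0.05)  # values taken from the slide
print(f"expected false positives: {mean:.0f} (sd about {sd:.1f})")
```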

10
Alternatives: Controlling FDR
  • Rather than attempting to eliminate all false
    positives (Type I errors) at the expense of
    power, try to control their proportion.
  • pk = corrected critical p-value.
  • p0 = proportion of tests for which H0 is true.
  • PTD = proportion of markers with True Detectable
    effect.
  • FDR = False Discovery Rate (proportion of
    significant tests that are actually false).
  • Because the FDR is a proportion it is independent
    of sample size.

11
Controlling FDR (continued)
  • You never know what p0 is, but it can be
    estimated from previous data, e.g. the proportion
    of markers that replicate in independent studies.
  • PTD is essentially the power of the test (which
    is approximately equal across loci, assuming good
    genotyping levels).
  • Correction is therefore made on the basis of
    three parameters (p0, PTD and FDR) instead of
    two (α and m).
  • High p0 results in low pk: if the majority of
    loci have no effect, a conservative correction is
    needed.
  • FDR < 0.1 results in a large increase in the
    required sample size.
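The slides do not show the formula linking these parameters. A sketch under the assumption that FDR = p0·pk / (p0·pk + (1 − p0)·PTD), i.e. expected false positives over expected significant results, gives the critical p-value directly. The p0, PTD and FDR values are the ones quoted on the stepwise slides; the function names are illustrative.

```python
def false_discovery_rate(p0, p_k, ptd):
    """Expected share of significant tests that are false, under the assumed relation."""
    false_pos = p0 * p_k          # null markers that pass the threshold
    true_pos = (1.0 - p0) * ptd   # markers with an effect that are detected
    return false_pos / (false_pos + true_pos)


def critical_p(p0, ptd, target_fdr):
    """Solve the assumed FDR relation for the critical p-value pk."""
    return target_fdr * (1.0 - p0) * ptd / ((1.0 - target_fdr) * p0)


for p0 in (0.9999, 0.995, 0.75):
    p_k = critical_p(p0, ptd=0.8, target_fdr=0.1)
    print(f"p0={p0}: pk={p_k:.2e}, "
          f"FDR check={false_discovery_rate(p0, p_k, 0.8):.3f}")
```

The higher p0 is, the smaller the pk needed to hold the same FDR. Note that m does not appear anywhere in the calculation.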

12
Stepwise Approaches
  • Type all markers in a subset of the sample and
    test for association.
  • Follow up the subset of markers that show the
    strongest association.
  • Eliminates waste of resources typing markers
    that aren't associated.

13
Stepwise Approaches (cont.)
  • Three scenarios:
  • (a) Whole-genome LD scan, 5 x 10^6 markers with
    50 true effects (p0 = 0.9999).
  • (b) LD fine-mapping with 200 markers and 1 true
    effect (p0 = 0.995).
  • (c) Candidate gene approach (p0 = 0.75).
  • Quantify through genotyping burden, defined as
    the average number of genotypes (i.e. individuals
    who are typed) for a marker.
  • For a one-step approach this is simply the sample
    size.
  • For a two-step approach this is the number of
    individuals in the first subset plus the
    proportion of markers typed in the remaining
    individuals multiplied by the number of remaining
    individuals.
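The two-step definition can be written as a one-line calculation. The sample sizes and follow-up proportion below are made-up illustration values, and the function name is hypothetical.

```python
def two_stage_burden(n_stage1, n_total, prop_followed_up):
    """Average genotypes per marker: everyone in stage 1, plus the remaining
    individuals for the proportion of markers carried into stage 2."""
    n_remaining = n_total - n_stage1
    return n_stage1 + prop_followed_up * n_remaining


n_total = 1000                       # illustrative cohort size
one_step = n_total                   # one-step burden is simply the sample size
two_step = two_stage_burden(n_stage1=300, n_total=n_total, prop_followed_up=0.05)
print(f"one-step burden: {one_step}, two-step burden: {two_step:.0f}")
```

If every marker is followed up (proportion 1.0) the two-step burden equals the one-step burden, so any saving comes entirely from dropping markers after stage 1.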

14
Stepwise Approaches (cont.)
  • An iterative procedure is used to determine the
    optimal strategy (details on the authors' web
    site, lots of maths for those interested).
  • Seek to minimise genotyping burden under each of
    the proposed p0, with a PTD of 0.8 and an FDR of
    0.1 (i.e. 10% of significant results are false
    positives).
  • Basically you have small sample sizes and
    liberal p-values at stage 1, and large sample
    sizes and more conservative p-values at stage 2.
  • For scenarios where the majority of loci tested
    are not associated with disease (i.e. scenario a)
    this offers a great reduction in genotyping.
  • As the proportion of tests for which the null
    hypothesis is true decreases (i.e. p0 → 0) the
    advantage of a two-stage approach is diminished.
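A sketch of why the advantage shrinks as p0 falls. This is not the authors' iterative optimisation. It fixes one hypothetical design (stage-1 threshold 0.1, stage-1 power 0.9, 300 of 1000 individuals typed at stage 1) and assumes the proportion of markers carried forward is p0·α1 + (1 − p0)·power1.

```python
def proportion_followed_up(p0, alpha1, power1):
    """Share of markers passing stage 1: null markers by chance plus true effects detected."""
    return p0 * alpha1 + (1.0 - p0) * power1


def burden_for_p0(p0, alpha1=0.1, power1=0.9, n_stage1=300, n_total=1000):
    """Two-stage genotyping burden for a fixed, purely illustrative design."""
    carried = proportion_followed_up(p0, alpha1, power1)
    return n_stage1 + carried * (n_total - n_stage1)


for p0 in (0.9999, 0.995, 0.75, 0.0):
    print(f"p0={p0}: burden per marker = {burden_for_p0(p0):.1f} "
          f"(one-step = 1000)")
```

With most markers null, few are carried forward and the burden stays close to the stage-1 sample size. As p0 approaches 0 nearly every marker is followed up and the burden approaches the one-step sample size.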

15
Influencing Factors
  • The proportion of loci for which the null
    hypothesis is true affects the efficiency of a
    two-stage design. If it is high, the two-stage
    design greatly reduces genotyping burden.
  • The PTD (i.e. power) has a large effect on
    genotyping burden. A reduction results in an
    increase in genotyping burden.
  • Changes in the FDR have a much less marked effect
    on genotyping burden, unless it is close to zero.

16
Scanning Genomic Regions
  • Two-stage strategy offers a significant reduction
    in genotyping burden compared to one-step,
    because it is inefficient to type markers that
    are unlikely to have an effect.
  • Useful analogy of population-based screening for
    breast cancer.
  • The authors indicate that supplementing the
    statistical guidance of marker selection with
    biological knowledge can enhance this process.
  • They highlight the need for the reporting of all
    tests of association.
  • See http://geneticassociationdb.nih.gov

17
Candidate Genes
  • Two-stage strategy offers little advantage in
    candidate gene studies, because more loci are
    likely to have an effect, but....
  • FDR is an appealing method of controlling for
    multiple testing; however, the parameter p0 is
    never known (although it can be estimated).
  • Methods are available to correct for multiple
    testing using FDR and are implemented in Stata
    (see Newson (2003)).
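As an illustration of what such a method does, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one widely used FDR method. It is not necessarily the procedure described by Newson, and the p-values are invented for the example.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return a list of booleans marking which tests are declared significant.

    Sort the p-values, find the largest rank k with p_(k) <= k * fdr / m,
    and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Invented p-values for ten candidate-gene markers.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, fdr=0.1)
print(sum(significant), "of", len(p_values), "markers declared significant")
```

Because the procedure is step-up, a marker can be declared significant even if its own p-value exceeds its rank threshold, provided a larger p-value further down the list passes.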

18
Summary
  • Work is required to estimate p0 (the proportion
    of loci for which there is no association).
  • This will depend on the study design, but can be
    done empirically from published studies.
  • Possible to correct for multiple testing.
  • Concluded that two-stage approaches in which the
    FDR is controlled, rather than false positives
    eliminated (as in Lander & Kruglyak (1995)), are
    more efficient.
  • Suggest that current approaches to gene discovery
    are overly concerned with making false-positive
    associations.

19
Summary (cont.)
  • Currently we perform a lot of association studies
    in candidate genes, so the two-stage approach
    proposed here is of limited use.
  • We already employ a two-stage approach in which
    we seek to optimise the informativeness of the
    markers genotyped in the whole cohort.
  • At present we use haplotype tagging, although the
    LD-based selection proposed last week is a
    possible alternative.
  • Worthwhile using FDR to correct for multiple
    testing within candidate genes.
  • Investigation of FDR will be performed for GAW14.

20
References
  • In addition to the references in the article the
    following relate to FDR and multiple testing
    procedures...
  • Benjamini Y, et al. (2001) Controlling the false
    discovery rate in behaviour genetics research.
    Behavioural Brain Research 125:279-284
  • (see also http://www.math.tau.ac.il/~ybenja/)
  • Newson R, et al. (2003) Multiple-test procedures
    and smile plots. The Stata Journal 3(2):109-132
  • (see n\unit\The Stata Journal\)
  • Nyholt DR (2004) A simple correction for
    multiple testing for single-nucleotide
    polymorphisms in linkage disequilibrium with each
    other. Am J Hum Genet 74:765-769
  • (see http://genepi.qimr.edu.au/general/daleN/SNPSpD/)
  • Storey JD, Tibshirani R (2003) Statistical
    significance for genomewide studies. PNAS
    100(16):9440-9445