Data quality/usability and population-based biobanks - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Data quality/usability and population-based biobanks


1
Data quality/usability and population-based
biobanks
  • Paul Burton
  • Dept of Health Sciences
  • Dept of Genetics
  • University of Leicester

2
Structure of talk
  • Why does data quality/usability matter?
  • UK Biobank as an illustration
  • Statistical power of nested case-control studies
  • Expected event rates in UK Biobank
  • Biobank harmonisation
  • Conclusions

3
Why does data quality/usability matter?
4
Epidemiological analysis at its simplest
  • Odds ratio (OR) = (120 × 240)/(200 × 100) = 1.44
    (95% CI 1.04 to 2.0)
  • May also adjust for a confounder
  • e.g. high saturated fat intake y/n
  • What is the impact of error in an outcome or an
    explanatory variable or in a confounder?
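
The odds ratio arithmetic on this slide can be checked with a minimal sketch. The cell layout (120 exposed and 100 unexposed cases, 200 exposed and 240 unexposed controls) is an assumption inferred from the slide's arithmetic, and the interval is the standard Wald (Woolf) interval on the log odds ratio, which may not be the method used for the slide. The function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Cell counts assumed from the slide: (120 x 240)/(200 x 100)
or_hat, lower, upper = odds_ratio_ci(a=120, b=100, c=200, d=240)
print(f"OR = {or_hat:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval comes out at roughly 1.04 to 1.99, in line with the two figures shown after the odds ratio on the slide.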

5
Systematic error
  • Some disease free smokers deny smoking
  • Odds ratio (OR) = (120 × 250)/(190 × 100) = 1.58

6
Random error
  • At random, 10% of subjects state their exposure
    incorrectly
  • Odds ratio (OR) = (118 × 236)/(204 × 102) = 1.34
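
The effect of the two kinds of error on the odds ratio can be reproduced with a short sketch. It assumes the baseline table is 120 exposed and 100 unexposed cases with 200 exposed and 240 unexposed controls, that the systematic error moves 10 exposed controls into the unexposed column, and that the random error flips the reported exposure of 10% of every group; all three are inferred from the slide arithmetic rather than stated in the talk. The helper names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """OR for a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls."""
    return (a * d) / (b * c)


def misclassify(exposed, unexposed, rate):
    """Expected counts when a fraction `rate` of each group
    reports the wrong exposure status."""
    new_exposed = exposed * (1 - rate) + unexposed * rate
    new_unexposed = unexposed * (1 - rate) + exposed * rate
    return new_exposed, new_unexposed


cases = (120, 100)      # assumed (exposed, unexposed) cases
controls = (200, 240)   # assumed (exposed, unexposed) controls

true_or = odds_ratio(*cases, *controls)

# Systematic error: 10 disease-free smokers (assumed number) deny smoking
systematic_or = odds_ratio(cases[0], cases[1], controls[0] - 10, controls[1] + 10)

# Random error: 10% of subjects in every group misreport exposure
random_or = odds_ratio(*misclassify(*cases, 0.10), *misclassify(*controls, 0.10))

print(f"no error:         {true_or:.2f}")
print(f"systematic error: {systematic_or:.2f}")
print(f"random error:     {random_or:.2f}")
```

In this example the systematic error inflates the estimate, while the random error shrinks it towards 1.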

7
The impact of errors
  • Systematic errors in outcome or explanatory
    variables → systematic bias in either direction
  • True OR = 2 → estimated OR e.g. 1.5 or 2.7
  • Random errors in binary outcomes or any
    explanatory variables → shrinkage bias
  • True OR = 2 → estimated OR e.g. 1.5
  • Random errors in confounding variables →
    systematic bias in either direction
  • True OR = 2 → estimated OR e.g. 1.5 or 2.7

8
Errors in biobanks
  • Random errors
  • Loss of power is primary problem
  • Biobank sample sizes very large, so why is there
    a problem?

9
Errors in biobanks
  • Random errors
  • But why are biobank sample sizes so large?
  • NB: the biobanks are very large, but the nested
    case-control studies are not
  • Need to detect small relative risks (e.g. OR = 1.3)
  • Power generally limited (see later)
  • Small error effects catastrophic
  • Apparent causal effects easily created or
    destroyed

10
Errors in biobanks
  • Systematic errors
  • Small real effects a major issue again
  • Must understand data collection protocols, and
    must attempt to optimise those protocols
  • UK Biobank
  • P3G Observatory

11
What is UK Biobank?
12
Basic design features
  • A prospective cohort study
  • 500,000 adults across UK
  • Middle aged (40-69 years)
  • A population-based biobank
  • Not disease or exposure based
  • Recruitment via electronic GP lists
  • Broad spectrum, not fully representative
  • Individuals not families
  • MRC, Wellcome Trust, DH, Scottish Executive
  • £61M

13
Basic design features
  • Longitudinal health tracking
  • Nested case-control studies
  • Long time-horizon
  • Owned by the Nation
  • Central administration: Manchester
  • PI: Prof Rory Collins (Oxford)
  • 6 collaborating groups (RCCs) of university
    scientists

14
Statistical power and sample size
15
Focus on power of nested case-control analyses
  • Likely to be very common analyses
  • Power limiting

16
Issues that are often ignored in standard power
calculations
  • Multiple testing/low prior probability of
    association
  • Interactions
  • Unobserved frailty
  • Misclassification
  • Genotype
  • Environmental determinant
  • Case-control status
  • Subgroup analyses
  • Population substructure

17
Power calculations
  • Work with least powerful setting
  • Binary disease, binary genotype, binary
    environmental exposure
  • Logistic regression analysis: interactions =
    departure from a multiplicative model
  • Complexity (arbitrary but reasonable)

18
Summarise power using Minimum Detectable Odds
Ratios (MDORs) calculated by iterative
simulation
  • Estimate minimum ORs detectable with 80% power at
    stated level of statistical significance under
    specified scenario

19
Genetic main effects
20
Whole genome scan
  • Genetic main effect, p < 10^-7

21
Gene-environment interaction
  • 20,000 cases

22
Summary rule of thumb
  • 80% power for genotype frequency 0.1 (allele
    frequency ≈ 0.05 under dominant model)
  • Genetic main effect ≈ 1.5, p = 10^-4 → 5,000 cases
  • Genetic main effect ≈ 1.3, p = 10^-4 → 10,000 cases
  • Genetic main effect ≈ 1.2, p = 10^-4 → 20,000 cases
  • Genetic main effect ≈ 1.4, p = 10^-7 → 10,000 cases
  • Genetic main effect ≈ 1.3, p = 10^-7 → 20,000 cases
  • G×E interaction with environmental exposure
    prevalence 0.2: ≈ 2.0, p = 10^-4 → 20,000
    cases
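
The link between a genotype (carrier) frequency of 0.1 and an allele frequency of about 0.05 under a dominant model can be checked with a small sketch. It assumes Hardy-Weinberg proportions, which the slide does not state, so that the carrier frequency equals 1 - (1 - q)^2 for risk-allele frequency q. The function name is illustrative.

```python
import math


def allele_freq_from_carrier_freq(carrier_freq):
    """Risk-allele frequency q solving 1 - (1 - q)**2 == carrier_freq.

    Assumes Hardy-Weinberg proportions and a dominant model, where
    carriers are everyone with at least one copy of the risk allele.
    """
    return 1 - math.sqrt(1 - carrier_freq)


q = allele_freq_from_carrier_freq(0.1)
print(f"allele frequency = {q:.3f}")
```

With a carrier frequency of 0.1 this gives an allele frequency of roughly 0.051.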

23
Effect of realistic data errors
24
Expected event rates in UK Biobank
25
Taking account of
  • Age range at recruitment 40-69 years
  • Recruitment over 5 years
  • All cause mortality
  • Disease incidence (healthy cohort effect)
  • Migration overseas
  • Comprehensive withdrawal (max 1/500 p.a.)

26
No need to contact subjects
27
Smaller sample sizes
28
Interim conclusions
  • Having taken account of realistic bioclinical
    complexity, UK Biobank is just large enough to be
    of great value as a stand-alone research
    infrastructure
  • Data quality, in particular errors in outcome or
    explanatory variables, or in confounders is
    crucial
  • Its value will be greatly augmented if it proves
    possible to set up a coherent and scientifically
    harmonized international network of Biobanks and
    large cohort studies

29
Harmonising biobanks internationally
30
Why harmonise?
  • Basic aim is to enable and promote data pooling,
    in a manner that recognises and takes appropriate
    account of systematic differences between studies.

31
Why harmonise?
  • Investigate less common (but not rare) conditions
  • UKBB: Ca stomach, 2,500 cases in 29 years
  • 6 UKBB equivalents → 10,000 cases in 20 years
  • Investigate smaller ORs
  • Genetic main effect 1.5 → 1.2 requires 5,000 →
    20,000 cases
  • 4 UKBB equivalents
  • Analysis based on subsets: homogeneous classes
    of phenotype, or e.g. by sex

32
Why harmonise?
  • Earlier analyses
  • UKBB: Alzheimer's disease, 10,000 cases in 18 years
  • 5 UKBB equivalents → 9 years
  • Events at younger ages
  • Broad range of environmental exposures
  • Aim for 4-6 UKBB equivalents
  • 2M-3M recruits

33
Harmonisation initiatives
  • Public Population Project in Genomics (P3G)
  • Canada, Europe
  • Tom Hudson, Bartha Knoppers, Leena Peltonen,
    Isabel Fortier ...
  • Population Biobanks
  • FP6 Co-ordination Action (PHOEBE: Promoting
    Harmonisation Of Epidemiological Biobanks in
    Europe)
  • Camilla Stoltenberg, Paul Burton, Leena Peltonen,
    George Davey Smith ...

34
Harmonisation in the P3G Observatory (from Isabel
Fortier)
  • Description
  • Comparison
  • Harmonisation
  • Data quality crucial at every stage

35
Final conclusions
  • Power of individual biobanks is limited
  • Minimisation of measurement error is crucial
  • Harmonisation is crucial if we are to optimise
    the value of biobanks internationally
  • Harmonisation depends on a full understanding of
    all aspects of data quality

36
Extra slides
37
Rarer genotypes
  • Genetic main effects

38
Gene-environment interaction
  • 10,000 cases

39
Hattersley AT, McCarthy MI. A question of
standards: what makes a good genetic association
study? Lancet 2005 (in press).
40
Summarise power using MDORs calculated by
iterative simulation
  • Want minimum ORs detectable with 80% power at
    stated level of statistical significance
  • 1. Guess starting values for ORs
  • 2. Simulate population under specified scenario
  • 3. Sample required number of cases and controls
  • 4. Analyse resultant case-control study in
    standard way
  • 5. Repeat steps 2, 3 and 4 1,000 times
  • 6. Use empirical statistical power results from
    the 1,000 analyses to update ORs to new values
    expected to generate a power of 80%
  • Repeat steps 2-6 until all ORs have 80% power
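
The iterative procedure above can be sketched in heavily simplified form. This is not the simulation used for the talk. It handles a single binary genotype only, samples cases and controls directly instead of simulating a whole population, ignores misclassification, frailty and interactions, and uses bisection as a stand-in for the unspecified OR update rule. The Wald test on the 2x2 table plays the role of a logistic regression with one binary covariate. All function and parameter names are illustrative.

```python
import math
import random
from statistics import NormalDist


def simulate_power(odds_ratio, n_cases, n_controls, genotype_freq, z_crit, reps, rng):
    """Share of simulated case-control studies in which the Wald test rejects."""
    # Genotype frequency in controls is taken as the population frequency
    # (rare-disease assumption); in cases it follows from the odds ratio.
    odds_cases = odds_ratio * genotype_freq / (1 - genotype_freq)
    p_cases = odds_cases / (1 + odds_cases)
    hits = 0
    for _ in range(reps):
        a = sum(rng.random() < p_cases for _ in range(n_cases))           # exposed cases
        c = sum(rng.random() < genotype_freq for _ in range(n_controls))  # exposed controls
        b, d = n_cases - a, n_controls - c
        if min(a, b, c, d) == 0:
            continue  # log OR undefined; counted as not significant
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(log_or / se) > z_crit:
            hits += 1
    return hits / reps


def minimum_detectable_or(n_cases, n_controls, genotype_freq, alpha,
                          target=0.80, reps=1000, iterations=12, seed=1):
    """Bisection search for the OR whose empirical power is about `target`."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    low, high = 1.0, 5.0                          # assumed search bracket
    for _ in range(iterations):
        mid = (low + high) / 2
        power = simulate_power(mid, n_cases, n_controls, genotype_freq, z_crit, reps, rng)
        if power < target:
            low = mid    # not enough power: the MDOR is larger
        else:
            high = mid   # enough power: try a smaller OR
    return (low + high) / 2


# Small illustrative run (fewer cases and replicates than the slide, for speed)
mdor = minimum_detectable_or(n_cases=500, n_controls=500, genotype_freq=0.1,
                             alpha=0.05, reps=200, iterations=8)
print(f"approximate MDOR: {mdor:.2f}")
```

Because this sketch leaves out the realistic complexity built into the talk's scenarios, it should be expected to give smaller (more optimistic) MDORs than those quoted in the rule of thumb.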

41
Taking account of
  • Age range at recruitment 40-69 years
  • Recruitment over 5 years
  • All cause mortality
  • Disease incidence (healthy cohort effect)
  • Migration overseas
  • Comprehensive withdrawal (max 1/500 p.a.)
  • Partial withdrawal (c.f. 1958 Birth Cohort)

42
(No Transcript)
43
Necessary to contact subjects
44
(No Transcript)