Title: Data quality/usability and population-based biobanks
1Data quality/usability and population-based
biobanks
- Paul Burton
- Dept of Health Sciences
- Dept of Genetics
- University of Leicester
2Structure of talk
- Why does data quality/usability matter?
- UK Biobank as an illustration
- Statistical power of nested case-control studies
- Expected event rates in UK Biobank
- Biobank harmonisation
- Conclusions
3Why does data quality/usability matter?
4Epidemiological analysis at its simplest
- Odds ratio (OR) = (120×240)/(200×100) = 1.44
(95% CI 1.04 to 2.0)
- May also adjust for a confounder
- e.g. high saturated fat intake y/n
- What is the impact of error in an outcome or an
explanatory variable or in a confounder?
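As a minimal sketch of the arithmetic on this slide, the odds ratio and an approximate 95% confidence interval can be computed from a 2×2 table. The cell layout (120 exposed cases, 100 unexposed cases, 200 exposed controls, 240 unexposed controls) is inferred from the products shown and is an assumption. The Woolf (log-based) interval is also an assumption about how the slide's interval was obtained, and the function name is illustrative.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-scale) confidence interval for a 2x2 table.

    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Cell counts assumed from the slide's (120×240)/(200×100)
or_, lo, hi = odds_ratio_with_ci(120, 100, 200, 240)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

With these assumed counts the interval only just excludes 1, which is why modest errors in exposure or outcome can change the conclusion.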
5Systematic error
- Some disease free smokers deny smoking
- Odds ratio (OR) = (120×250)/(190×100) = 1.58
6Random error
- At random, 10% of subjects state their exposure
incorrectly
- Odds ratio (OR) = (118×236)/(204×102) = 1.34
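The two error scenarios can be reproduced with a short sketch. It assumes the same 2×2 layout as the error-free example (120 and 100 exposed and unexposed cases, 200 and 240 exposed and unexposed controls). It also assumes that the systematic error corresponds to 10 disease-free smokers being recorded as non-smokers, a number inferred from the counts rather than stated on the slide. Helper names are illustrative.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Cross-product odds ratio for a 2x2 table."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)

def deny_exposure(exposed, unexposed, n_deny):
    """Systematic error: n_deny exposed subjects are recorded as unexposed."""
    return exposed - n_deny, unexposed + n_deny

def random_flip(exposed, unexposed, rate):
    """Random error: a fraction `rate` of each group reports the wrong exposure.

    Returns expected counts, so the values need not be integers.
    """
    new_exposed = exposed * (1 - rate) + unexposed * rate
    new_unexposed = unexposed * (1 - rate) + exposed * rate
    return new_exposed, new_unexposed

cases = (120, 100)      # (exposed, unexposed), assumed layout
controls = (200, 240)   # (exposed, unexposed), assumed layout

true_or = odds_ratio(*cases, *controls)

# Systematic error: only disease-free smokers misreport (10 is inferred)
sys_or = odds_ratio(*cases, *deny_exposure(*controls, 10))

# Random error: 10% of all subjects misreport, cases and controls alike
rand_or = odds_ratio(*random_flip(*cases, 0.10), *random_flip(*controls, 0.10))

print(f"true {true_or:.2f}, systematic {sys_or:.2f}, random {rand_or:.2f}")
```

The systematic error moves the estimate away from the true value in a direction that depends on who misreports, whereas the random error pulls it towards 1.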
7The impact of errors
- Systematic errors in outcome or explanatory
variables → systematic bias in either direction
- True OR = 2 → estimated OR e.g. 1.5 or 2.7
- Random errors in binary outcomes or any
explanatory variables → shrinkage bias
- True OR = 2 → estimated OR e.g. 1.5
- Random errors in confounding variables →
systematic bias in either direction
- True OR = 2 → estimated OR e.g. 1.5 or 2.7
8Errors in biobanks
- Random errors
- Loss of power is primary problem
- Biobank sample sizes very large, so why is there
a problem?
9Errors in biobanks
- Random errors
- But why are biobank sample sizes so large?
- NB: the biobanks are very large, not the nested
case-control studies
- Need to detect small relative risks (e.g. OR 1.3)
- Power generally limited (see later)
- Small error effects catastrophic
- Apparent causal effects easily created or
destroyed
10Errors in biobanks
- Systematic errors
- Small real effects a major issue again
- Must understand data collection protocols, and
must attempt to optimise those protocols
- UK Biobank
- P3G Observatory
11What is UK Biobank?
12Basic design features
- A prospective cohort study
- 500,000 adults across UK
- Middle aged (40-69 years)
- A population-based biobank
- Not disease or exposure based
- Recruitment via electronic GP lists
- Broad spectrum, not fully representative
- Individuals not families
- MRC, Wellcome Trust, DH, Scottish Executive
- £61M
13Basic design features
- Longitudinal health tracking
- Nested case-control studies
- Long time-horizon
- Owned by the Nation
- Central administration: Manchester
- PI: Prof Rory Collins, Oxford
- 6 collaborating groups (RCCs) of university
scientists
14Statistical power and sample size
15Focus on power of nested case-control analyses
- Likely to be very common analyses
- Power limiting
16Issues that are often ignored in standard power
calculations
- Multiple testing/low prior probability of
association
- Interactions
- Unobserved frailty
- Misclassification
- Genotype
- Environmental determinant
- Case-control status
- Subgroup analyses
- Population substructure
17Power calculations
- Work with least powerful setting
- Binary disease, binary genotype, binary
environmental exposure
- Logistic regression analysis: interactions =
departure from a multiplicative model
- Complexity (arbitrary but reasonable)
18Summarise power using Minimum Detectable Odds
Ratios (MDORs) calculated by iterative
simulation
- Estimate minimum ORs detectable with 80% power at
stated level of statistical significance under
specified scenario
19Genetic main effects
20Whole genome scan
- Genetic main effect, p < 10^-7
21Gene-environment interaction
22Summary rule of thumb
- 80% power for genotype frequency 0.1 (allele
frequency ≈ 0.05 under dominant model)
- Genetic main effect ≈ 1.5, p < 10^-4 → 5,000 cases
- Genetic main effect ≈ 1.3, p < 10^-4 → 10,000 cases
- Genetic main effect ≈ 1.2, p < 10^-4 → 20,000 cases
- Genetic main effect ≈ 1.4, p < 10^-7 → 10,000 cases
- Genetic main effect ≈ 1.3, p < 10^-7 → 20,000 cases
- G×E interaction (environmental exposure
prevalence 0.2) ≈ 2.0, p < 10^-4 → 20,000 cases
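The correspondence between a genotype frequency of 0.1 and an allele frequency of about 0.05 follows if one assumes Hardy-Weinberg proportions (an assumption not stated on the slide). Writing q for the risk-allele frequency, the proportion of carriers under a dominant model is:

```latex
P(\text{carrier}) = 1 - (1 - q)^2 = 2q - q^2,
\qquad q = 0.05 \;\Rightarrow\; P(\text{carrier}) = 0.0975 \approx 0.1
```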
23Effect of realistic data errors
24Expected event rates in UK Biobank
25Taking account of
- Age range at recruitment 40-69 years
- Recruitment over 5 years
- All cause mortality
- Disease incidence (healthy cohort effect)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
26No need to contact subjects
27Smaller sample sizes
28Interim conclusions
- Having taken account of realistic bioclinical
complexity, UK Biobank is just large enough to be
of great value as a stand-alone research
infrastructure
- Data quality, in particular errors in outcome or
explanatory variables, or in confounders, is
crucial
- Its value will be greatly augmented if it proves
possible to set up a coherent and scientifically
harmonised international network of biobanks and
large cohort studies
29Harmonising biobanks internationally
30Why harmonise?
- Basic aim is to enable and promote data pooling,
in a manner that recognises and takes appropriate
account of systematic differences between studies.
31Why harmonise?
- Investigate less common (but not rare) conditions
- UKBB: Ca stomach, 2,500 cases in 29 years
- 6 UKBB equivalents → 10,000 cases in 20 years
- Investigate smaller ORs
- GME 1.5 → 1.2 requires 5,000 → 20,000 cases
- 4 UKBB equivalents
- Analysis based on subsets: homogeneous classes
of phenotype, or e.g. by sex
32Why harmonise?
- Earlier analyses
- UKBB: Alzheimer's disease, 10,000 cases in 18 years
- 5 UKBB equivalents → 9 years
- Events at younger ages
- Broad range of environmental exposures
- Aim for 4-6 UKBB equivalents
- 2M to 3M recruits
33Harmonisation initiatives
- Public Population Project in Genomics (P3G)
- Canada, Europe
- Tom Hudson, Bartha Knoppers, Leena Peltonen,
Isabel Fortier ..
- Population Biobanks
- FP6 Co-ordination Action (PHOEBE: Promoting
Harmonisation Of Epidemiological Biobanks in
Europe)
- Camilla Stoltenberg, Paul Burton, Leena Peltonen,
George Davey Smith ..
34Harmonisation in the P3G Observatory (from Isabel
Fortier)
- Description
- Comparison
- Harmonisation
- Data quality crucial at every stage
35Final conclusions
- Power of individual biobanks is limited
- Minimisation of measurement error is crucial
- Harmonisation is crucial if we are to optimise
the value of biobanks internationally
- Harmonisation depends on a full understanding of
all aspects of data quality
36Extra slides
37Rarer genotypes
38Gene-environment interaction
39Hattersley AT, McCarthy MI. A question of
standards: what makes a good genetic association
study? Lancet 2005 in press.
40Summarise power using MDORs calculated by
iterative simulation
- Want minimum ORs detectable with 80% power at
stated level of statistical significance
- 1. Guess starting values for ORs
- 2. Simulate population under specified scenario
- 3. Sample required number of cases and controls
- 4. Analyse resultant case-control study in
standard way
- 5. Repeat steps 2-4 1,000 times
- 6. Use empirical statistical power results from
the 1,000 analyses to update ORs to new values
expected to generate a power of 80%
- Repeat 2-6 till all ORs have 80% power
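The loop above can be illustrated with a heavily simplified sketch. It is not the calculation behind the slides. It handles one binary exposure only, with no interactions, frailty or misclassification. It draws the case-control sample directly instead of simulating a whole population, and it assumes the exposure frequency in controls equals the population frequency. The analysis step is a Wald test on the log odds ratio of the 2×2 table with a 0.5 continuity correction, standing in for logistic regression. The update step is plain bisection, standing in for the update rule in step 6. All function names, the 1:1 case-control ratio and the default settings are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

def empirical_power(odds_ratio, n_cases, n_controls, exposure_freq,
                    alpha, reps, rng):
    """Share of simulated case-control studies reaching significance."""
    p0 = exposure_freq
    # Exposure odds in cases = OR x exposure odds in controls
    p1 = p0 * odds_ratio / (1.0 - p0 + p0 * odds_ratio)
    exp_cases = rng.binomial(n_cases, p1, size=reps)
    exp_controls = rng.binomial(n_controls, p0, size=reps)
    # Add 0.5 to every cell so empty cells cannot break the log
    a = exp_cases + 0.5
    b = n_cases - exp_cases + 0.5
    c = exp_controls + 0.5
    d = n_controls - exp_controls + 0.5
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    return float(np.mean(np.abs(log_or / se) > z_crit))

def minimum_detectable_or(n_cases, n_controls, exposure_freq=0.1,
                          alpha=1e-4, target_power=0.8, reps=1000,
                          seed=0, lo=1.0, hi=3.0, iters=12):
    """Bisection search for the OR whose empirical power hits the target."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        power = empirical_power(mid, n_cases, n_controls, exposure_freq,
                                alpha, reps, rng)
        if power < target_power:
            lo = mid   # too little power: a larger OR is needed
        else:
            hi = mid   # enough power: try a smaller OR
    return (lo + hi) / 2.0

mdor = minimum_detectable_or(n_cases=5000, n_controls=5000)
print(f"approximate MDOR: {mdor:.2f}")
```

Because this sketch leaves out the sources of complexity listed earlier in the talk, it should return a smaller MDOR than a realistic calculation would for the same number of cases.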
41Taking account of
- Age range at recruitment 40-69 years
- Recruitment over 5 years
- All cause mortality
- Disease incidence (healthy cohort effect)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
- Partial withdrawal (c.f. 1958 Birth Cohort)
43Necessary to contact subjects