Title: Size matters: the value of large scale epidemiology
1Size matters: the value of large scale epidemiology
- Paul Burton
- Professor of Genetic Epidemiology
- University of Leicester
- P³G Consortium
- PHOEBE
2A daunting task!
- Need for extensive, valid information
- Developments in biotechnology, IT
- Pre-morbid and longitudinal life-style/environment relevant
- Bioclinical complexity → low statistical power!!
3Large scale genetic epidemiology
- Focus on the aetiology of complex diseases
- Common disease common variant hypothesis
- a shift in paradigm from linkage to association
- BUT serious failure to identify associations
that can consistently be replicated
4Hattersley AT, McCarthy MI. Lancet 2005; 366: 1315-1323. Examples of some polymorphisms or haplotypes that have shown consistent association with complex disease
5Why has replication proved to be so difficult?
- Poorly designed studies
- e.g. wrong controls, family v non-family designs
- Poorly conducted analyses and meta-analyses
- e.g. use of inefficient or inconsistent methods; failure to take proper account of extreme multiple testing; publication and/or reporting bias
- Inconsistent definitions of outcome or exposure
- e.g. what do we mean by asthma?
- Poor methods of assessment
- e.g. bad choice of SNP genotyping platform
6Why has replication proved to be so difficult?
- Heterogeneity
- e.g. stroke encompasses important subcategories; phenocopies; pleiotropy
- Population substructure
- Latent stratification and admixture pertaining to
population of origin
7Why has replication proved to be so difficult?
- LOW STATISTICAL POWER!!
- A key feature of almost all proffered explanations, and/or of the approaches needed to correct for them
- If we need 5,000 cases to test for a given aetiological effect with a power of 80%, and with a critical p-value of 0.0001, how much power would there be for a study with 500 cases?
8Why has replication proved to be so difficult?
- LOW STATISTICAL POWER!!
- A key feature of almost all proffered explanations, and/or of the approaches needed to correct for them
- If we need 5,000 cases to test for a given aetiological effect with a power of 80%, and with a critical p-value of 0.0001, how much power would there be for a study with 500 cases?
≈ 0.008!!
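The figure of roughly 0.008 can be checked with a short normal-approximation sketch. It assumes a two-sided Wald test whose expected z-statistic grows with the square root of the number of cases; the function name `power_at_smaller_n` and the approximation itself are illustrative assumptions, not something stated on the slides.

```python
from statistics import NormalDist


def power_at_smaller_n(n_ref, power_ref, alpha, n_new):
    """Approximate power at n_new cases, given that n_ref cases give power_ref.

    Assumption (not from the slides): two-sided z-test, with the expected
    z-statistic proportional to sqrt(number of cases).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # critical value
    z_beta = std_normal.inv_cdf(power_ref)        # quantile for reference power
    expected_z_ref = z_alpha + z_beta             # expected z at n_ref cases
    expected_z_new = expected_z_ref * (n_new / n_ref) ** 0.5
    # Probability of exceeding the critical value (far tail ignored).
    return std_normal.cdf(expected_z_new - z_alpha)


power = power_at_smaller_n(n_ref=5000, power_ref=0.80, alpha=0.0001, n_new=500)
print(round(power, 3))
```

Under these assumptions a tenfold cut in cases shrinks the expected z-statistic by a factor of about 3.2, which is why the power collapses from 80% to under 1%.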
9How should we respond?
- Increase the quality of individual studies
- Limit measurement/assessment error
- Increase the size of individual studies
- Promote harmonization to enable data pooling and
integration
10How should we respond?
- Increase the quality of individual studies
- Limit measurement/assessment error
- Increase the size of individual studies
- Promote harmonization to enable data pooling and
integration
→ MAJOR international investment in biobanks and biobank harmonization
11What is a biobank?
- An organised collection of human biological material and associated information stored for one or more research purposes
- Population Biobanks Lexicon (P3G, PHOEBE)
- Types
- Disease-specific
- Exposure-focused
- Population-based
12Justification for large-scale genetic-epidemiology programs
13BIG per se
- No argument about
- Need to increase statistical power
- Benefit of constructing biobanks containing extensive case-series for case-control studies
- Benefit of constructing large, acceptably representative series of controls for each nation
14BIG cohort studies
- Studies of the joint effects of genes and environment/life-style
- Genotype-based studies
- The genetics of disease progression
- Direct association of genes with disease
- Population-based replication studies
- Universal controls
15BUT how big is big? With Anna Hansell, Imperial College
16The statistical power of case-control studies
- Contemporary pre-eminence of genetic association studies rather than genetic linkage studies
- Covers both stand-alone case-control studies, and nested case-control studies in large cohorts. Main issue is the number of cases.
- Sample size determination in both settings
17Simulation-based power calculations
- Work with the least powerful (common) setting
- Disease outcome and exposures all binary
- Logistic regression: interactions = departure from a multiplicative model
- Complexity (arbitrary but realistic)
- Four controls per case
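A heavily simplified sketch of a simulation-based power calculation for the setting just listed (binary disease, one binary exposure, four controls per case) might look as follows. It is not the simulation used for the talk: it leaves out the measurement error, heterogeneity and interactions that the slides stress, and the function name `simulate_power` and all parameter names are hypothetical. For a single binary exposure, the Wald test on the log odds ratio of the 2x2 table coincides with the Wald test on the logistic regression coefficient, so no regression library is needed.

```python
import math
import random
from statistics import NormalDist


def simulate_power(n_cases, odds_ratio, exposure_prev, alpha,
                   controls_per_case=4, n_sims=200, seed=1):
    """Estimate power by simulating unmatched case-control studies.

    Simplifying assumptions (not from the slides): exposure is measured
    without error and the population is homogeneous.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    n_controls = controls_per_case * n_cases
    # Exposure prevalence among cases implied by the odds ratio.
    odds_in_cases = odds_ratio * exposure_prev / (1 - exposure_prev)
    prev_in_cases = odds_in_cases / (1 + odds_in_cases)

    significant = 0
    for _ in range(n_sims):
        a = sum(rng.random() < prev_in_cases for _ in range(n_cases))      # exposed cases
        c = sum(rng.random() < exposure_prev for _ in range(n_controls))   # exposed controls
        b, d = n_cases - a, n_controls - c                                 # unexposed
        if min(a, b, c, d) == 0:
            # Continuity correction so the log odds ratio stays finite.
            a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        log_or = math.log((a * d) / (b * c))
        std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(log_or / std_err) > z_crit:
            significant += 1
    return significant / n_sims


print(simulate_power(n_cases=2000, odds_ratio=1.5, exposure_prev=0.1, alpha=1e-4))
```

Because this sketch ignores misclassification and other sources of bioclinical complexity, its power estimates should be read as optimistic upper bounds relative to a realistic simulation.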
18Diabetes mellitus defined by the HbA1C 97.5th percentile
19Genetic main effects
Prevalence of at-risk genotype = 0.1, 0.5
20Lifestyle main effects
Prevalence of at-risk life-style determinant = 0.5
Reliability
- 1.0: measured height
- 0.9: self-reported weight
- 0.7: office BP, measured serum cholesterol
- 0.5: dietary recall of many components (4 × 24 hr recalls)
21Gene-lifestyle interactions
Prevalence of at-risk genotype = 0.1
Prevalence of at-risk life-style determinant = 0.5
22Mean power ≈ 55%
23What is needed?
- Genetic main effects
- 2,500-10,000 cases
- Life-style main effects
- 5,000-20,000 cases
- Gene-lifestyle interactions
- Probably need at least 20,000 cases
24How can this be achieved?
- Large disease-based biobanks
- Very large cohort-based biobanks
- But how large do these need to be?
25Expected event rates in UK Biobank. With Anna Hansell, Imperial College
26Taking account of
- Age range at recruitment 40-69 years
- Recruitment over 5 years
- All cause mortality
- Disease incidence (healthy cohort effect)
- Migration overseas
- Withdrawal from the study
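How these factors combine can be illustrated with a deliberately crude accrual sketch. It assumes constant annual rates and uniform recruitment; real incidence and mortality rise steeply with age across the 40-69 range, so this is only a schematic. The function `expected_cases` and every rate passed to it are hypothetical placeholders, not figures from the talk.

```python
def expected_cases(n_recruits, years, incidence, mortality, loss,
                   recruit_years=5):
    """Cumulative expected incident cases at the end of each year.

    All rates are annual probabilities and are hypothetical inputs:
      incidence - disease incidence among those still at risk
      mortality - all-cause mortality
      loss      - migration overseas plus withdrawal from the study
    Assumption: recruits arrive evenly over `recruit_years`.
    """
    at_risk = 0.0
    total = 0.0
    cumulative = []
    for year in range(years):
        if year < recruit_years:
            at_risk += n_recruits / recruit_years   # new recruits this year
        new_cases = at_risk * incidence
        total += new_cases
        at_risk -= new_cases                         # cases leave the risk set
        at_risk *= (1 - mortality) * (1 - loss)      # deaths and losses
        cumulative.append(total)
    return cumulative


# Illustrative call with made-up rates; only the cohort size and the
# 5-year recruitment window come from the slides.
trajectory = expected_cases(500_000, years=20, incidence=0.001,
                            mortality=0.01, loss=0.002)
print(round(trajectory[-1]))
```

Even this schematic shows the qualitative point: attrition compounds every year, so the number of accrued cases falls short of a naive "recruits × rate × years" estimate.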
28Conclusions
- Having taken account of realistic bioclinical complexity, a cohort-based biobank needs to be very large if it is to provide a stand-alone infrastructure
- Anything much less than 500,000 recruits severely curtails the number of diseases that can be studied on the basis of that biobank alone
- The value of any biobank will be greatly augmented if it proves possible to set up a coherent and scientifically harmonized international network of biobanks
29What is biobank harmonization?
30Biobank harmonization
- A set of procedures that promote, both now and in the future, the effective interchange of valid information and samples between a number of studies or biobanks, accepting that there may be important differences between those studies
- With thanks to Alastair Kent
31Biobank harmonization
- Prospective harmonization
- Aims to modify study design and conduct, ahead of time, in order to render subsequent data and sample pooling more efficient and more straightforward
- Retrospective harmonization
- Aims to optimize the pooling of data, samples and phenotypes that have already been collected, between studies with inevitably heterogeneous designs.
32Why harmonize?
- Investigate less common (but not rare!!!) conditions
- UKBB: Ca stomach, 2,500 cases in 29 years
- 6 UKBB equivalents → 10,000 cases in 20 years
- Investigate smaller ORs
- GME (genetic main effect) 1.5 → 1.2 requires 2,000 → 12,600 cases
- 6.3 UKBB equivalents
- Analysis based on subsets: homogeneous classes of phenotype, or e.g. by sex
33Why harmonize?
- Earlier analyses
- UKBB: Alzheimer's disease, 10,000 cases in 18 yrs
- 5 UKBB equivalents → 9 years
- Events at younger ages
- Broad range of environmental exposures
- Aim for 5-6 UKBB equivalents
- 2.5M-3M recruits
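The "UKBB equivalents" arithmetic can be made explicit with a small sketch. It assumes that, over a fixed follow-up period, the number of cases scales linearly with the number of recruits, and that one UK Biobank equivalent is 500,000 recruits; the helper names are hypothetical.

```python
UKBB_RECRUITS = 500_000  # planned UK Biobank cohort size


def ukbb_equivalents(cases_needed, cases_from_one_ukbb):
    """How many UK Biobank-sized cohorts are needed to reach cases_needed.

    Assumption: case yield is proportional to the number of recruits.
    """
    return cases_needed / cases_from_one_ukbb


def pooled_recruits(equivalents):
    """Total recruits in a pooled resource of the given size."""
    return equivalents * UKBB_RECRUITS


# Moving from a genetic main effect of 1.5 (2,000 cases) to 1.2 (12,600 cases).
print(ukbb_equivalents(12_600, 2_000))
# Size of a pooled resource of 5 to 6 UK Biobank equivalents.
print(pooled_recruits(5), pooled_recruits(6))
```

The time-based figures on these slides (for example 10,000 cases in 20 years rather than 29) do not follow from this linear scaling alone, since incidence changes as the cohort ages.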
34Some key issues
- Scientifically and politically VERY challenging
- Laboratory science, clinical science, population science, IT challenges, ethico-legal issues
- A need for REAL collaboration and tools that are ACCESSIBLE and USABLE
- Case-control and cohort studies
35International biobank harmonization programs
- Public Population Project in Genomics (P3G)
- Tom Hudson, Bartha Knoppers, Isabel Fortier
- Population Biobanks
- FP6 Co-ordination Action (PHOEBE: Promoting Harmonization Of Epidemiological Biobanks in Europe)
- Jennifer Harris, Leena Peltonen, Paul Burton
- Human Genome Epidemiology Network (HuGENet)
- Muin Khoury, Julian Little
- ESSENTIAL THAT ALL INITIATIVES WORK TOGETHER!!
36Extra slides
37Rarer genotypes
38Proposed assessment visit model
39Taking account of
- Age range at recruitment 40-69 years
- Recruitment over 5 years
- All cause mortality
- Disease incidence (healthy cohort effect)
- Migration overseas
- Comprehensive withdrawal (max 1/500 p.a.)
- Partial withdrawal (c.f. 1958 Birth Cohort)
41Necessary to contact subjects
42Issues that are often ignored in standard power calculations
- Multiple testing/low prior probability of association
- Interactions
- Unobserved frailty
- Misclassification
- Genotype
- Environmental determinant
- Case-control status
- Subgroup analyses
- Population substructure
43Harmonisation
- Prospective
- Retrospective
- Description
- Comparison
- Harmonised synthesis
46Recruitment and assessment
- Recruitment via centrally held list of individuals registered with Primary Care Practitioners (GPs)
- Assessment in large centres (≈100 subjects per day)
- Assessment ≈ 70 minutes
- Questionnaire, physical examination, bloods
47Assessment visit model
48Summary
- 80% power for genotype frequency = 0.1
- Genetic main effect ≈ 1.5, p = 10⁻⁴ → 2,000 cases
- Genetic main effect ≈ 1.3, p = 10⁻⁴ → 5,500 cases
- Genetic main effect ≈ 1.2, p = 10⁻⁴ → 12,600 cases
- Genetic main effect ≈ 1.7, p = 10⁻⁷ → 2,000 cases
- Genetic main effect ≈ 1.5, p = 10⁻⁷ → 3,400 cases
- Genetic main effect ≈ 1.3, p = 10⁻⁷ → 9,500 cases
- Genetic main effect ≈ 1.2, p = 10⁻⁷ → 21,500 cases
- GE interaction with environmental exp. prevalence = 0.5
- Interaction ≈ 2.0, p = 10⁻⁴ → 10,000 to 30,000 cases
49UK Biobank
- A prospective cohort study
- 500,000 adults (40-69 years) across UK
- A population-based biobank
- Not disease or exposure based
- Recruitment via electronic GP lists
- Broad spectrum, not fully representative
- Individuals, not families
- MRC, Wellcome Trust, DH, Scottish Executive
- £61M
50UK Biobank
- Initial data/sample collection and subsequent longitudinal health tracking
- Nested case-control studies
- Long time-horizon
- Owned by the Nation
- Central Administration: Manchester
- PI: Prof Rory Collins, Oxford
- 6 collaborating groups (RCCs) of university
scientists
51Smaller sample sizes