Title: Statistical Issues in Development and Evaluation of Genetic Risk Prediction Models
1Statistical Issues in Development and Evaluation
of Genetic Risk Prediction Models
- Nilanjan Chatterjee, PhD
- Chief and Senior Investigator
- Biostatistics Branch, Division of Cancer
Epidemiology and Genetics
2Thanks to team science!
- Biostatistics Branch
- JuHyun Park, Fellow
- Paige Maas, Fellow
- Jianxin Shi, TT Investigator
- Joshua Sampson, TT Investigator
- Bin Zhu, TT Investigator
- Mitchell Gail, Investigator
- Minsun Song, Fellow
- DCEG
- Stephen Chanock, Director
- Nat Rothman, Investigator
- Debra Silverman, Investigator
- Other Institutions/Collaborations
- Peter Kraft, HSPH
- Montserrat Garcia-Closas, ICR, UK
- Cambridge University, UK
- German Cancer Research Center
- BPC3 Consortium
- BCAC Consortium
4Utility of Risk Models
- Individual counseling
- Weighing risks and benefits of various preventive interventions
- Screening, medication, risk-factor modification
- Understanding the distribution of risk at the population level to inform public health strategies for prevention
- Comparative effectiveness studies
- Design of intervention trials
5Methodological Issues
- Sample size and study design
- Model building
- Polygenic risk score (PRS)
- Incorporating environmental risk-factors
- Using external information
- Model calibration
- Model validation and evaluation
6Limited Discriminatory Ability of Early GWAS
Discoveries
A tiny step to personalized risk prediction of
breast cancer
- Devilee and Rookus, NEJM, Editorial
7Many more to be found
8Utility of Foreseeable Cancer SNPs
| Cancer Site | Family History Only | Known SNPs | Foreseeable SNPs | Family History and Known SNPs | Family History and Foreseeable SNPs | Epidemiologic Risk-Factors and Foreseeable SNPs |
|---|---|---|---|---|---|---|
| BREAST | 0.536 | 0.599 | 0.635 | 0.613 | 0.646 | 0.670 |
| PROSTATE | 0.549 | 0.647 | 0.676 | 0.668 | 0.694 | - |
| COLORECTUM | 0.528 | 0.582 | 0.616 | 0.598 | 0.629 | 0.658 |
| OVARY | 0.509 | 0.557 | 0.568 | 0.564 | 0.575 | - |
| BLADDER | 0.514 | 0.596 | 0.615 | 0.602 | 0.620 | 0.726 |
| GLIOMA | 0.503 | 0.597 | 0.621 | 0.598 | 0.622 | - |
| PANCREAS | 0.517 | 0.576 | 0.600 | 0.588 | 0.610 | - |
Park et al., JCO, 2012
9Hidden Heritability for Complex Traits
| Trait | HT | BMI | TC | HDL | LDL | CD | T1D | T2D | PrCA | CAD |
|---|---|---|---|---|---|---|---|---|---|---|
| Narrow sense heritability | 0.45 | 0.14 | - | 0.12 | - | 0.22 | 0.30 | 0.51 | 0.22 | - |
| Effective sample-size for the largest GWAS | 133K | 162K | 100K | 100K | 95K | 25K | 22K | 36K | 28K | 73K |
| No. of detected SNPs | 108 | 31 | 45 | 35 | 36 | 64 | 30 | 22 | 20 | 21 |
| Heritability explained by detected SNPs | 0.066 | 0.014 | 0.063 | 0.046 | 0.059 | 0.066 | 0.053 | 0.034 | 0.061 | 0.024 |
- Heritability: fraction of total variance attributable to susceptibility (quantitative traits) and sibling recurrence risks (qualitative traits)
10Challenges
- Many loci with very small effects are undetectable at genome-wide significance level
- Can we still exploit them to improve risk prediction?
- Using a more liberal threshold or a fancier penalized regression method?
- Needs an understanding of power in the context of prediction
11Predictive Correlation Coefficient (PCC)
- Covariances and variances are taken with respect to randomness of a new observation for which prediction is desired
- Remaining randomness is due to that of the training dataset
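As a sketch of the quantity these bullets describe, write Y for the outcome of a new individual and Ŷ for the prediction from a model fitted to the training data (these symbols are assumed here rather than taken from the slide):

```latex
\mathrm{PCC} \;=\; \frac{\operatorname{cov}\!\left(Y, \hat{Y}\right)}{\sqrt{\operatorname{var}(Y)\,\operatorname{var}(\hat{Y})}}
```

Here cov and var are taken over the new observation with the fitted model held fixed, so PCC is itself a random quantity that varies with the training dataset.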
12The Expected PCC value for GWAS Polygenic Models
- Parameters of genetic architecture
- Properties of the statistical method
- For fixed N, the optimal threshold α_opt(N) can be chosen by maximizing μ(N, α)
Chatterjee et al., Nature Genetics, 2013
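The idea of choosing the selection threshold to maximize predictive correlation can be illustrated with a small simulation. This is a minimal sketch under strong simplifying assumptions, not the analytic expected-PCC calculation from the cited paper: "genotypes" are independent standard normal variables, the trait is quantitative, SNPs are selected on the absolute marginal z-statistic, and the PCC is estimated empirically on an independent test sample. All function names and parameter values are hypothetical.

```python
import math
import random


def draw_sample(n, beta, noise_sd, rng):
    """Draw n individuals: independent standardized 'genotypes' and a quantitative trait."""
    X = [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, noise_sd) for row in X]
    return X, y


def marginal_estimates(X, y):
    """Per-SNP marginal slope and approximate z-statistic (predictors have unit variance)."""
    n = len(y)
    mean_y = sum(y) / n
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    estimates = []
    for j in range(len(X[0])):
        slope = sum(row[j] * (v - mean_y) for row, v in zip(X, y)) / n
        estimates.append((slope, slope * math.sqrt(n) / sd_y))
    return estimates


def pearson(a, b):
    """Sample correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def pcc_by_cutoff(z_cutoffs, n_train=400, n_test=400, m=200, m_causal=50, h2=0.5, seed=1):
    """Empirical PCC of a thresholded polygenic score for each |z| cutoff."""
    rng = random.Random(seed)
    # Assumed architecture: m_causal SNPs with normal effects, total genetic variance about h2.
    beta = [rng.gauss(0.0, math.sqrt(h2 / m_causal)) if j < m_causal else 0.0 for j in range(m)]
    noise_sd = math.sqrt(1.0 - h2)
    X_train, y_train = draw_sample(n_train, beta, noise_sd, rng)
    X_test, y_test = draw_sample(n_test, beta, noise_sd, rng)
    estimates = marginal_estimates(X_train, y_train)
    out = {}
    for cutoff in z_cutoffs:
        # Keep the estimated slope as a weight only if the SNP passes the cutoff.
        weights = [slope if abs(z) >= cutoff else 0.0 for slope, z in estimates]
        prs = [sum(w * x for w, x in zip(weights, row)) for row in X_test]
        out[cutoff] = pearson(prs, y_test)
    return out


# |z| >= 5.3 corresponds roughly to a two-sided alpha of 10^-7; 0.0 keeps every SNP.
cutoffs = [0.0, 1.0, 2.0, 3.0, 5.3]
results = pcc_by_cutoff(cutoffs)
best = max(results, key=results.get)
```

Printing `results` and `best` shows how the empirical PCC changes as the cutoff is relaxed for one simulated training set; averaging over many seeds would approximate the expected value that the slide refers to.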
13Further Results
- Many measures of discriminatory performance of a risk model have a one-to-one relationship with PCC
- Can project performance of models that include polygenic risk score (PRS) and family history
- Family history effect is attenuated by a quantity related to PCC
Chatterjee et al., Nature Genetics, 2013
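One standard instance of such a one-to-one mapping, given here as a sketch and not necessarily the exact relationship derived in the cited paper: assume the log relative-risk score is normally distributed with standard deviation sigma among controls, the disease is rare, and risk is log-linear in the score. The score among cases is then normal with the same standard deviation and a mean shifted up by sigma squared, so the AUC depends on sigma alone. The function names below are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def auc_from_log_rr_sd(sigma):
    """AUC for a score that is N(mu, sigma^2) in controls and N(mu + sigma^2, sigma^2) in cases."""
    # The case-minus-control difference is N(sigma^2, 2 * sigma^2), so
    # P(case score > control score) = Phi(sigma / sqrt(2)).
    return normal_cdf(sigma / math.sqrt(2.0))


auc_no_signal = auc_from_log_rr_sd(0.0)   # 0.5: the score carries no information
auc_unit_sd = auc_from_log_rr_sd(1.0)     # about 0.76
```

Because the mapping is monotone in sigma, any projection of the variance explained by the score translates directly into a projected AUC under these assumptions.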
14AUC (Contd)
| Trait (AUC with FH alone) | Model | Current sample size (N), α = 10^-7 | Current sample size (N), α_opt | 3xN, α = 10^-7 | 3xN, α_opt | 5xN, α = 10^-7 | 5xN, α_opt |
|---|---|---|---|---|---|---|---|
| T2D (0.595) | SNPs | 0.570 | 0.598 | 0.617 | 0.704 | 0.660 | 0.750 |
| T2D (0.595) | SNPs+FH | 0.632 | 0.654 | 0.667 | 0.736 | 0.700 | 0.776 |
| PrCA (0.552) | SNPs | 0.621 | 0.625 | 0.637 | 0.648 | 0.646 | 0.673 |
| PrCA (0.552) | SNPs+FH | 0.648 | 0.651 | 0.661 | 0.670 | 0.669 | 0.692 |
| CAD (0.601) | SNPs | 0.582-0.584 | 0.587-0.589 | 0.595-0.604 | 0.612-0.650 | 0.603-0.629 | 0.635-0.676 |
| CAD (0.601) | SNPs+FH | 0.647-0.648 | 0.651-0.652 | 0.656-0.663 | 0.669-0.697 | 0.663-0.681 | 0.686-0.717 |
15Architecture of Joint Effects: Implications for
Disease Prevention
16Breast Cancer Risk Modeling: BPC3 Study
- 17,176 cases and 19,860 controls from 8 prospective studies
- Risk factors
- Family history, height, reproductive risk-factors, smoking, BMI, alcohol and HRT use
- SNPs
- 24 genotyped SNPs, imputed PRS for 86 SNPs
17Steps for Building Absolute Risk Model and
Projecting Risk Distribution
- Develop models for relative risk
- Construction of efficient PRS, model selection for gene-gene/gene-environment interaction
- Utilize rates from SEER cancer registry to calibrate absolute risk to the US population
- Use national survey data to project risk distribution
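The calibration step can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it rescales population incidence rates by the population mean relative risk to obtain baseline hazards, then accumulates hazard over age intervals while ignoring competing mortality. The rates and the mean relative risk below are made-up numbers (not SEER or survey values), and the function names are hypothetical.

```python
import math


def baseline_hazards(pop_rates, mean_rr):
    """Rescale population incidence rates so that a person at the mean relative risk recovers them."""
    return [rate / mean_rr for rate in pop_rates]


def absolute_risk(rr, pop_rates, mean_rr, years_per_interval=5.0):
    """Cumulative risk across consecutive age intervals for relative risk rr, ignoring competing mortality."""
    cumulative_hazard = sum(
        h * rr * years_per_interval for h in baseline_hazards(pop_rates, mean_rr)
    )
    return 1.0 - math.exp(-cumulative_hazard)


# Made-up annual incidence rates for three consecutive 5-year age bands (not SEER values).
example_rates = [0.0015, 0.0020, 0.0025]
# Made-up population mean relative risk, e.g. averaged over a reference survey sample.
example_mean_rr = 1.3

# A person at the population mean relative risk gets the population-level 15-year risk.
risk_average = absolute_risk(example_mean_rr, example_rates, example_mean_rr)
# A person at twice the mean relative risk.
risk_high = absolute_risk(2.0 * example_mean_rr, example_rates, example_mean_rr)
```

Applying `absolute_risk` to the relative risk of every individual in a representative survey sample would give a projected distribution of absolute risk for the population.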
18Gene-gene/Gene-Environment Interactions in
Disease-risk
- Interaction in what scale?
- Logistic, probit (liability threshold), additive
- Little evidence of SNP-SNP/SNP-E interactions under the logistic scale
- Lack of power or are risks truly multiplicative?
- Does the scale matter?
- Important to have good model-fit at extremes of disease risks
- Clinically important
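A small numerical sketch shows why the choice of scale matters most at the extremes of risk. It compares the joint relative risk of a hypothetical individual carrying several risk factors when individual relative risks multiply versus when excess relative risks add. The function names and the example profile are illustrative assumptions.

```python
import math


def joint_rr_multiplicative(rrs):
    """Joint relative risk when individual relative risks multiply."""
    return math.prod(rrs)


def joint_rr_additive(rrs):
    """Joint relative risk when excess relative risks (RR - 1) add."""
    return 1.0 + sum(rr - 1.0 for rr in rrs)


# Hypothetical carrier of ten risk factors, each with relative risk 1.2.
profile = [1.2] * 10
mult = joint_rr_multiplicative(profile)   # 1.2 ** 10, about 6.19
add = joint_rr_additive(profile)          # 1 + 10 * 0.2 = 3.0
```

For a single factor the two scales agree, and the gap widens as more risk factors accumulate, which is where model fit at the tails becomes clinically relevant.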
19Linear Logistic vs Linear Additive Null Models
- Linear logistic
- Linear additive
- Can be fitted in the logistic scale under rare
disease assumption
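For reference, the standard forms of these two null models can be sketched as follows, for risk factors G_1, ..., G_K and disease indicator D (the notation is assumed here, not taken from the slide):

```latex
\begin{align*}
\text{Linear logistic:}\quad & \operatorname{logit} P(D=1 \mid G_1,\dots,G_K) = \alpha + \sum_{k=1}^{K} \beta_k G_k \\
\text{Linear additive:}\quad & P(D=1 \mid G_1,\dots,G_K) = \alpha + \sum_{k=1}^{K} \beta_k G_k \\
\text{Rare disease:}\quad & \operatorname{logit} P(D=1 \mid G_1,\dots,G_K) \approx \log\alpha + \log\Bigl(1 + \sum_{k=1}^{K} \gamma_k G_k\Bigr), \qquad \gamma_k = \beta_k / \alpha
\end{align*}
```

The last line uses the additive model's parameters together with logit(p) ≈ log(p) for small p, which is what allows the additive model to be expressed, and hence fitted, in the logistic scale.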
22A Tail-based Goodness-of-fit Test (also a global
test for interaction)
Song et al. (Biostatistics, In Press)
23
| Test | Multiplicative model, complete case analysis, Hom OR | Multiplicative model, complete case analysis, Het OR | Multiplicative model, analysis including subjects with missing genotypes, Hom OR | Multiplicative model, analysis including subjects with missing genotypes, Het OR | Additive model, complete case analysis, Hom OR | Additive model, complete case analysis, Het OR |
|---|---|---|---|---|---|---|
| Hosmer and Lemeshow test | 0.11 | 0.87 | . | . | 0.0003 | 0.01 |
| Tail-based test, C25 | 0.11 | 0.85 | 0.16 | 0.11 | 0 | 0 |
| Tail-based test, C100 | 0.20 | 0.77 | 0.23 | 0.17 | 0 | 0 |
24Statistically Speaking
- Multiplicative model could not be rejected even with a large dataset and a powerful method
- Fit seems adequate even at extremes
- Modest departure cannot be ruled out
- Additive model is soundly rejected
- Plethora of gene-gene interactions in the additive scale
25Does the Scale Matter Clinically?
- Stronger risk variation (or risk stratification) under the multiplicative than the additive model
- Proportion of the population identified at 2-fold or higher than average risk
- 1.16% under multiplicative model
- 0.02% under additive model
- Correlation in PRS under the two models: 0.93 (AUC is hardly different)
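How the share of the population above a fold-change cutoff depends on the spread of risk can be sketched under a multiplicative model. The sketch assumes that log-risk is normally distributed with standard deviation sigma, which is a common approximation for a polygenic score and not a result from the slides; the function names and sigma values are illustrative.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fraction_above_fold(sigma, fold=2.0):
    """Share of the population with risk >= fold * average risk, for log-risk ~ Normal(mu, sigma^2)."""
    # Average risk is exp(mu + sigma^2 / 2), so the cutoff on the standardized
    # log-risk scale is (log(fold) + sigma^2 / 2) / sigma; mu cancels out.
    cutoff = (math.log(fold) + 0.5 * sigma ** 2) / sigma
    return 1.0 - normal_cdf(cutoff)


share_narrow = fraction_above_fold(0.3)   # little spread in risk: very few people above 2-fold
share_wide = fraction_above_fold(1.0)     # about 0.116
```

Two models can rank individuals almost identically and still imply very different tail fractions, because this quantity depends on the spread of risk and not only on the ordering.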
26Concluding Remarks
- Translating heritability to predictability is hard
- Due to highly polygenic (non-sparse) architecture
- Multiplicative model for gene-gene and gene-environment interaction works amazingly well
- Time to seriously think about public health implications for joint effects
- Evaluate risk stratification
- Stop using AUC