Title: Genome-wide association studies (GWAS)
1Genome-wide association studies (GWAS)
Thomas Hoffmann
2Outline for GWAS
- Review / Overview
- Design
- Analysis
- QC
- Prostate cancer example
- Imputation
- Replication & Meta-analysis
- Advanced analysis intro (more next lecture)
- Limitations & missing heritability
- Gene/pathway tests
- Polygenic models
4Manolio et al., Clin Invest 2008
6Recap: Association studies
(guilt by association)
Hirschhorn & Daly, Nat Rev Genet 2005
7GWAS Microarray
Assay 0.7 - 5M SNPs (keeps increasing)
Affymetrix, http://www.affymetrix.com
8Genotype calls
Bad calls!
Good calls!
10Genome-wide association studies (GWAS)
11One- and two-stage GWA designs
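[Figure: side-by-side schematics of a One-Stage Design and a Two-Stage Design. Each panel is a grid of SNPs (n markers) by Samples (n samples); the two-stage panel is split into Stage 1 and Stage 2.]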
12One-Stage Design
[Figure: SNPs-by-Samples schematics of the One-Stage Design and of the Two-Stage Design under Replication-based analysis and under Joint analysis, with the Stage 1 and Stage 2 blocks labeled 1 and 2.]
13Multistage Designs
- Joint analysis has more power than replication-based analysis
- p-value threshold in Stage 1 must be liberal
- Lower cost, but no gain in power over a one-stage design
- CaTS power calculator: http://www.sph.umich.edu/csg/abecasis/CaTS/index.html
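The difference between replication-based and joint analysis can be sketched numerically. This is a minimal illustration, not the CaTS implementation: it assumes the joint statistic is a sample-size-weighted sum of the per-stage z-statistics (a common formulation), and the function names, thresholds, and z-values are all hypothetical.

```python
import math

def joint_z(z1, z2, frac_stage1):
    """Combine per-stage z-statistics, weighting each stage by the square
    root of its share of the total sample (assumed formulation)."""
    return math.sqrt(frac_stage1) * z1 + math.sqrt(1.0 - frac_stage1) * z2

def replication_hit(z1, z2, c1, c2):
    """Replication-based analysis: Stage 1 must pass c1, and Stage 2 must
    pass c2 on its own, with the same direction of effect."""
    return abs(z1) >= c1 and abs(z2) >= c2 and (z1 * z2 > 0)

def joint_hit(z1, z2, frac_stage1, c1, c_joint):
    """Joint analysis: Stage 1 must pass the liberal threshold c1, then the
    combined statistic is compared with c_joint."""
    return abs(z1) >= c1 and abs(joint_z(z1, z2, frac_stage1)) >= c_joint

# Hypothetical SNP with moderate evidence in each stage; half of the
# samples are genotyped in Stage 1. All thresholds are illustrative only.
z1, z2 = 3.5, 3.5
print(joint_z(z1, z2, 0.5))
print(replication_hit(z1, z2, c1=3.0, c2=4.0))
print(joint_hit(z1, z2, 0.5, c1=3.0, c_joint=4.5))
```

With these made-up numbers the SNP fails the Stage 2 threshold on its own but passes once the two stages are pooled, which is the intuition behind the power advantage of joint analysis.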
14Genome-wide Sequence Studies
- Trade-off between number of samples, depth, and genomic coverage.
Sample Size   Depth   MAF 0.5-1%    MAF 2-5%
1,000         20x     perfect       perfect
2,000         10x     r2 = 0.98     r2 = 0.995
4,000         5x      r2 = 0.90     r2 = 0.98
More later in Next generation sequencing (NGS) lecture
Gonçalo Abecasis
15Near-term sequencing design choices
- For example, between
- Sequencing few subjects with extreme phenotypes
- e.g., 200 cases, 200 controls, 4x coverage. Then follow-up in larger population.
- 10M SNP chip based on 1000 Genomes.
- 5K cases, 5K controls.
- Which design will work best?
- More later in Next generation sequencing (NGS) lecture
16Design choices
- GWAS Microarray
- Only assay SNPs designed into array (0.7-5 million)
- Much cheaper (so many more subjects)
- Genotypes currently more reliable
- GWAS Sequencing
- De novo discovery (particularly good for rare variants)
- More expensive, though costs are falling (so many fewer subjects)
- Need much more extensive IT support
- Lots of interesting interpretation problems (field rapidly evolving)
17Design choices
- Exome Microarray
- Only assay SNPs designed into array (300K + custom), in exons only, that could affect protein-coding function
- Cheapest (so many more subjects)
- Genotypes currently more reliable (some question about rarest, but preliminary results good)
- Exome Sequencing
- De novo discovery (particularly good for rare variants)
- Coverage of exons only
- More expensive than microarrays, less expensive than GWAS sequencing
- Need more extensive IT support
- Lots of interesting interpretation problems
18Size of study
Visscher, AJHG 2012
19Size of study
Visscher, AJHG 2012
20Biggest studies...
- GWAS Microarray: 100,000 people in the Kaiser RPGEH, still to be analyzed (Hoffmann et al., Genomics, 2011a,b)
- Sequencing: 1000 Genomes Project (though not disease focused, low coverage issues)
- Exome Sequencing: GO ESP (12,031 subjects, for exome microarray design)
22QC Steps
- Filter SNPs and individuals
- MAF, low call rates
- Test for HWE among controls within ethnic groups. Use a conservative alpha level.
- Check for relatedness. Identity-by-state calculations.
- Check genotype-inferred gender against reported gender
- Filter Mendelian inheritance errors (family-based, or potentially cryptic relatedness, if large enough sample)
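The per-SNP filters in the list above (MAF, call rate, HWE in controls) can be sketched as follows. This is a minimal illustration using a 1-df chi-square HWE test; the function names and the default thresholds (MAF 0.01, call rate 0.95, HWE alpha 1e-6) are assumptions chosen for the example, not values from the lecture.

```python
import math

def minor_allele_freq(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1.0 - p)

def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(control_counts, n_called, n_total,
              min_maf=0.01, min_call_rate=0.95, hwe_alpha=1e-6):
    """Apply the three filters; all thresholds are illustrative assumptions.
    For simplicity MAF is computed from the control genotype counts."""
    maf = minor_allele_freq(*control_counts)
    call_rate = n_called / n_total
    return (maf >= min_maf
            and call_rate >= min_call_rate
            and hwe_pvalue(*control_counts) >= hwe_alpha)

# Hypothetical SNP: genotype counts among controls, 990 of 1000 samples called.
print(passes_qc((250, 500, 250), n_called=990, n_total=1000))
```

The small HWE alpha reflects the "conservative alpha-level" bullet: a SNP is dropped only when the departure from equilibrium is extreme.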
23Check for relatedness, e.g., HapMap
- Pemberton et al., AJHG 2010
24GWAS analysis
- Most common approach: look at each SNP one-at-a-time.
- Possibly add in multi-marker information.
- Further investigate / report top SNPs only.
- Or backwards replication
P-values
25GWAS analysis
- Additive coding of SNP most common, just a covariate in a regression framework
- Dichotomous phenotype: logistic regression
- Continuous phenotype: linear regression
- Correct for multiple comparisons
- e.g., Bonferroni, 1 million tests gives α = 5x10^-8
- more next time
- Adjust for potential population stratification
- principal components (PCs), on best performing SNPs
- software usually does LD filter (e.g., Eigensoft)
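The Bonferroni arithmetic and a single-SNP additive test can be sketched as below. Note the substitution: instead of fitting a logistic regression, the sketch uses the Cochran-Armitage trend statistic (N times the squared correlation between allele dosage and case status), which is the score-test counterpart of the additive logistic model without covariates. No PC adjustment is included, and the function names and toy data are hypothetical.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level for a family-wise error rate of alpha."""
    return alpha / n_tests

def additive_trend_test(dosages, is_case):
    """Trend test for additive coding: chi2 = N * r^2 on 1 df, where r is
    the correlation between dosage (0/1/2) and case status (0/1)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(is_case) / n
    s_gy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, is_case))
    s_gg = sum((g - mean_g) ** 2 for g in dosages)
    s_yy = sum((y - mean_y) ** 2 for y in is_case)
    if s_gg == 0 or s_yy == 0:
        return 0.0, 1.0  # monomorphic SNP or a single phenotype class
    chi2 = n * s_gy ** 2 / (s_gg * s_yy)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 df
    return chi2, p_value

threshold = bonferroni_threshold(0.05, 1_000_000)
print(threshold)

# Toy data: dosage of the tested allele and case (1) / control (0) status.
chi2, p = additive_trend_test([0, 1, 2, 1, 2, 0], [0, 0, 1, 1, 1, 0])
print(chi2, p, p < threshold)
```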
26Adjusting for PCs (recap)
Balding, Nature Reviews Genetics 2010
27Adjusting for PC's
28Adjusting for PC's
- Razib, Current Biology 2008
29Adjusting for PC's
30QQ-plots and PC adjustment
31Quantile-quantile (QQ) plot
32Example GWAS of Prostate Cancer
chromosome
http://cgems.cancer.gov
Multiple prostate cancer loci on 8q24
Witte, Nat Genet 2007
33Prostate Cancer Replications
Chr Reg    SNP         Alleles  Freq Cntrl  Freq Case  OR    p value      Nearby Genes / Fcn
2p15       rs721048    G/A      0.19        0.21       1.15  7.7x10^-9    EHBP1: endocytic trafficking
3p12       rs2660753   C/T      0.10        0.12       1.30  2.7x10^-8    Intergenic
6q25       rs9364554   C/T      0.29        0.33       1.21  5.5x10^-10   SLC22A3: drugs and toxins
7q21       rs6465657   T/C      0.46        0.50       1.19  1.1x10^-9    LMTK2: endosomal trafficking
8q24 (2)   rs16901979  C/A      0.04        0.06       1.52  1.1x10^-12   Intergenic
8q24 (3)   rs6983267   T/G      0.50        0.56       1.25  9.4x10^-13   Intergenic
8q24 (1)   rs1447295   C/A      0.10        0.14       1.42  6.4x10^-18   Intergenic
10q11      rs10993994  C/T      0.38        0.46       1.38  8.7x10^-29   MSMB: suppressor prop.
10q26      rs4962416   T/C      0.27        0.32       1.18  2.7x10^-8    CTBP2: antiapoptotic activity
11q13      rs7931342   T/G      0.51        0.56       1.21  1.7x10^-12   Intergenic
17q12      rs4430796   G/A      0.49        0.55       1.22  1.4x10^-11   HNF1B: suppressor properties
17q24      rs1859962   T/G      0.46        0.51       1.20  2.5x10^-10   Intergenic
19q13      rs2735839   A/G      0.83        0.87       1.37  1.5x10^-18   KLK2/KLK3: PSA
Xp11       rs5945619   T/C      0.36        0.41       1.29  1.5x10^-9    NUDT10, NUDT11: apoptosis
Witte, Nat Rev Genet 2009
Modest ORs
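To see why these odds ratios count as modest, a crude allelic odds ratio can be recomputed from the control and case allele frequencies in the table. This is a rough sketch with a hypothetical function name; because the frequencies are rounded and the published ORs come from the original studies' models, the result is close to, but not identical to, the tabulated OR.

```python
def allelic_odds_ratio(freq_control, freq_case):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    odds_case = freq_case / (1.0 - freq_case)
    odds_control = freq_control / (1.0 - freq_control)
    return odds_case / odds_control

# Frequencies as printed in the table for rs10993994 (10q11): 0.38 vs 0.46.
print(round(allelic_odds_ratio(0.38, 0.46), 2))  # 1.39, table reports 1.38
```

Even the strongest signal in the table corresponds to a shift of only a few percentage points in allele frequency between controls and cases.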
35SNPs Missed in Replication?
24,223 smallest P-value!
Witte, Nat Rev Genet, 2009
36Population Attributable Risks for GWAS
Smoking lung cancer
BRCA1 Breast cancer
Jorgenson & Witte, 2009
37Imputation of SNP Genotypes
- Combine data from different platforms (e.g., Affy & Illumina) (for replication / meta-analysis).
- Estimate unmeasured or missing genotypes.
- Based on measured SNPs and external info (e.g., haplotype structure of HapMap).
- Increase GWAS power (impute and analyze all), e.g., sick sinus syndrome: most significant was a 1000 Genomes imputed SNP (Holm et al., Nature Genetics, 2011)
- HapMap as reference, now 1000 Genomes Project?
38Imputation Example
Study Sample
HapMap/ 1K genomes
Gonçalo Abecasis
- http://www.shapeit.fr/
- http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
- http://faculty.washington.edu/browning/beagle/beagle.html
- http://www.sph.umich.edu/csg/abecasis/MACH/download/
39Identify Match with Reference
Gonçalo Abecasis
40Phase chromosomes, impute missing genotypes
Gonçalo Abecasis
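The idea of the last three slides (match the study sample to a reference panel, then fill in untyped positions) can be illustrated with a deliberately simplified toy. This is not how SHAPEIT, IMPUTE2, Beagle, or MaCH work internally; those tools use probabilistic haplotype models over unphased genotypes. The sketch assumes an already-phased study haplotype, picks the single closest reference haplotype, and copies its alleles into the gaps. All names and sequences are hypothetical.

```python
def impute_haplotype(study_hap, reference_haps, missing="?"):
    """Fill untyped positions of a phased study haplotype from the reference
    haplotype that best matches it at the typed positions."""
    typed = [i for i, allele in enumerate(study_hap) if allele != missing]

    def mismatches(ref):
        # Number of typed positions where the reference disagrees.
        return sum(study_hap[i] != ref[i] for i in typed)

    # min() keeps the first reference haplotype in case of ties.
    best = min(reference_haps, key=mismatches)
    return "".join(best[i] if allele == missing else allele
                   for i, allele in enumerate(study_hap))

# Toy reference panel and a study haplotype typed at positions 0 and 2 only.
reference = ["ATCG", "GTCA", "ACTG"]
print(impute_haplotype("A?C?", reference))
```

A real imputation tool would report a probability for each filled-in genotype rather than a single hard call, and downstream association tests typically use those probabilities.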
41Imputation Application
TCF7L2 gene region, T2D, from the WTCCC data
Observed genotypes: black. Imputed genotypes: red.
Chromosomal Position
Marchini, Nature Genetics 2007, http://www.stats.ox.ac.uk/marchini/software
42Replication
- To replicate
- Association test for replication sample significant at 0.05 alpha level
- Same mode of inheritance
- Same direction
- Sufficient sample size for replication
- Non-replication is not necessarily a false positive
- LD structures, different populations (e.g., flip-flop)
- covariates, phenotype definition, underpowered
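Two of the replication criteria above, significance at the 0.05 level and same direction of effect, are easy to check mechanically. The sketch below covers only those two; mode of inheritance and adequacy of the replication sample size still need to be judged separately. The function name and example values are hypothetical.

```python
import math

def replicates(or_discovery, or_replication, p_replication, alpha=0.05):
    """True if the replication estimate is significant at alpha and its
    log odds ratio has the same sign as in the discovery sample."""
    same_direction = math.log(or_discovery) * math.log(or_replication) > 0
    return same_direction and p_replication < alpha

print(replicates(1.20, 1.15, 0.01))  # same direction, significant
print(replicates(1.20, 0.85, 0.01))  # significant but opposite direction (flip-flop)
print(replicates(1.20, 1.15, 0.20))  # same direction, not significant
```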
43Meta-analysis
- Combine multiple studies to increase power
- Either combine p-values (Fisher's test),
- or z-scores (better)
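Both ways of combining studies can be sketched in a few lines. The example assumes Fisher's method in its standard form (minus twice the sum of log p-values, compared with a chi-square on 2k degrees of freedom) and a z-score combination weighted by the square root of each study's sample size; the weighting scheme and function names are assumptions for illustration. One reason z-scores are preferred is that they keep the sign of the effect, whereas Fisher's method ignores direction.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) ~ chi-square with 2k df."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for even df (2k):
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total

def weighted_z(z_scores, sample_sizes):
    """Combine signed z-scores with weights sqrt(n) (assumed weighting)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical results for one SNP in three studies.
print(fisher_combined_p([0.01, 0.04, 0.20]))
z = weighted_z([2.6, 2.1, 1.3], [2000, 1500, 800])
print(z, two_sided_p(z))
```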
44(Meta-analysis) Example GWAS of Prostate Cancer
chromosome
http://cgems.cancer.gov
Multiple prostate cancer loci on 8q24
Witte, Nat Genet 2007
45Replication & Meta-analysis
46Meta-analysis
48Limitations of GWAS
Example: AUC for Breast Cancer Risk
- 58%: Gail model (# first-degree relatives w/ bc, age menarche, age first live birth, number of previous biopsies) + age, study, entry year
- 58.9%: SNPs
- 61.8%: Combined
Wacholder et al., NEJM 2010
Witte, Nat Rev Genet 2009
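For reference, the AUC quoted on this slide is the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with 50% meaning no discrimination. A minimal pairwise sketch, with a hypothetical function name and made-up scores:

```python
def auc(case_scores, control_scores):
    """Fraction of case-control pairs in which the case has the higher
    score; ties count as half."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical risk scores for four cases and four controls.
print(auc([0.62, 0.55, 0.48, 0.71], [0.50, 0.45, 0.58, 0.40]))
```

On this scale, values around 58% to 62% are only modestly better than chance, which is the point of the slide.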
49Limitations of GWAS
- Not very predictive
- Explain little heritability
- Focus on common variation
- Many associated variants are not causal
50Where's the heritability?
Visscher, AJHG 2011
51Where's the heritability?
Common disease rare variant (CDRV) hypothesis: diseases due to multiple rare variants with intermediate penetrances (allelic heterogeneity)
Many more of these?
See NEJM, April 30, 2009
McCarthy et al., 2008
52Will GWAS results explain more heritability?
- Possibly, if
- Causal SNPs not yet detected due to power / practical issues (e.g., not yet included in replication studies).
- Stronger effects for causal SNPs
- Associated SNP may only serve as a marker for multiple different causal SNPs.
53Gene/pathway-based tests
- Various ways of collapsing the genotype information in multiple genes
- Less multiple comparison adjustment
- $\operatorname{logit} P(y = 1 \mid x, c) = \alpha + \beta^\top x + \gamma^\top c$
- e.g., $\alpha + \beta_1 x_1 + \dots + \beta_{12} x_{12} + \text{PCs, other covariates}$
- y is disease status
- x is a vector of genotypes (e.g., a gene, or a pathway)
- c is a vector of covariates
- $H_0: \beta = 0$
54Gene/pathway-based tests
- $\operatorname{logit} P(y = 1 \mid x, c) = \alpha + \beta^\top x + \gamma^\top c$
- e.g., $\alpha + \beta_1 x_1 + \dots + \beta_{12} x_{12} + \gamma_1 \mathrm{PC}_1 + \dots + \gamma_4 \mathrm{PC}_4 + \dots$
- One example: Kernel machine (question from last time)
- Simplest case of linear kernel reduces to linear/logistic regression (model above)
- More complicated function of genotypes can be tested, e.g., interactions, etc.
- Gory details: variance components score test, h(x) in paper (Wu et al., AJHG 2010)
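The flavor of a variance components score test with a linear kernel can be sketched as follows. This is a simplified illustration, not the procedure from the cited paper: the null model here is intercept-only (no PCs or other covariates), scaling conventions for the statistic differ between papers, and the p-value, which requires the null distribution of the statistic, is not computed. The function name and toy data are hypothetical.

```python
def linear_kernel_score_stat(genotypes, y):
    """Q = r' K r with K = X X' (linear kernel), where r holds the residuals
    of case status under an intercept-only null model.
    genotypes: one row per person, one column per SNP (0/1/2 dosages)."""
    n = len(y)
    mean_y = sum(y) / n
    residuals = [yi - mean_y for yi in y]
    n_snps = len(genotypes[0])
    q = 0.0
    for s in range(n_snps):
        # Per-SNP score: dosage-weighted sum of residuals.
        score_s = sum(genotypes[j][s] * residuals[j] for j in range(n))
        q += score_s ** 2
    return q

# Toy gene with 3 SNPs genotyped in 6 people (3 controls, then 3 cases).
X = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [2, 1, 0], [1, 2, 1], [2, 1, 0]]
y = [0, 0, 0, 1, 1, 1]
print(linear_kernel_score_stat(X, y))
```

With the linear kernel the statistic is just the sum of squared per-SNP scores, so the whole gene is tested at once instead of one test per SNP, which is where the reduced multiple-comparison burden comes from.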
55Pathways - how to define?
- Many websites / companies provide dynamic graphic models of molecular and biochemical pathways.
- Example: BioCarta, http://www.biocarta.com/
- May be interested in potential joint and/or interaction effects of multiple genes in one pathway.
56Polygenic Models
- Many weak associations combine to risk?
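- Score model (use all GWAS SNPs): $x_j = \sum_{i=1}^{m} \ln(\mathrm{OR}_i) \times \mathrm{SNP}_{ij}$
- where
- $\ln(\mathrm{OR}_i)$ = score for $\mathrm{SNP}_i$ from discovery sample
- $\mathrm{SNP}_{ij}$ = # of alleles (0,1,2) for $\mathrm{SNP}_i$, person j in validation sample.
- Large number of SNPs (m)
- $x_j$ associated with disease?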
ISC / Purcell et al. Nature 2009
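The score described on this slide is a weighted allele count: each person's count for a SNP is multiplied by that SNP's log odds ratio from the discovery sample, and the products are summed over all SNPs. A minimal sketch, with a hypothetical function name and made-up odds ratios and genotypes:

```python
import math

def polygenic_score(odds_ratios, allele_counts):
    """Sum over SNPs of ln(OR_i) * (number of scored alleles, 0/1/2)."""
    return sum(math.log(odds) * count
               for odds, count in zip(odds_ratios, allele_counts))

# Discovery-sample odds ratios for m = 4 SNPs (hypothetical values).
discovery_or = [1.20, 0.90, 1.05, 1.38]

# Allele counts for two people in the validation sample (hypothetical).
person_a = [2, 0, 1, 2]
person_b = [0, 2, 1, 0]

print(polygenic_score(discovery_or, person_a))
print(polygenic_score(discovery_or, person_b))
```

The scores computed in the validation sample are then tested for association with disease status, for example as a single covariate in a logistic regression.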
57Application of Model
Purcell / ISC et al. Nature 2009
58Application to CGEMs PCa GWAS
- 1,172 cases, 1,157 controls from PLCO Trial
- Oversampled more aggressive cases.
- Illumina 550K array.
- PCa stratified by disease aggressiveness.
- Split into halves, resampling
- one as discovery sample
- other as validation.
- LD filter: r2 threshold of 0.5.
Witte & Hoffmann, OMICS 2010
59Results for Prostate Cancer
60Common Polygenic Model for Prostate and Breast Cancer?
- CGEMs GWAS data on prostate and breast cancer.
- Use one cancer as discovery sample, the other as validation.
Nat Rev Cancer 2010;10:205-212
61Results for PCa & BrCa
62Complex diseases
[Figure: web of interrelated causes: physical activity, genetic susceptibility, obesity, hyperlipidemia, diet, diabetes, vulnerable plaques, hypertension, atherosclerosis, MI]
Complex diseases: many causes = many causal pathways!
63Moving Beyond Genome
- Transcriptome: all messenger RNA molecules (transcripts)
- Proteome: all proteins in cell or organism
- Metabolome: all metabolites in a biological organism (end products of its gene expression).