Title: Genome-Wide Association Studies (GWAS)
1Genome-Wide Association Studies (GWAS)
- Epidemiology 243
- Molecular Epidemiology of Cancer
- Spring 2008
2Association Studies of Genetic Factors
- 1st generation
- Very small studies (<100 cases)
- Usually not an epidemiologic study design; 1-2 SNPs
- 2nd generation
- Small studies (100-500 cases)
- More epi focus; a few SNPs
- 3rd generation
- Large molecular epi studies (>500 cases)
- Proper epi design; pathways
- 4th generation
- Consortium-based pooled analyses (>2000 cases)
- GxE analyses
- 5th generation
- Post-GWS studies
Boffetta, 2007
3International Lung Cancer Consortium (ILCCO)
Wichmann
Risch
McLaughlin
Schwartz
Wild
Boffetta
Kiyohara
Harris
Brennan
Goodman
Benhamou
Wiencke
Tajima
Christiani
Zhang
Landi
Hong
Stucker
Vineis
Yang
Chen
Berwick
Lan
Lazarus
Spitz
Thun
Le Marchand
- 3 cohort studies
- 17 population-based case-control studies
- 13 hospital-based case-control studies
- 2 studies with mixed controls
- 1 cross-sectional study
4Issues in genetic association studies
- Many genes
- 25,000 genes, many can be candidates
- Many SNPs
- 12,000,000 SNPs; ability to predict functional
SNPs is limited
- Methods to select SNPs
- Only functional SNPs in a candidate gene
- Systematic screen of SNPs in a candidate gene
- Systematic screen of SNPs in an entire pathway
- Genomewide screen
- Systematic screen for all coding changes
5Introduction
- A genome-wide association study is an approach
that involves rapidly scanning markers across the
complete sets of DNA, or genomes, of many people
to find genetic variations associated with a
particular disease.
- Once new genetic associations are identified,
researchers can use the information to develop
better strategies to detect, treat and prevent
the disease. Such studies are particularly useful
in finding genetic variations that contribute to
common, complex diseases, such as asthma, cancer,
diabetes, heart disease and mental illnesses.
http://www.genome.gov/20019523
6Definition of GWAS
- A genome-wide association study is defined as
any study of genetic variation across the entire
human genome that is designed to identify genetic
associations with observable traits (such as
blood pressure or weight), or the presence or
absence of a disease (such as cancer) or
condition.
7Potential of GWAS
- Whole genome information, when combined with
epidemiological, clinical and other phenotype
data, offers the potential for increased
understanding of basic biological processes
affecting human health, improvement in the
prediction of disease and patient care, and
ultimately the realization of the promise of
personalized medicine.
- In addition, rapid advances in understanding the
patterns of human genetic variation and maturing
high-throughput, cost-effective methods for
genotyping are providing powerful research tools
for identifying genetic variants that contribute
to health and disease.
10Selection of SNPs (Genome-wide association
studies)
- Molecular
- Higher requirements: Affymetrix and Illumina
- Analytical
- Highest requirements: data management, automation
- Advantages
- No biological assumptions and can identify novel
genes/pathways
- Excellent chance to identify risk alleles
- Utility in individual risk assessment
- Disadvantages
- High costs
- Concern about multiple testing
11SNP Selection
12SNP Selection
13Affymetrix Genome-Wide Human SNP Array
- The new Affymetrix Genome-Wide Human SNP Array
6.0 features 1.8 million genetic markers,
including more than 906,600 single nucleotide
polymorphisms (SNPs) and more than 946,000 probes
for the detection of copy number variation. The
SNP Array 6.0 represents more genetic variation
on a single array than any other product,
providing maximum panel power and the highest
physical coverage of the genome.
14The need for GWA
- Current understanding of disease etiology is
limited
- Therefore, candidate genes or pathways are
insufficient
- Current understanding of functional variants is
limited
- Therefore, focusing on nonsynonymous changes
is not sufficient
- Results from linkage studies are often
inconsistent and broad
- Therefore, the utility of identified linkage
regions is limited
- GWA studies offer an effective and objective
approach
- Better chance to identify disease-associated
variants
- Improve understanding of disease etiology
- Improve ability to test gene-gene interaction and
predict disease risk
Xu JF, 2007
15GWA is promising
- Many diseases and traits are influenced by
genetic factors
- i.e., they are caused by sequence variants in the
genome
- Over 12 million SNPs are known in the genome
- i.e., some SNPs will be directly or indirectly
associated with causal variants
- The cost of SNP genotyping has been reduced
- i.e., it is affordable to genotype a large number
of SNPs in the genome
- Large numbers of cases and controls are available
- i.e., there is statistical power to detect
variants with modest effect
- When the above conditions are met
- associated SNPs will have different frequencies
between cases and controls
16GWA is challenging
- Many diseases and traits are influenced by
genetic factors
- But probably due to multiple modest risk variants
- They confer a stronger risk when they interact
- Truly associated SNPs are not necessarily highly
significant
- Too many SNPs are evaluated
- False positives due to multiple tests
- Single studies tend to be underpowered
- False negatives
- Considerable heterogeneity among studies
- Phenotypic and genetic heterogeneity
- False positives due to population stratification
Xu, 2007
17Genome coverage
- Two major platforms for GWA
- Illumina HumanHap300, HumanHap550, and
HumanHap1M
- Affymetrix GeneChip 100K, 500K, 1M, and 2.3M
- Genome-wide coverage
- The percentage of known SNPs in the genome that
are in LD with the genotyped SNPs
- Calculated based on HapMap
- Calculated based on ENCODE
Xu, 2007
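The coverage definition above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: LD is measured as the squared Pearson correlation between 0/1/2 genotype vectors rather than haplotype-based r^2, the r^2 >= 0.8 cutoff is a commonly used convention and not a value given in the slide, and the names `r_squared` and `coverage` are hypothetical. Real calculations also restrict comparisons to nearby SNPs.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

def coverage(reference, genotyped_ids, threshold=0.8):
    """Fraction of reference SNPs that are genotyped or in LD with a genotyped SNP.

    reference: dict mapping SNP id -> genotype vector (e.g. from a HapMap panel).
    genotyped_ids: ids of the SNPs present on the array.
    threshold: assumed r^2 cutoff for calling a SNP 'covered'.
    """
    covered = 0
    for snp_id, genotypes in reference.items():
        if snp_id in genotyped_ids or any(
            r_squared(genotypes, reference[tag]) >= threshold
            for tag in genotyped_ids
        ):
            covered += 1
    return covered / len(reference)

# Toy reference panel: b and c are in perfect LD with a, d is not.
panel = {
    "a": [0, 1, 2, 1, 0, 2],
    "b": [0, 1, 2, 1, 0, 2],
    "c": [2, 1, 0, 1, 2, 0],
    "d": [1, 0, 1, 2, 1, 0],
}
print(coverage(panel, {"a"}))
```

Because the denominator is the set of known SNPs in the reference panel, the same array can show different coverage when evaluated against HapMap and against ENCODE.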
18Strategies for pre-association analysis
- Quality control
- Filter SNPs by genotype call rates
- Filter SNPs by minor allele frequencies
- Filter SNPs by testing for Hardy-Weinberg
Equilibrium
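The three filters above can be sketched in Python as follows. This is a minimal illustration, not a production pipeline: the genotype coding (0/1/2 copies of one allele, `None` for a missing call), the function names, and the thresholds (call rate >= 0.95, MAF >= 0.01, HWE p >= 0.001) are all assumptions chosen for the example and are not stated in the slide.

```python
import math

def call_rate(genotypes):
    """Fraction of samples with a non-missing call (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_freq(genotypes):
    """Minor allele frequency from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def hwe_p_value(genotypes):
    """1-df chi-square test of Hardy-Weinberg equilibrium (no continuity correction)."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is not informative
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=0.001):
    """Apply the three filters in order; thresholds are illustrative assumptions."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_freq(genotypes) >= min_maf
            and hwe_p_value(genotypes) >= min_hwe_p)

in_hwe = [0] * 25 + [1] * 50 + [2] * 25   # genotype counts match HWE exactly
all_het = [1] * 100                        # excess heterozygotes, e.g. a calling artifact
print(passes_qc(in_hwe), passes_qc(all_het))
```

In case-control data the HWE filter is often applied to controls only, since a true association can itself distort genotype frequencies among cases.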
19Data Analysis
- Single SNP analysis using pre-specified genetic
models
- 2 x 3 table (2-df)
- Additive model (1-df), and test for additivity
- All possible genetic models (recessive, dominant)
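As a rough sketch of the single-SNP tests listed above, the code below computes the 2-df Pearson chi-square on the 2 x 3 genotype table, the 1-df Cochran-Armitage trend test (a standard test for the additive model), and their difference as a 1-df test of departure from additivity. The function names, the 0/1/2 scores, and the example counts are assumptions for illustration; the slide does not specify which implementation the course used.

```python
import math

def genotype_chi2_2df(cases, controls):
    """Pearson chi-square on the 2 x 3 genotype table; returns (chi2, p), 2 df."""
    n = sum(cases) + sum(controls)
    col = [cases[j] + controls[j] for j in range(3)]
    chi2 = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for j in range(3):
            expected = row_total * col[j] / n
            chi2 += (row[j] - expected) ** 2 / expected
    return chi2, math.exp(-chi2 / 2)  # chi-square upper tail for 2 df

def trend_test_1df(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test (additive model); returns (chi2, p), 1 df."""
    n = sum(cases) + sum(controls)
    r = sum(cases)
    col = [cases[j] + controls[j] for j in range(3)]
    s_cases = sum(s * cases[j] for j, s in enumerate(scores))
    s1 = sum(s * col[j] for j, s in enumerate(scores))
    s2 = sum(s * s * col[j] for j, s in enumerate(scores))
    chi2 = n * (n * s_cases - r * s1) ** 2 / (r * (n - r) * (n * s2 - s1 ** 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail for 1 df

def additivity_test_1df(cases, controls):
    """Departure from additivity: 2-df chi-square minus the 1-df trend chi-square."""
    chi2 = max(0.0, genotype_chi2_2df(cases, controls)[0]
               - trend_test_1df(cases, controls)[0])
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical genotype counts (AA, Aa, aa) in cases and controls.
cases, controls = [30, 50, 20], [50, 40, 10]
print(genotype_chi2_2df(cases, controls))
print(trend_test_1df(cases, controls))
print(additivity_test_1df(cases, controls))
```

Dominant and recessive models can be tested with the same trend function by passing scores (0, 1, 1) or (0, 0, 1). Each extra model tested adds to the multiple-testing burden.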
20Data Analysis
- Haplotype analysis
- Gene-gene and gene-environment interactions
- Interaction with main effect
- Logistic regression
- Interaction without main effect: data mining
- Classification and regression tree (CART)
- Multifactor Dimensionality Reduction (MDR)
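For the logistic-regression approach to interaction listed above, a small worked sketch may help. It uses a simplification rather than fitting a model: with a binary genotype (carrier or non-carrier) and a binary exposure, the interaction coefficient of the saturated logistic model equals the log of the ratio of the two stratum-specific odds ratios, and its standard error is the square root of the summed reciprocals of the eight cell counts. The function names and all counts are hypothetical.

```python
import math

def odds_ratio(case_carrier, case_noncarrier, ctrl_carrier, ctrl_noncarrier):
    """Genotype odds ratio from a 2 x 2 table."""
    return (case_carrier * ctrl_noncarrier) / (case_noncarrier * ctrl_carrier)

def multiplicative_interaction(counts):
    """Ratio of genotype ORs (exposed / unexposed) with a Wald test.

    counts[(g, e)] = (n_cases, n_controls) for genotype g and exposure e in {0, 1}.
    Returns (interaction_or, two_sided_p).
    """
    or_unexposed = odds_ratio(counts[(1, 0)][0], counts[(0, 0)][0],
                              counts[(1, 0)][1], counts[(0, 0)][1])
    or_exposed = odds_ratio(counts[(1, 1)][0], counts[(0, 1)][0],
                            counts[(1, 1)][1], counts[(0, 1)][1])
    ratio = or_exposed / or_unexposed
    # Standard error of the log ratio: all eight cells contribute 1/n.
    se = math.sqrt(sum(1 / n for pair in counts.values() for n in pair))
    z = math.log(ratio) / se
    return ratio, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: genotype OR is 2 in the unexposed and 4 in the exposed.
counts = {
    (0, 0): (100, 200), (1, 0): (100, 100),
    (0, 1): (100, 200), (1, 1): (200, 100),
}
print(multiplicative_interaction(counts))
```

A ratio of 1 means the joint effect is exactly multiplicative, that is, no interaction on the logistic scale. Covariate adjustment or a 0/1/2 genotype coding requires fitting the regression model itself.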
21Sample size needs as a function of genotype
prevalence and OR for main effects
Boffetta, 2007
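The relationship named in this slide title can be approximated with the standard sample-size formula for an unmatched case-control study with equal numbers of cases and controls. The sketch below is an assumption about how such a curve could be computed, not a reconstruction of the original figure. The function name, the two-sided alpha of 0.05, and the 80% power default are illustrative choices.

```python
import math
from statistics import NormalDist

def cases_needed(genotype_prevalence, odds_ratio, alpha=0.05, power=0.80):
    """Cases required (with an equal number of controls) to detect a main effect.

    genotype_prevalence: frequency of the at-risk genotype among controls.
    odds_ratio: odds ratio to be detected.
    """
    p0 = genotype_prevalence
    # Genotype prevalence among cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)

for prevalence in (0.05, 0.30):
    for oratio in (1.2, 1.5, 2.0):
        print(prevalence, oratio, cases_needed(prevalence, oratio))
```

Passing a genome-wide significance level such as `alpha=1e-7` instead of 0.05 shows how sharply the required sample size rises once multiple testing is taken into account.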
23False Positives
- False positives: too many dependent tests
- Adjust for number of tests
- Bonferroni correction
- Nominal significance level = study-wide
significance level / number of tests
- Nominal significance level = 0.05/500,000 = 10^-7
- Effective number of tests
- Take LD into account
- Permutation procedure
- Permute case-control status
- Mimic the actual analyses
- Obtain empirical distribution of maximum test
statistic under null hypothesis
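The Bonferroni arithmetic and the permutation procedure above can be sketched as follows. This is an illustrative outline only: the test statistic (absolute difference in mean allele count between cases and controls) is a stand-in for whatever statistic the actual analysis uses, and the function names, the number of permutations, and the toy data are assumptions.

```python
import random

def bonferroni_threshold(study_wide_alpha, n_tests):
    """Nominal (per-test) significance level under Bonferroni correction."""
    return study_wide_alpha / n_tests

def dose_difference(genotypes, status):
    """Stand-in statistic: |mean allele count in cases - mean in controls|."""
    cases = [g for g, s in zip(genotypes, status) if s == 1]
    controls = [g for g, s in zip(genotypes, status) if s == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def max_stat_adjusted_p(snps, status, stat=dose_difference, n_perm=1000, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum statistic."""
    rng = random.Random(seed)
    observed = [stat(genotypes, status) for genotypes in snps]
    shuffled = list(status)
    null_max = []
    for _ in range(n_perm):
        # Permute case-control status only; genotypes stay fixed, so the LD
        # (dependence) between SNPs is preserved in every permuted data set.
        rng.shuffle(shuffled)
        null_max.append(max(stat(genotypes, shuffled) for genotypes in snps))
    return [(1 + sum(m >= obs for m in null_max)) / (n_perm + 1)
            for obs in observed]

print(bonferroni_threshold(0.05, 500_000))  # about 1e-07

status = [1] * 20 + [0] * 20
snps = [
    [2] * 20 + [0] * 20,   # allele count differs completely between groups
    [0, 1] * 20,           # identical distribution in cases and controls
]
print(max_stat_adjusted_p(snps, status, n_perm=200))
```

Because each permutation re-runs the same statistic on all SNPs, the procedure accounts for correlated tests without needing an explicit estimate of the effective number of tests. The cost is computation time, which is substantial at the 500K scale.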
24False Positives
- False discovery rate (FDR)
- Expected proportion of false discoveries among
all discoveries
- Offers more power than Bonferroni
- Holds under weak dependence of the tests
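One common way to control the FDR is the Benjamini-Hochberg step-up procedure. The slide does not name a specific procedure, so treating it as the intended one is an assumption. The sketch below applies it to a short list of made-up p-values and compares the result with Bonferroni.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is declared a discovery.

    Step-up rule: find the largest rank k with p_(k) <= k * q / m and reject
    the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Made-up p-values for ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
fdr_hits = benjamini_hochberg(p_values, q=0.05)
bonferroni_hits = [p <= 0.05 / len(p_values) for p in p_values]
print(sum(fdr_hits), sum(bonferroni_hits))
```

In this toy example the FDR rule declares two discoveries where Bonferroni declares one, which illustrates the gain in power noted above.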
25False Positives
- Bayesian approach
- Taking the prior probability of a true association
into account: False-Positive Report Probability (FPRP)
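The FPRP is usually written as alpha(1 - pi) / (alpha(1 - pi) + (1 - beta)pi), where alpha is the significance level reached, 1 - beta is the power, and pi is the prior probability that the association is real. A minimal sketch follows. The function name and the example priors are assumptions chosen to contrast a random SNP with a strong candidate.

```python
def fprp(alpha, power, prior):
    """False-Positive Report Probability.

    alpha: significance level at which the finding was declared.
    power: probability of detecting the association if it is real (1 - beta).
    prior: prior probability that the association is real.
    """
    false_pos = alpha * (1 - prior)
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)

# Same p-value threshold and power, very different priors.
print(round(fprp(alpha=0.05, power=0.8, prior=0.001), 3))  # random SNP
print(round(fprp(alpha=0.05, power=0.8, prior=0.5), 3))    # strong candidate
```

With a low prior, as for an arbitrary SNP in a genome-wide scan, a nominally significant result at 0.05 is still almost certainly a false positive. This is why much smaller significance levels and replication are needed.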
26Confirmation in independent study populations
- The approach may limit the number of false
positives
- Confirmation is needed to distinguish true from
false positives
- Replication: examine the results from the 2nd
stage only
- Joint analysis: combine data from the 1st stage
with the 2nd stage
- Multiple stages
28Issues of GWAS
- Population stratification
- Multiple testing: false positives
- Gene-environment interaction
- High Costs
29Kingsmore, 2008
30Kingsmore, 2008
32GWAS
33Proposed GWAS of Lung Cancer among Non-smokers
34Motives and Conceptual Framework For Study of
Genetic Susceptibility to Lung Cancer among
Non-smokers
- About 16% of male smokers and 10% of female
smokers will eventually develop lung cancer,
which suggests that exposure to other environmental
carcinogens and individual genetic susceptibility
may play an important role in lung cancer among
non-smokers.
- It is suggested that 26% of lung cancers are
associated with genetic susceptibility
(Lichtenstein P, et al. NEJM, 2000).
- We hypothesize that variation in genetic
susceptibility, in the form of single nucleotide
polymorphisms (SNPs) in genes of the inflammation,
DNA repair, and cell cycle control pathways, may be
important in the development of lung cancer among
non-smokers.
36(Figure labels: DNA damage repaired; defective DNA repair gene; if DNA damage not repaired; G0; if cell cycle control is lost)
37500K SNP Coverage
- Median intermarker distance: 3.3 kb
- Mean intermarker distance: 5.4 kb
- Average heterozygosity: 0.30
- Average minor allele frequency: 0.22
- SNPs in genes: 196,384
- 80% of genome within 10 kb of a SNP
38Figure 1. The effects of SNPs on the Risk of Lung
Cancer among Smokers and Non-smokers
OR
39Hypothesis
- The overall hypothesis is that multiple sequence
variants in the genome are associated with the
risk of lung cancer among non-smokers.
Specifically, we hypothesize that a number of
common SNPs that modify lung cancer risk in
non-smokers are in strong LD with the SNPs arrayed
on the 500K GeneChip.
43Specific Aims
- Aim 1. To perform exploratory tests for
association between 500K SNPs across the genome
and lung cancer risk among 200 non-smoking lung
cancer patients and 200 controls.
- Aim 2. To perform the first stage of confirmatory
association tests between lung cancer risk and
more than 1,000 SNPs implicated in Aim 1 among an
independent set of 600 pairs of cases and
controls.
44Specific Aims
- Aim 3. To perform the second stage of confirmatory
association tests between lung cancer risk and
more than 500 SNPs that were replicated in Aim 2
among an additional 600 cases and 600 controls.
Additional SNPs will also be added from our
ongoing pathway-specific analyses of DNA repair,
cell cycle regulation, inflammation and metabolic
pathways based on non-smokers in our lung cancer
study.
- Aim 4. To perform fine-mapping association
studies in the flanking regions of each of the
30-100 SNPs confirmed in Aim 3 among the entire
1,400 cases and 1,400 controls. The large number
of non-smoking lung cancer cases in this
study population also allows us to identify SNPs
that are associated with risk of the disease
among non-smokers.
45Specific Aims
- Aim 5. To explore the generalizability of the
SNPs identified in Specific Aims 1-4 within a
Chinese population of 600 nonsmoking lung cancer
cases and 600 nonsmoking controls. The relatively
homogeneous Chinese population not only allows us
to further confirm the associations, but also
improves our ability to finely map the SNPs
associated with lung cancer risk among
non-smokers.
46Discussion: Costs
- Affy 500K SNP chip: 1,000/case
- 2,000 x 1,000 = 2 M
- 1,000 x 1,000 = 1 M
- 500 x 1,000 = 0.5 M
- 500 x 3,000 (SNPs) x 0.15 = 225,000
- 500 x 30 (SNPs) x 0.15 = 2,250
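The cost arithmetic above can be checked with a short sketch. The prices are the ones given in the slide (1,000 per subject for the 500K chip and 0.15 per SNP genotype for custom follow-up). The currency is not stated in the slide, and the function names are hypothetical.

```python
def chip_cost(n_subjects, price_per_chip=1_000):
    """Genome-wide stage: one 500K chip per subject."""
    return n_subjects * price_per_chip

def custom_genotyping_cost(n_subjects, n_snps, price_per_genotype=0.15):
    """Follow-up stage: each subject is genotyped for a selected set of SNPs."""
    return n_subjects * n_snps * price_per_genotype

for n in (2_000, 1_000, 500):
    print(n, chip_cost(n))
print(round(custom_genotyping_cost(500, 3_000)))
print(round(custom_genotyping_cost(500, 30)))
```

The comparison shows why staged designs are attractive: genotyping a few thousand selected SNPs in follow-up samples costs far less per subject than running the full chip on everyone.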