Title: Molecular
1Molecular Genetic Epi 217: Association Studies
Direct
2Muddy Points?
- Linkage results are reported as a maximum lod score, for a given value of theta.
- The first indicates the strength of linkage; the second indicates how far the marker is from the potentially causal locus.
- Reading: Cordell and Clayton.
3Linkage vs. Association Studies
Rare variants: larger effects
Common variants: modest effects (CD-CV hypothesis)
Source: ROCHE Genetic Education (www)
4Association Studies
- Use of association studies is rapidly expanding, reflecting a number of laudable properties, including their
  - Ease, since one need not collect large pedigrees, and
  - Potential for being more powerful than conventional linkage-based approaches.
5Linkage vs. Association
Risch & Merikangas, Science 1996
6Association Study Approaches
- Candidate genes
- Functional
- All common variants
- All common variants in genome (GWAS)
- All SNPs in genome
- Expensive
7Direct and Indirect Association
Direct Association
Indirect Association
Ability to undertake indirect association depends
on the LD / correlation among measured and
unmeasured variants (i.e., tagging and
coverage).
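To make the role of LD concrete, the following is a minimal sketch of the usual pairwise correlation measure r2 between a measured and an unmeasured biallelic variant. The function name and the example frequencies are hypothetical and chosen only for illustration.

```python
def ld_r2(p_ab, p_a, p_b):
    """r^2 between two biallelic loci.

    p_a, p_b: frequencies of allele A at locus 1 and allele B at locus 2.
    p_ab: frequency of the haplotype carrying both A and B.
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical frequencies: perfect correlation versus independence.
r2_perfect = ld_r2(p_ab=0.5, p_a=0.5, p_b=0.5)
r2_independent = ld_r2(p_ab=0.25, p_a=0.5, p_b=0.5)
print(r2_perfect, r2_independent)  # 1.0 0.0
```

When r2 is near 1 the measured variant is a near-perfect proxy for the unmeasured one; when it is near 0 an indirect study carries little information about the unmeasured variant.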
8Control Selection
- A critical aspect of association studies is that controls should be selected from the cases' source population.
- That is, controls should be those individuals who, if they were diseased, would become cases.
9Population Stratification
- Confounding bias that may occur if one's sample is comprised of sub-populations with different
  - allele frequencies (?) and
  - disease rates (RpR).
- Cases are more likely than controls to arise from the sub-population with the higher baseline disease rate.
- Cases and controls will have different allele frequencies regardless of whether the locus is causal.
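The argument above can be checked with a small calculation. This is a sketch under stated assumptions: two equally sized sub-populations, a variant with no effect on risk within either one, and made-up carrier frequencies and disease risks. The function name is hypothetical.

```python
def pooled_odds_ratio(strata):
    """Crude carrier-vs-non-carrier odds ratio after pooling sub-populations.

    Each stratum is (population_share, carrier_frequency, disease_risk).
    Within a stratum, carriers and non-carriers have the same risk, so the
    variant is non-causal by construction.
    """
    case_carrier = case_non = ctrl_carrier = ctrl_non = 0.0
    for share, carrier_freq, risk in strata:
        case_carrier += share * carrier_freq * risk
        case_non += share * (1 - carrier_freq) * risk
        ctrl_carrier += share * carrier_freq * (1 - risk)
        ctrl_non += share * (1 - carrier_freq) * (1 - risk)
    return (case_carrier * ctrl_non) / (case_non * ctrl_carrier)


# Hypothetical sub-populations differing in both carrier frequency and risk.
strata = [(0.5, 0.10, 0.01), (0.5, 0.40, 0.05)]
crude_or = pooled_odds_ratio(strata)
print(round(crude_or, 2))  # 1.64
```

The pooled odds ratio is well above 1 even though the stratum-specific odds ratios are exactly 1. With a single stratum the same function returns 1.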
10Example of Population Stratification
Cardon & Palmer, 2003
11Bias Due to Popln. Stratification
Witte et al. Am J Epidemiol 1999
12Family-Based Association Studies
[Figure: pedigree diagrams of family-based designs, with genotyped (G) siblings, parents, and cousins.]
13Transmission Disequilibrium Test (TDT)
- Transmitted alleles vs. non-transmitted alleles
[Figure: trio with parental genotypes M1 M2 and M2 M2; child genotype M1 M2.]
14TDT
- Transmitted alleles vs. non-transmitted alleles
$\mathrm{TDT} = (n_{12} - n_{21})^2 / (n_{12} + n_{21})$
Asymptotically $\chi^2$ with 1 degree of freedom
15TDT
$\mathrm{TDT} = (1 - 0)^2 / (1 + 0) = 1$
p-value = 0.32
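The single-trio calculation can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the p-value uses the identity that the upper tail of a 1 d.f. chi-square at x equals erfc(sqrt(x / 2)).

```python
import math


def tdt(n12, n21):
    """TDT statistic and asymptotic p-value (chi-square, 1 d.f.).

    n12, n21: counts of the two kinds of transmission from heterozygous
    parents (allele 1 transmitted and allele 2 not, and the reverse).
    """
    if n12 + n21 == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    stat = (n12 - n21) ** 2 / (n12 + n21)
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p = tdt(1, 0)
print(stat, round(p, 2))  # 1.0 0.32
```

Only heterozygous parents contribute, so the M2 M2 parent in the example trio adds nothing to either count.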
16Comparison of Designs
- Family-based designs can be less efficient than
population-based designs.
Design             Rare Recessive   Common       Rare Dominant
                   (High Risk)      (Low Risk)   (High Risk)
Population-based   100              100          100
Case-sibling       69               51           50
Case-cousin        97               88           88
TDT                231              102          101
Witte et al. Am J Epidemiol 1999
- Further, family-based designs can require more recruitment efforts.
- How about extending the designs to include unrelateds?
17Genomic Control
- Use population-based design, but incorporate into analysis genomic information to adjust for population stratification.
- Genomic control adjusts test statistics for outliers due to population stratification.
- Use unlinked genetic markers.
18Genomic Control
- For the gene(s) of interest, alter the test statistic(s) from case-control comparison:
  - $\chi^2_{\text{new}} = \chi^2 / \lambda$
  - where $\lambda = \text{mean}(\chi^2_1, \ldots, \chi^2_k)$
  - or $\lambda = \text{median}(\chi^2_1, \ldots, \chi^2_k) / 0.456$
  - $1, \ldots, k$ index the $\chi^2$ tests for the unlinked markers.
  - (Devlin & Roeder, 1999; Reich & Goldstein, 2000)
- That is, one decreases the test statistic by a factor ($\lambda$) that reflects stratification in the population.
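A minimal sketch of this adjustment follows, assuming the chi-square statistics for the unlinked markers are already available as a list. The function name and the example statistics are hypothetical.

```python
from statistics import mean, median


def genomic_control(chi2_candidate, chi2_unlinked, method="median"):
    """Divide a candidate-locus chi-square by the inflation factor lambda.

    chi2_unlinked: 1 d.f. chi-square statistics from unlinked markers.
    method: "mean" or "median" (median is divided by 0.456).
    """
    if method == "mean":
        lam = mean(chi2_unlinked)
    elif method == "median":
        lam = median(chi2_unlinked) / 0.456
    else:
        raise ValueError("method must be 'mean' or 'median'")
    return chi2_candidate / lam, lam


# Hypothetical statistics from five unlinked markers.
unlinked = [0.2, 0.5, 0.6, 0.9, 2.0]
adjusted, lam = genomic_control(10.0, unlinked, method="median")
print(round(adjusted, 2), round(lam, 3))  # 7.6 1.316
```

In practice many more than five unlinked markers would be used; the short list here only keeps the arithmetic easy to follow.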
19Continuum of Assoc Study Designs
Population-based
Ethnicity Matched
Structured Assoc
Family-based
Population Stratification
Overmatching
(Bias versus efficiency)
- ? Sharing of genes envt.
- Efficiency
- Also, recruitment issues
20Candidate Gene Studies
- Selection of candidates: Linkage regions? Biological support?
- I am interested in a candidate gene and have samples ready to study. What SNPs do I genotype?
21Candidate Gene: Where do I Start?
- Location
- What chromosome? What position on the chr?
- Exons/UTR
- How many exons? UTR regions?
- Size
- How large is the gene?
22Candidate Gene Example: MTHFR (thanks to I. Cheng)
- UCSC Genome Browser
- http://genome.ucsc.edu/cgi-bin/hgGateway
23Candidate Gene Example: MTHFR
[Figure: MTHFR gene structure, with the 3' and 5' ends labeled.]
24SNP Picking: Things to Consider
- Validation: What is the quality of the SNPs?
- Informativity: Are these SNPs informative in my population? How common are they? Location?
- Potentially Functional: Do these SNPs have a potential biological impact? Missense variants?
- Previously Associated: Have previous studies found SNPs in the candidate gene associated with the outcome?
25SNP Picking: Database Resources
- Validation: dbSNP
  - http://www.ncbi.nlm.nih.gov/projects/SNP/
- Informative: dbSNP
  - http://www.ncbi.nlm.nih.gov/projects/SNP/
- Potentially Functional: dbSNP
  - http://www.ncbi.nlm.nih.gov/projects/SNP/
- Previously Associated: PubMed/OMIM
  - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
  - http://0-www.ncbi.nlm.nih.gov.library.vu.edu.au/entrez/query.fcgi?db=OMIM
26SNP Picking: Other Resources
- UCSC Genome Browser
  - http://genome.ucsc.edu/
- SNPper
  - http://snpper.chip.org/
- Seattle SNPs
  - http://pga.gs.washington.edu/
- HapMap
  - http://www.hapmap.org/
27SNP Picking: Validation
28SNP Picking: Validation
29SNP Picking: Validation
30SNP Picking: Informative
31SNP Picking: Potentially Functional
C677T
32SNP Picking: Previously Associated
33MTHFR Summary
- Chromosome 1: 11,780,053-11,800,381
- Size: 20,329 bp
- Exons: 12
- Potentially Functional
  - 5 missense, of which 3 have MAF ≥ 5%
- Previously Associated
  - 3 (C677T, A1298C, A2756G)
34MTHFR SNPs
http://genome.ucsc.edu/cgi-bin/hgGateway
35Analysis
Simple chi-square test comparing genotype frequencies (2 d.f.).
Called a model-free or co-dominant analysis.
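As an illustration, a 2 d.f. genotype test can be computed directly from a 2 x 3 table of counts. This is a minimal sketch with hypothetical counts and a hypothetical function name; it assumes every genotype column has a nonzero total, and it uses the fact that the upper tail of a 2 d.f. chi-square at x is exp(-x / 2).

```python
import math


def genotype_chi_square(cases, controls):
    """Pearson chi-square (2 d.f.) for case/control genotype counts.

    cases, controls: counts ordered as (GG, GT, TT).
    """
    table = [cases, controls]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    p_value = math.exp(-stat / 2)  # chi-square survival function, 2 d.f.
    return stat, p_value


# Hypothetical genotype counts (GG, GT, TT).
cases = (40, 45, 15)
controls = (50, 40, 10)
stat2, p2 = genotype_chi_square(cases, controls)
print(stat2, p2)
```

No genetic model is imposed here: the three genotypes are treated as unordered categories.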
36Genetic Model
ORs depend on genetic model (assuming positive association):
- $R = r = 1$: not a risk allele
- $R > r = 1$: recessive
- $R = r > 1$: dominant
- $R = r^2 > 1$: log additive

Genotype   OR
GG         1
GT         r
TT         R
37Tests of association
- If genetic model known
  - Collapse genotypes into 2x2 table, 1 d.f. test
  - Trend test for log additive
  - (Use logistic regression)
- Rarely know genetic model
  - Use all three models (dom, rec, log additive)
  - Compare fit with the co-dominant (2 d.f.) model (LR test)
  - Cannot use LR test to compare models with each other, as they are not nested
  - Model with best fit and smallest P is best?
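The 1 d.f. tests listed above can be sketched without a regression package. The example below collapses genotype counts for the dominant and recessive models, and uses the Cochran-Armitage trend test as a stand-in for the log additive model; that is a substitution for the logistic regression mentioned on the slide, not the same fit. Function names and counts are hypothetical, and T is assumed to be the risk allele.

```python
import math


def chi2_1df_p(stat):
    """Upper-tail p-value of a 1 d.f. chi-square statistic."""
    return math.erfc(math.sqrt(stat / 2))


def collapsed_test(cases, controls, model):
    """1 d.f. chi-square after collapsing (GG, GT, TT) counts into a 2x2 table."""
    if model == "dominant":      # GT and TT both at increased risk
        a, b = cases[1] + cases[2], cases[0]
        c, d = controls[1] + controls[2], controls[0]
    elif model == "recessive":   # only TT at increased risk
        a, b = cases[2], cases[0] + cases[1]
        c, d = controls[2], controls[0] + controls[1]
    else:
        raise ValueError("model must be 'dominant' or 'recessive'")
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_1df_p(stat)


def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test with allele-count scores."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    col = [r + s for r, s in zip(cases, controls)]
    sum_wr = sum(w * r for w, r in zip(scores, cases))
    sum_wn = sum(w * m for w, m in zip(scores, col))
    sum_w2n = sum(w * w * m for w, m in zip(scores, col))
    num = n * (n * sum_wr - n_cases * sum_wn) ** 2
    den = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    stat = num / den
    return stat, chi2_1df_p(stat)


# Hypothetical genotype counts (GG, GT, TT).
case_counts = (40, 45, 15)
control_counts = (50, 40, 10)
dom_stat, dom_p = collapsed_test(case_counts, control_counts, "dominant")
rec_stat, rec_p = collapsed_test(case_counts, control_counts, "recessive")
trend_stat, trend_p = trend_test(case_counts, control_counts)
print(dom_stat, rec_stat, trend_stat)
```

Running all three models on the same data and reporting the smallest p-value is itself a form of multiple testing, which is one reason the slide ends with a question mark.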
39Molecular Genetic Epi 217: Association Studies
Indirect
40Linkage Disequilibrium: Like Shuffling a Deck of Cards
41Too many MTHFR SNPs. Solution: Tag SNP Selection
- SNPs are correlated (aka Linkage Disequilibrium)
Pairwise Tagging: SNP 1, SNP 3, SNP 6 (3 tags in total)
Test for association: SNP 1, SNP 3, SNP 6
Carlson et al. (2004) AJHG 74:106
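Pairwise tagging can be illustrated with a simplified greedy procedure. This is a sketch in the spirit of the approach cited above, not a reimplementation of it: the function name, the r2 threshold of 0.8, and the pairwise r2 values are all assumptions made for illustration.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy pairwise tagging.

    r2: dict mapping each SNP to a dict of pairwise r^2 values with other
        SNPs (missing pairs are treated as r^2 = 0).
    Repeatedly picks the untagged SNP that captures the most untagged SNPs
    (itself included) at r^2 >= threshold.
    """
    untagged = set(r2)
    tags = []

    def captured(snp):
        return {t for t in untagged
                if t == snp or r2[snp].get(t, 0.0) >= threshold}

    while untagged:
        best = max(sorted(untagged), key=lambda s: len(captured(s)))
        tags.append(best)
        untagged -= captured(best)
    return tags


# Hypothetical pairwise r^2 values for six SNPs.
r2 = {
    "SNP1": {"SNP2": 0.90},
    "SNP2": {"SNP1": 0.90},
    "SNP3": {"SNP4": 0.95, "SNP5": 0.85},
    "SNP4": {"SNP3": 0.95, "SNP5": 0.60},
    "SNP5": {"SNP3": 0.85, "SNP4": 0.60},
    "SNP6": {},
}
tags = greedy_tag_snps(r2)
print(sorted(tags))  # ['SNP1', 'SNP3', 'SNP6']
```

With these made-up values, three tags capture all six SNPs, so only the three tags need to be genotyped and tested.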
42Tagging SNPs
Existence of haplotype blocks across genome
High LD → some SNPs redundant → tagging SNPs recover majority of information
43Coverage Measurement Error in TagSNPs
44Common Measures of Coverage
- Threshold Measures
  - e.g., 73% of SNPs in the complete set are in LD with at least one SNP in the genotyping set at $r^2 \ge 0.8$
- Average Measures
  - e.g., Average maximum $r^2 = 0.84$
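Both kinds of coverage measure can be computed from the same input: for each SNP in the complete set, its maximum r2 with any SNP in the genotyping set. The sketch below assumes that list is already available; the function name, threshold, and values are hypothetical.

```python
def coverage_measures(max_r2, threshold=0.8):
    """Threshold coverage and average maximum r^2.

    max_r2: for each SNP in the complete set, the largest r^2 it has with
            any SNP in the genotyping set.
    """
    values = list(max_r2)
    threshold_coverage = sum(v >= threshold for v in values) / len(values)
    average_max_r2 = sum(values) / len(values)
    return threshold_coverage, average_max_r2


# Hypothetical maximum r^2 values for five SNPs in the complete set.
threshold_cov, avg_max = coverage_measures([1.0, 0.9, 0.85, 0.6, 0.4])
print(threshold_cov, avg_max)
```

The two summaries can disagree: a panel can have a high average maximum r2 while still leaving a sizeable fraction of SNPs below the threshold.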
45Coverage and Sample Size
- Sample size required for Direct Association: $n$
- Sample size for Indirect Association: $n^* = n / r^2$
  - For $r^2 = 0.8$, increase is 25%
  - For $r^2 = 0.5$, increase is 100%
- But $n^*$ (and power) not a simple function of
  - 1 / (threshold $r^2$) or
  - 1 / (average maximum $r^2$).
Jorgenson & Witte, AJHG 2006
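The two percentages follow from dividing the direct-association sample size by r2. A minimal sketch, with a hypothetical function name and a hypothetical direct sample size of 1000:

```python
def indirect_sample_size(n_direct, r2):
    """Sample size needed when testing a tag in LD (r^2) with the causal SNP."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return n_direct / r2


n_direct = 1000  # hypothetical sample size for direct association
for r2_value in (0.8, 0.5):
    n_star = indirect_sample_size(n_direct, r2_value)
    increase = 100 * (n_star - n_direct) / n_direct
    print(r2_value, round(n_star), f"{increase:.0f}%")
```

This applies to a single tag and causal SNP pair; as the slide notes, plugging a panel-wide coverage summary into the same formula does not give the right answer.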
46Tag SNPs Database Resources
http://www.hapmap.org
http://gvs.gs.washington.edu/GVS/index.jsp
47The HapMap Project
- Initial Goal
- 600,000 SNPs for indirect association studies
- LD information between SNPs
- Phase 1: 1 million SNPs
- Phase 2: additional 2.9 million SNPs
48HapMap
- SNPs from dbSNP were genotyped
- Looked for 1 every 5kb
- SNP Validation
- Polymorphic
- Frequency
- Linkage Disequilibrium Estimation
- LD tagging SNPs
49HapMap
- 270 subjects
  - 45 Chinese
  - 45 Japanese
  - 90 Yoruban and 90 European-American
    - 30 trios each (2 parents, 1 child)
50Tag SNPs HapMap
51Tag SNPs HapMap
52Tag SNPs HapMap Haploview
http://www.broad.mit.edu/mpg/haploview/
53Tag SNPs HapMap Haploview
54Tag SNPs HapMap Haploview
55Tag SNPs HapMap Haploview
56Tag SNPs HapMap Haploview
57Tag SNPs HapMap Summary
- Identified 33 common MTHFR SNPs (MAF ≥ 5%) among Caucasians
- Forced in 3 potentially functional/previously associated SNPs
- Identified tags based on pairwise tagging
- 15 tag SNPs could capture all 33 MTHFR SNPs (mean $r^2$ = 97%)
- Note: number of SNPs required varies from gene to gene and from population to population
58Genome-wide Association Studies (GWAS)
59One- and Two-Stage GWA Designs
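[Figure: One-stage design: all SNPs (1, 2, 3, ..., M) genotyped on all samples (1, 2, 3, ..., N). Two-stage design: all SNPs genotyped on a proportion of the samples in Stage 1, then a proportion of the markers genotyped on the remaining samples in Stage 2.]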
60One-Stage Design
[Figure: one-stage design (SNPs by samples) compared with a two-stage design analyzed either by replication-based analysis or by joint analysis of Stage 1 and Stage 2.]
61Multistage Designs
- Joint analysis has more power than replication
- p-value in Stage 1 must be liberal
- Lower cost; do not gain power
- http://www.sph.umich.edu/csg/abecasis/CaTS/index.html
62Complex diseases
[Figure: causal diagram linking physical activity, diet, genetic susceptibility, obesity, hyperlipidemia, diabetes, hypertension, atherosclerosis, and vulnerable plaques to MI.]
Complex diseases: many causes, many causal pathways!
63- Pathways
- Many websites / companies provide dynamic graphic models of molecular and biochemical pathways.
- Example: BioCarta http://www.biocarta.com/
- May be interested in potential joint and/or interaction effects of multiple genes in one pathway.
64Interactions
- The interdependent operation of two or more causes to produce or prevent an effect
- Differences in the effects of one or more factors according to the level of the remaining factor(s)
- Last, 2001
65Why look for interactions?
- Improve detection of genetic (& environmental) risks.
- Understand etiology/biology
- New hypotheses?
- Diagnostics
- Prevention and interventions
66Dilution of effects
Gene A: OR = 1.5
67Statistical vs. Biological Interactions
- Not identical.
- One hypothesizes biological interaction
- But tests for statistical interaction
- Does statistical evidence support our biological
hypothesis?
68Multiplicative vs. Additive Interactions
RER = relative excess risk
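The two scales can be compared numerically. This sketch assumes three odds ratios relative to a common reference group (neither factor present) and treats them as approximations to relative risks; the additive measure is written in the standard relative-excess-risk-due-to-interaction form. The function name and example values are hypothetical.

```python
def interaction_measures(or_a, or_b, or_ab):
    """Interaction on the multiplicative and additive scales.

    or_a:  odds ratio for factor A alone
    or_b:  odds ratio for factor B alone
    or_ab: odds ratio for both factors together
    All relative to the group with neither factor.
    """
    multiplicative = or_ab / (or_a * or_b)   # 1 means no multiplicative interaction
    relative_excess = or_ab - or_a - or_b + 1  # 0 means no additive interaction
    return multiplicative, relative_excess


# Hypothetical odds ratios.
mult, rer = interaction_measures(or_a=1.5, or_b=2.0, or_ab=3.0)
print(mult, rer)  # 1.0 0.5
```

In this made-up example the joint effect is exactly multiplicative, yet there is a positive departure from additivity, which shows that the presence of interaction depends on the scale chosen.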
69Two possible causal pathways: additive and multiplicative interaction for colorectal cancer
If factors are not known to act independently, use multiplicative.
Brennan, P. Carcinogenesis 2002; 23:381-387
70Analysis of Multiple Genes
- Joint / Additive
- Multiplicative
- Increasing complexity
71More Complex Modeling
- Multifactor-dimensionality reduction
  - (Moore & Williams, Ann Med 2002)
- Logic regression
  - (Kooperberg & Ruczinski, Genetic Epi 2005)
- Multi-loci analysis
  - (Marchini, Donnelly, & Cardon, Nat Genet 2005)
- Bayesian epistasis association mapping
  - (Zhang & Liu, Nat Genet 2007)
72Pathway Analysis
- Wang et al. (AJHG 2007 in press)
  - Calculate SNP associations.
  - Assign each gene the min association p-value for typed genic SNPs.
  - Test if genes within particular pathways have disproportionately small minimum p-values (i.e., large maximum test statistics).
  - Such candidate genes are high priority.
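The gene-level step and a pathway-level test can be sketched as follows. This is a deliberately simplified stand-in, not the method of the cited paper: it assigns each gene its minimum SNP p-value, then applies a hypergeometric over-representation test with an arbitrary cutoff. It also ignores the fact that genes with more typed SNPs tend to have smaller minimum p-values by chance. All names and values are hypothetical.

```python
from math import comb


def gene_min_p(snp_pvalues, snp_to_gene):
    """Assign each gene the minimum p-value among its typed SNPs."""
    gene_p = {}
    for snp, p in snp_pvalues.items():
        gene = snp_to_gene.get(snp)
        if gene is not None:
            gene_p[gene] = min(p, gene_p.get(gene, 1.0))
    return gene_p


def pathway_enrichment_p(gene_p, pathway_genes, cutoff=0.05):
    """Hypergeometric upper-tail p-value for over-representation.

    Asks whether the pathway contains more genes with min p <= cutoff
    than expected when drawing genes at random.
    """
    genes = set(gene_p)
    in_pathway = genes & set(pathway_genes)
    significant = {g for g in genes if gene_p[g] <= cutoff}
    n_genes, n_sig, n_path = len(genes), len(significant), len(in_pathway)
    k = len(significant & in_pathway)
    tail = sum(comb(n_sig, i) * comb(n_genes - n_sig, n_path - i)
               for i in range(k, min(n_sig, n_path) + 1))
    return tail / comb(n_genes, n_path)


# Hypothetical SNP p-values and SNP-to-gene map.
snp_p = {"rs1": 0.001, "rs2": 0.40, "rs3": 0.03, "rs4": 0.50,
         "rs5": 0.70, "rs6": 0.20, "rs7": 0.90}
snp_gene = {"rs1": "G1", "rs2": "G1", "rs3": "G2", "rs4": "G3",
            "rs5": "G4", "rs6": "G5", "rs7": "G6"}
gene_p = gene_min_p(snp_p, snp_gene)
enrich_p = pathway_enrichment_p(gene_p, pathway_genes={"G1", "G2", "G3"})
print(gene_p["G1"], enrich_p)
```

A permutation-based test that reshuffles case-control labels would be one way to address the gene-size issue, at the cost of more computation.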
73Incorporate Additional Information into Analysis?
- Part of a known pathway?
- Within linkage / association regions?
- Potentially functional?
- Degree of conservation?
- Tagging other SNPs?
- Copy number polymorphism?
- One can incorporate this information with a
hierarchical model.