Title: Pitfalls in Genetic Association Studies [M.Tevfik DORAK]
1Pitfalls in Genetic Association Studies M.
Tevfik DORAK Paediatric and Lifecourse
Epidemiology Research Group Sir James Spence
Institute Newcastle University, U.K.
Clinical Studies Objective Medicine Bodrum,
15-16 April 2006
2Incident or prevalent cases Comparable controls
or convenience samples Population based?
Confounding by ethnicity / Population
stratification Confounding by locus / Linkage
disequilibrium
Statistical power Multiple comparisons
No adjustment for known associations Effect
modification by sex Different genetic models
Overinterpretation LD ruled out? Biological
plausibility (a priori hypothesis)?
Replication / Consistency Publication bias
3Tabor, Risch Myers. Nat Rev Genet 2002 (www)
4Internal Validity of an Association Study Avoid
"BIAS" Control for "CONFOUNDING" Rule out
"CHANCE"
5Internal Validity of an Association Study Avoid
"BIAS" Be careful! Control for "CONFOUNDING" Match
ing, Stratification, MV Analysis Rule out
"CHANCE" Large sample, Replication
6 Special Kinds of Confounding in Genetic
Epidemiology CONFOUNDING by Locus (LD) MV
Analysis, LD Analysis CONFOUNDING by Ethnicity
(Population Stratification) Matching,
Stratification, MV Analysis Family-Based
Association Studies Genomic Controls and
Specially Designed GE Analysis
7 Special Kinds of Confounding in Genetic
Epidemiology CONFOUNDING by Locus (LD) MV
Analysis, LD Analysis CONFOUNDING by Ethnicity
(Population Stratification) Matching,
Stratification, MV Analysis Family-Based
Association Studies Genomic Controls and
Specially Designed GE Analysis
8Mapping Disease Susceptibility Genes by
Association Studies
Plot of minus log of P value for case-control
test for allelic association with AD, for SNPs
immediately surrounding APOE (lt100 kb)
Martin, 2000 (www)
9Example Linkage Disequilibrium
HLA-B47 association with congenital adrenal
hyperplasia (Dupont et al, Lancet 1977) HLA-B14
association with late-onset adrenal hyperplasia
(Pollack et al, Am J Hum Genet 1981) Is
congenital adrenal hyperplasia an immune
system-mediated disease?
10Example Linkage Disequilibrium
HLA-B47 association with congenital adrenal
hyperplasia is due to deletion of CYP21A2 on
HLA-B47DR7 haplotype HLA-B14 association with
late-onset adrenal hyperplasia is due to an exon
7 missense mutation (V281L) in CYP21A2 on
HLA-B14DR1 haplotype
11Preliminary evidence of an association between
HLA-DPB10201 and childhood common ALL supports
an infectious aetiology Leukemia
19959(3)440-3
Evidence that an HLA-DQA1-DQB1 haplotype
influences susceptibility to childhood common ALL
in boys provides further support for an
infection-related aetiology Br J Cancer
199878(5)561-5
Why not LD?
12(No Transcript)
13Publication Bias
Negative studies do not get published - NIH
Genetic Associations Database ?
A different kind of publication bias?
Preliminary evidence of an association between
HLA-DPB10201 and childhood common ALL supports
an infectious aetiology Leukemia
19959(3)440-3
Evidence that an HLA-DQA1-DQB1 haplotype
influences susceptibility to childhood common ALL
in boys provides further support for an
infection-related aetiology Br J Cancer
199878(5)561-5
14 Special Kinds of Confounding in Genetic
Epidemiology CONFOUNDING by Locus (LD) MV
Analysis, LD Analysis CONFOUNDING by Ethnicity
(Population Stratification) Matching,
Stratification, MV Analysis Family-Based
Association Studies Genomic Controls and
Specially Designed GE Analysis
15Wacholder, 2002 (www)
16Population Stratification
Marchini, 2004 (www)
17Cardon Palmer, 2003 (www)
18Internal Validity of an Association Study Avoid
"BIAS" Be careful! Control for "CONFOUNDING" Match
ing, Stratification, MV Analysis Rule out
"CHANCE" Large sample, Replication
19Example Functional Correlation
20(No Transcript)
21Example Replication
HFE-C282Y Association in Childhood
ALL
SCOTTISH GROUP 135 patients - 238 newborns P
0.0004 OR 3.0 (1.7 to 5.4) In cALL P lt
0.0001 OR 4.7 (2.5 to 8.9)
WELSH GROUP 117 patients - 415 newborns P
0.005 OR 2.8 (1.4 to 5.4) In cALL P 0.02
OR 2.9 (1.4 to 6.4)
Dorak et al, Blood 1999
22Palmer LJ. Webcast (www)
23Multiple Comparisons Spurious Associations
Diepstra, Lancet 2005 (www)
24Genetic Models and Case-Control Association Data
Analysis The data may also be analysed assuming
a prespecified genetic model. For example, with
the hypothesis that carrying allele B increased
risk of disease (dominant model), the AB and BB
genotypes are pooled giving a 2x3x2 table. This
is particularly relevant when allele B is rare,
with few BB observations in cases and controls.
Alternatively, under a recessive model for allele
B, cells AA and AB would be pooled. Analysing by
alleles provides an alternative perspective for
case control data. This breaks down genotypes to
compare the total number of A and B alleles in
cases and controls, regardless of the genotypes
from which these alleles are constructed. This
analysis is counter-intuitive, since alleles do
not act independently, but it provides the most
powerful method of testing under a multiplicative
genetic model, where risk of developing a
disease increases by a factor r for each B allele
carried risk r for genotype AB and r2 for
genotype BB. If a multiplicative genetic model is
appropriate, both case and control genotypes
will be in HardyWeinberg equilibrium, and this
can be tested for. A fourth possible genetic
model is additive, with an increased disease risk
of r for AB genotypes, and 2r for BB genotypes.
This model shows a clear trend of an increased
number of AB and BB genotypes, with the risk for
AB genotypes approximately half that for BB
genotypes. The additive genetic model can be
tested for using Armitages test for trend.
Lewis CM. Brief Bioinform 2002 (www)
25 HLA-DRB4 Association in Childhood
ALL Homozygosity for HLA-DRB4 family is
associated with susceptibility to childhood ALL
in boys only (P lt 0.0001, OR 6.1, 95 CI
2.9 to 12.6 ) Controls are an unselected group
of local newborns (201 boys 214 girls)
Case-only analysis P 0.002 (OR 5.6 95 CI
1.8 to 17.6) This association extends to a
DRB4-HSP70 haplotype (OR 8.3 95 CI 3.0 to
22.9) This association has been replicated in
Scotland and Turkey
Girls, n53
Boys, n64
26HLA-DRB4 ASSOCIATION
ADDITIVE MODEL Linear Model Logit estimates
Number of obs
265
LR chi2(1) 14.24
Prob gt chi2
0.0002 Log likelihood -139.37794
Pseudo R2
0.0486 -------------------------------------------
----------------------------------- caco
Common Odds Ratio Std. Err. z Pgtz
95 CI -------------------------------------
----------------------------------------
drb4add 2.208651 .4734163 3.70
0.000 1.45103 - 3.36186 ------------------------
--------------------------------------------------
---- Heterozygosity and Homozygosity Logit
estimates
Number of obs 265
LR chi2(2)
22.00
Prob gt chi2 0.0000 Log
likelihood -135.49623
Pseudo R2 0.0751 ---------------------
--------------------------------------------------
------- caco Odds Ratio Std. Err.
z Pgtz 95 Conf. Interval -------------
-------------------------------------------------
--------------- Wild-type 1.00
(ref) Heterozygosity 1.060652 .3557426
0.18 0.861 .549642 2.04676 Homozygosity
6.258503 2.65464 4.32 0.000
2.72534 14.37211 -------------------------------
-----------------------------------------------
27HLA-DRB4 - HSPA1B HAPLOTYPE ASSOCIATION
EFFECT MODIFICATION Logit estimates
Number of obs
532
LR chi2(3) 23.97
Prob gt chi2
0.0000 Log likelihood -268.27826
Pseudo R2
0.0428 -------------------------------------------
----------------------------------- caco
Coef. Std. Err. z Pgtz 95
Conf. Interval ---------------------------------
--------------------------------------------
sex -.0299037 .2229554 -0.13 0.893
-.4668883 .4070808 hsp53 2.530033
.5929603 4.27 0.000 1.367852
3.692214 _IsexXhsp52 -2.758189 .8812645
-3.13 0.002 -4.485436 -1.030943
_cons -1.321474 .3517969 -3.76 0.000
-2.010984 -.6319651 ----------------------------
--------------------------------------------------
CONFOUNDING BY SEX Logit estimates
Number of obs
532
LR chi2(2) 11.99
Prob gt chi2
0.0025 Log likelihood -274.26995
Pseudo R2
0.0214 -------------------------------------------
----------------------------------- caco
Odds Ratio Std. Err. z Pgtz 95
Conf. Interval ---------------------------------
--------------------------------------------
hsp53 3.32777 1.191429 3.36 0.001
1.649684 6.712832 sex .7693041
.1636106 -1.23 0.218 .5070725
1.167148 -----------------------------------------
------------------------------------- Adjusted
for sex?
28HLA-DRB4 - HSPA1B HAPLOTYPE ASSOCIATION
BOYS ONLY Logit estimates
Number of obs 265
LR
chi2(1) 22.41
Prob gt chi2
0.0000 Log likelihood -135.29119
Pseudo R2 0.0765 --------------
--------------------------------------------------
-------------- caco Odds Ratio Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
--------------------------------------
hsp53 12.55392 7.444028 4.27 0.000
3.926876 40.13392 ----------------------------
--------------------------------------------------
GIRLS ONLY Logit estimates
Number of obs 267
LR
chi2(1) 0.13
Prob gt chi2
0.7205 Log likelihood -132.98706
Pseudo R2 0.0005 --------------
--------------------------------------------------
-------------- caco Odds Ratio Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
--------------------------------------
hsp53 0.796 .5189439 -0.35 0.726
.22181 2.856571 ----------------------------
--------------------------------------------------
The association is modified by sex
29Cardon Bell. Nat Rev Genet 2001 (www)
30Current Criteria for Good Association Studies
31Cardon Bell. Nat Rev Genet 2001 (www)
32Statistical checklist for genetic association
studies - In a case-control study Cases and
controls derive from the same study base There
are more controls than cases (up to 5-to-1, for
increased statistical power) There are at least
100 cases and 100 controls - Statistical power
calculations are presented - Hardy-Weinberg
equilibrium (HWE) is checked and appropriate
tests are used - If HWE is violated, allelic
association tests are not used - Possible
genotyping errors and counter-measures are
discussed - All statistical tests are
two-tailed - Alternative genetic models of
association considered - The choice of
marker/allele/genotype frequency (for
comparisons) is justified - For HLA associations,
a global test for association (G-test, RxC exact
test) for each locus is used (if necessary, with
correction for multiple testing) - Chi-squared
and Fisher tests are NOT used interchangeably - P
values are presented without spurious accuracy
(with two decimal places) - Strength of
association has been measured (usually odds ratio
and its 95 CI) - In a retrospective case-control
study, ORs are presented (as opposed to RRs) -
Multiple comparisons issue is handled
appropriately (this does not necessarily mean
Bonferroni corrections) - Alternative
explanations for the observed associations
(chance, bias, confounding) are discussed
http//www.dorak.info/hla/stat.html
33Multifactorial Etiology
ROCHE Genetic Education (www)
34Models of geneenvironment interactions
Hunter, 2005 (www)
35Generating Protein Diversity from the 'Small'
Genome
Banks, 2000 (www)
36Generating Protein Diversity from the 'Small'
Genome
Alternative Splicing Can Generate Very Large
Numbers of Related Proteins From a Single Gene
Most extreme example is the Drosophila Dscam
Gene
12 x 48 x 33 x 2 38,016 alternative splice
variants
Black, Cell 2000 (www) Wojtowicz, Cell 2004 (www)
DSCAM Down syndrome cell adhesion molecule
37Generating Protein Diversity from the 'Small'
Genome
Alternative Splicing Can be Tissue or
Cell-Specific
Lodish et al. Molecular Cell Biology, 5th Ed, WH
Freeman (www)
38Generating Protein Diversity from the 'Small'
Genome
mRNA editing (base modification) is a different
mechanism of alternative splicing
Lodish et al. Molecular Cell Biology, 5th Ed, WH
Freeman (www)
OMIM 107730 (www) Chen Chan, 1996
(www) Wedekind, 2003 (www) RNA Editing in The
Cell NCBI Online (www)
39 Integration of proteomics in genetic
epidemiology studies would eliminate a lot of
obstacles arising from the following Only lt2 of
the genome is protein-coding and most sequence
variants are silent changes Even genome-wide
sequence variant studies cannot identify the
genomic counterparts of 1.5 million
proteins Epigenetic changes, alternative
transcription/splicing and posttranslational
modifications cannot be predicted by study of
sequence variants Genomic DNA studies does not
take into account selective expression of genes
in certain cell or tissues No genetic association
is complete without demonstration of the
functional relevance
40 50s Rule Genome - 150 coding GeneProtein -
150 ratio Overall efficiency of pure genomic
studies 12500
41http//www.dorak.info
42(No Transcript)