Title: Association Analysis of Rare Genetic Variants
1 Association Analysis of Rare Genetic Variants
- Qunyuan Zhang
- Division of Statistical Genomics
- Course M21-621
- Computational Statistical Genetics
2 Rare Variants
- Low allele frequency, usually less than 1%
- Low power for most analyses, due to less variation of observations
- High false positive rate for some model-based analyses, due to sparse distribution of data, unstable/biased parameter estimation and inflated significance (p-values that are too small)
3 An Example of Low Power
Jonathan C. Cohen, et al.
Science 305, 869 (2004)
4 An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished)
[Figure: four Q-Q plot panels]
- N = 2500, MAF > 0.03
- N = 2500, MAF < 0.03
- N = 50000, MAF < 0.03, bootstrapped
- N = 2500, MAF < 0.03, permuted
5 Three Levels of Rare Variant Data
- Level 1: Individual-level
- Level 2: Summarized over subjects
- Level 3: Summarized over both subjects and variants
6 Level 1: Individual-level
Subject  V1  V2  V3  V4  Trait-1  Trait-2
1        1   0   0   0   90.1     1
2        0   1   0   .   99.2     1
3        0   0   0   0   105.9    0
4        0   0   0   0   89.5     0
5        0   .   0   0   97.6     0
6        0   0   0   0   110.5    0
7        0   0   1   0   88.8     0
8        0   0   0   1   95.4     1
7 Level 2: Summarized over subjects (by group)
Jonathan C. Cohen, et al.
Science 305, 869 (2004)
8 Level 3: Summarized over subjects (by group) and variants (usually by gene)
                Variant allele number  Reference allele number  Total
Low-HDL group   20                     236                      256
High-HDL group  2                      254                      256
Total           22                     490                      512
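As a rough illustration of testing the total variant frequency on this Level 3 table, the sketch below runs a two-sided Fisher exact test in plain Python. The choice of Fisher's exact test is an assumption; the slides do not say which test is applied to the table, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def pmf(x):
        # Hypergeometric probability that x of the col1 variant alleles fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = pmf(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    return sum(p for p in (pmf(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Counts taken from the Level 3 table: (variant, reference) alleles per group.
p_value = fisher_exact_two_sided(20, 236, 2, 254)
print(f"Fisher exact test on total allele counts: p = {p_value:.2e}")
```

Because both groups contribute 256 alleles, the null distribution is symmetric and the two-sided p-value is simply twice the one-sided tail.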
9 Methods For Level 3 Data
10 Single-variant Test vs. Total Freq. Test (TFT)
Jonathan C. Cohen, et al.
Science 305, 869 (2004)
11 What we have learned
- Single-variant test of rare variants has very low power for detecting association, due to extremely low frequency (usually < 0.01)
- Testing the collective effect of a set of rare variants may increase the power (sum test, collective test, group test, collapsing test, burden test)
12 Methods For Level 2 Data
- Allowing different sample sizes for different variants
- Different variants can be weighted differently
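13 CAST: A cohort allelic sums test
Morgenthaler and Thilly, Mutation Research 615 (2007) 28-56
Under H0: $S_{\mathrm{cases}}/(2N_{\mathrm{cases}}) - S_{\mathrm{controls}}/(2N_{\mathrm{controls}}) = 0$, where $S$ is the variant number and $N$ the sample size.
$T = S_{\mathrm{cases}} - S_{\mathrm{controls}} \cdot N_{\mathrm{cases}}/N_{\mathrm{controls}} = S_{\mathrm{cases}} - S^{*}_{\mathrm{controls}}$, where $S^{*}_{\mathrm{controls}}$ is the control count rescaled by $N_{\mathrm{cases}}/N_{\mathrm{controls}}$
($S$ can be calculated variant by variant and can be weighted differently; the final $T = \sum_i W_i S_i$)
$Z = T/\sqrt{\mathrm{Var}(T)} \sim N(0,1)$
$\mathrm{Var}(T) = \mathrm{Var}(S_{\mathrm{cases}} - S^{*}_{\mathrm{controls}}) = \mathrm{Var}(S_{\mathrm{cases}}) + \mathrm{Var}(S^{*}_{\mathrm{controls}}) = \mathrm{Var}(S_{\mathrm{cases}}) + \mathrm{Var}(S_{\mathrm{controls}}) \times (N_{\mathrm{cases}}/N_{\mathrm{controls}})^2$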
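The CAST statistic on this slide can be sketched in a few lines of Python. The slide does not say how Var(S) is estimated, so the binomial variance with a pooled allele frequency used below is an assumption, as is the function name. The example counts reuse the Level 3 table, reading 256 alleles per group as 128 diploid subjects.

```python
from math import sqrt

def cast_z(s_cases, n_cases, s_controls, n_controls):
    """Z statistic for T = S(cases) - S(controls) * N(cases)/N(controls)."""
    ratio = n_cases / n_controls
    t = s_cases - s_controls * ratio

    # Assumption (not from the slides): each S is binomial over 2N alleles,
    # with the allele frequency pooled across cases and controls under H0.
    p = (s_cases + s_controls) / (2 * (n_cases + n_controls))
    var_cases = 2 * n_cases * p * (1 - p)
    var_controls = 2 * n_controls * p * (1 - p)

    # Var(T) = Var(S_cases) + Var(S_controls) * (N_cases / N_controls)^2
    var_t = var_cases + var_controls * ratio ** 2
    return t / sqrt(var_t)

z = cast_z(s_cases=20, n_cases=128, s_controls=2, n_controls=128)
print(f"CAST Z = {z:.2f}")
```

When the groups differ in size, the rescaling by N(cases)/N(controls) is what keeps T centered at zero under H0.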
14 C-alpha
PLoS Genetics, 2011, Volume 7, Issue 3, e1001322
Effect direction problem
15 C-alpha
16 QQ Plots of Existing Methods (under the null)
- EFT and C-alpha: inflated with false positives
- TFT and CAST: no inflation, but assuming a single effect direction
- Objective: more general, powerful methods
[Figure: Q-Q plot panels for EFT, TFT, CAST and C-alpha]
17 More Generalized Methods For Level 2 Data
18 Structure of Level 2 data
[Figure: one table per variant: variant 1, variant 2, variant 3, ..., variant i, ..., variant k]
Strategy: Instead of testing total freq./number, we test the randomness of all tables.
19 Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
2. Calculating the log-transformed joint probability (L) for all k tables
3. Enumerating all possible tables and L scores
4. Calculating the p-value, P = Prob.( )
ASHG Meeting 1212, Zhang
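The four steps above can be sketched for a small number of variants as follows. The expression inside Prob.( ) is not legible on the slide, so the p-value here is assumed to be the total probability of all table sets whose joint log-probability L is no larger than the observed one. The function names and the tuple layout are hypothetical.

```python
from itertools import product
from math import comb, log

def table_prob(a, n_cases, n_controls, m):
    """Hypergeometric probability that a of the m variant alleles fall in cases."""
    return comb(n_cases, a) * comb(n_controls, m - a) / comb(n_cases + n_controls, m)

def ept_p_value(tables):
    """tables: one (a, n_cases, n_controls, m) tuple per variant, counts in alleles."""
    # Steps 1-2: log joint probability L of the observed tables.
    observed_l = sum(log(table_prob(*t)) for t in tables)

    # Step 3: every possible table for each variant, with margins held fixed.
    supports = []
    for _, n1, n0, m in tables:
        lo, hi = max(0, m - n0), min(m, n1)
        supports.append([table_prob(x, n1, n0, m) for x in range(lo, hi + 1)])

    # Step 4 (assumed form): add up the joint probability of all table sets
    # that are at least as unlikely as the observed set.
    p_value = 0.0
    for combo in product(*supports):
        if sum(log(p) for p in combo) <= observed_l + 1e-9:
            joint = 1.0
            for p in combo:
                joint *= p
            p_value += joint
    return p_value

# Three variants, 200 case and 200 control alleles each; the sample sizes
# could differ between variants.
example = [(3, 200, 200, 3), (0, 200, 200, 2), (2, 200, 200, 2)]
print(f"EPT p = {ept_p_value(example):.4f}")
```

Full enumeration grows multiplicatively with the number of variants, so this direct form is only practical for small k or very rare variants.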
20 Likelihood Ratio Test (LRT)
Binomial distribution
ASHG Meeting 1212, Zhang
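The slide names only the binomial distribution; the test formula itself is not legible. One plausible reading, assumed here and not confirmed by the source, compares a per-variant binomial model with separate case and control frequencies against one with a shared frequency, and sums the log-likelihood ratios over variants. Function names are hypothetical.

```python
from math import log

def binom_loglik(x, n, p):
    """Binomial log-likelihood of x successes in n trials (constant term dropped)."""
    ll = 0.0
    if x > 0:
        ll += x * log(p)
    if n - x > 0:
        ll += (n - x) * log(1 - p)
    return ll

def lrt_statistic(tables):
    """tables: one (a, n_cases, b, n_controls) tuple per variant, counts in alleles."""
    stat = 0.0
    for a, n1, b, n0 in tables:
        p_shared = (a + b) / (n1 + n0)                        # H0: one frequency
        ll_null = binom_loglik(a, n1, p_shared) + binom_loglik(b, n0, p_shared)
        ll_alt = binom_loglik(a, n1, a / n1) + binom_loglik(b, n0, b / n0)  # H1
        stat += 2 * (ll_alt - ll_null)
    return stat

stat = lrt_statistic([(20, 256, 2, 256)])
print(f"LRT statistic = {stat:.2f}")
```

Under this reading the statistic would be referred to a chi-square distribution with one degree of freedom per variant, an approximation that is least reliable when the counts are sparse.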
21 Q-Q Plots of EPT and LRT (under the null)
[Figure: four Q-Q plot panels: EPT, N = 500; LRT, N = 500; EPT, N = 3000; LRT, N = 3000]
22 Power Comparison, significance level = 0.00001
Variant proportion: positive causal 80%, neutral 20%, negative causal 0%
[Figure: three panels of power vs. sample size]
23 Power Comparison, significance level = 0.00001
Variant proportion: positive causal 60%, neutral 20%, negative causal 20%
[Figure: power vs. sample size]
24 Power Comparison, significance level = 0.00001
Variant proportion: positive causal 40%, neutral 20%, negative causal 40%
[Figure: power vs. sample size]
25 Methods For Level 1 Data
- Including covariates
- Extended to quantitative trait
- Better control for population structure
- More sophisticated models
26 Collapsing (C) test
Li and Leal, The American Journal of Human Genetics 2008 (83): 311-321
Step 1
Step 2: $\mathrm{logit}(y) = a + bX + e$ (logistic regression)
27 Variant Collapsing
         (+) (+) (.) (.)
Subject  V1  V2  V3  V4  Collapsed  Trait
1        1   0   0   0   1          1
2        0   1   0   0   1          1
3        0   0   0   0   0          0
4        0   0   0   0   0          0
5        0   0   0   0   0          0
6        0   0   0   0   0          0
7        0   0   1   0   1          0
8        0   0   0   1   1          1
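The collapsing step in this table can be reproduced with a short sketch: a subject's collapsed value is 1 if any rare variant is carried and 0 otherwise. Treating a missing genotype (`None`) as a non-carrier is an assumption made for the example, and the function name is hypothetical.

```python
# Genotypes (V1..V4) and binary trait for the eight subjects in the table.
genotypes = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
trait = [1, 1, 0, 0, 0, 0, 0, 1]

def collapse(row):
    """Return 1 if the subject carries at least one rare variant, else 0."""
    # Assumption: a missing genotype (None) counts as non-carrier.
    return int(any(g is not None and g > 0 for g in row))

collapsed = [collapse(row) for row in genotypes]

# Cross-tabulate collapsed carrier status against the trait.
counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
for x, y in zip(collapsed, trait):
    counts[(x, y)] += 1

print("Collapsed:", collapsed)
print("Carriers with trait = 1:", counts[(1, 1)], "of", counts[(1, 0)] + counts[(1, 1)])
```

The collapsed indicator is the single predictor that would then enter the regression step of the collapsing test.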
28 WSS
29 WSS
30 WSS
31 Weighted Sum Test
- Collapsing test (Li & Leal, 2008): $w_i = 1$ and $s = 1$ if $s > 1$
- Weighted-sum test (Madsen & Browning, 2009): $w_i$ calculated based on allele freq. in control group
- aSum, adaptive sum test (Han & Pan, 2010): $w_i = -1$ if $b < 0$ and $p < 0.1$, otherwise $w_i = 1$
- KBAC (Liu and Leal, 2010): $w_i$ = left-tail p-value
- RBT (Ionita-Laza et al., 2011): $w_i$ = log-scaled probability
- PWST, p-value weighted sum test (Zhang et al., 2011): $w_i$ = rescaled left-tail p-value, incorporating both significance and direction
- EREC (Lin et al., 2011): $w_i$ = estimated effect size
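As an illustration of weights "calculated based on allele freq. in control group", the sketch below follows the weighting form usually attributed to Madsen and Browning (2009): each variant is down-weighted by a standard-deviation-like term estimated from unaffected subjects, so rarer variants contribute more. The exact formula is not shown on the slides and is stated here as an assumption. The genotype data are borrowed from the variant collapsing example.

```python
from math import sqrt

genotypes = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
]
affected = [1, 1, 0, 0, 0, 0, 0, 1]

def weighted_sum_scores(genotypes, affected):
    """Per-variant weights and per-subject weighted-sum scores."""
    n_total = len(genotypes)
    n_unaffected = sum(1 for a in affected if a == 0)
    n_variants = len(genotypes[0])

    weights = []
    for i in range(n_variants):
        # Variant allele count among unaffected subjects only.
        m_u = sum(g[i] for g, a in zip(genotypes, affected) if a == 0)
        # Assumed form: q = (m_u + 1) / (2 n_u + 2), w = sqrt(n q (1 - q)).
        q = (m_u + 1) / (2 * n_unaffected + 2)
        weights.append(sqrt(n_total * q * (1 - q)))

    # Each subject's score sums variant counts divided by their weights.
    scores = [sum(g[i] / weights[i] for i in range(n_variants)) for g in genotypes]
    return weights, scores

weights, scores = weighted_sum_scores(genotypes, affected)
for subject, score in enumerate(scores, start=1):
    print(subject, round(score, 3))
```

In this toy data V3 is seen once among unaffected subjects, so its carrier (subject 7) receives a smaller score than carriers of variants never seen in that group.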
32
         (+) (+)
Subject  V1  V2  Collapsed  Trait
1        1   0   1          3.00
2        0   1   1          3.10
3        0   0   0          1.95
4        0   0   0          2.00
5        0   0   0          2.05
6        0   0   0          2.10
When there are only causal (+) variants:
Collapsing (Li & Leal, 2008) works well, power increased
33
         (+) (+) (.) (.)
Subject  V1  V2  V3  V4  Collapsed  Trait
1        1   0   0   0   1          3.00
2        0   1   0   0   1          3.10
3        0   0   0   0   0          1.95
4        0   0   0   0   0          2.00
5        0   0   0   0   0          2.05
6        0   0   0   0   0          2.10
7        0   0   1   0   1          2.00
8        0   0   0   1   1          2.10
When there are causal (+) and non-causal (.) variants:
Collapsing still works, power reduced
34
         (+) (+) (.) (.) (-) (-)
Subject  V1  V2  V3  V4  V5  V6  Collapsed  Trait
1        1   0   0   0   0   0   1          3.00
2        0   1   0   0   0   0   1          3.10
3        0   0   0   0   0   0   0          1.95
4        0   0   0   0   0   0   0          2.00
5        0   0   0   0   0   0   0          2.05
6        0   0   0   0   0   0   0          2.10
7        0   0   1   0   0   0   1          2.00
8        0   0   0   1   0   0   1          2.10
9        0   0   0   0   1   0   1          0.95
10       0   0   0   0   0   1   1          1.00
When there are causal (+), non-causal (.) and causal (-) variants:
Power of collapsing test significantly down
35 P-value Weighted Sum Test (PWST)
          (+)   (+)   (.)    (.)   (-)    (-)
Subject   V1    V2    V3     V4    V5     V6     Collapsed  pSum   Trait
1         1     0     0      0     0      0      1          0.86   3.00
2         0     1     0      0     0      0      1          0.90   3.10
3         0     0     0      0     0      0      0          0.00   1.95
4         0     0     0      0     0      0      0          0.00   2.00
5         0     0     0      0     0      0      0          0.00   2.05
6         0     0     0      0     0      0      0          0.00   2.10
7         0     0     1      0     0      0      1          -0.02  2.00
8         0     0     0      1     0      0      1          0.08   2.10
9         0     0     0      0     1      0      1          -0.90  0.95
10        0     0     0      0     0      1      1          -0.88  1.00
t         1.61  1.84  -0.04  0.11  -1.84  -1.72
p(x≤t)    0.93  0.95  0.49   0.54  0.05   0.06
2(p-0.5)  0.86  0.90  -0.02  0.08  -0.90  -0.88
Rescaled left-tail p-value, in [-1, 1], is used as weight
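The pSum column of this slide can be recomputed from the left-tail p-values in the table. The sketch below takes those p-values as given rather than deriving them from the t statistics, rescales each to 2(p - 0.5), and sums the weights over the variants a subject carries. Variable names are illustrative.

```python
# Left-tail p-values for V1..V6, copied from the p(x<=t) row of the table.
left_tail_p = [0.93, 0.95, 0.49, 0.54, 0.05, 0.06]

# Rescale each p-value from [0, 1] to a signed weight in [-1, 1].
weights = [2 * (p - 0.5) for p in left_tail_p]

# Genotypes of the ten subjects (one carrier per variant, four non-carriers).
genotypes = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

# pSum: weighted sum of genotypes for each subject.
p_sum = [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

for subject, value in enumerate(p_sum, start=1):
    print(subject, f"{value:.2f}")
```

Carriers of trait-increasing variants receive weights near +1 and carriers of trait-decreasing variants weights near -1, so the two directions no longer cancel as they do in the plain collapsed indicator.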
36 P-value Weighted Sum Test (PWST)
Power of collapsing test is retained even when there are bidirectional effects
37 PWST: Q-Q Plots Under the Null
Direct test: inflation of type I error
Corrected by permutation test (permutation of phenotype)
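A phenotype permutation test for a data-driven statistic can be sketched as follows. The statistic used here is a simple stand-in, not the PWST itself: each variant's effect direction is learned from the data and the absolute per-variant scores are summed. The key point is that the learning step is repeated inside every permutation. Function names, the number of permutations and the seed are arbitrary choices for the example.

```python
import random

genotypes = [
    [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]

def adaptive_sum_stat(genotypes, trait):
    """Toy data-driven statistic: sum over variants of |score|."""
    mean = sum(trait) / len(trait)
    stat = 0.0
    for j in range(len(genotypes[0])):
        s_j = sum(row[j] * (y - mean) for row, y in zip(genotypes, trait))
        stat += abs(s_j)  # equivalent to a weight of sign(s_j) chosen from the data
    return stat

def permutation_p_value(genotypes, trait, n_perm=2000, seed=1):
    rng = random.Random(seed)
    observed = adaptive_sum_stat(genotypes, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the phenotype, keep genotypes fixed
        if adaptive_sum_stat(genotypes, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

p_perm = permutation_p_value(genotypes, trait)
print(f"Permutation p = {p_perm:.4f}")
```

Comparing the observed statistic with a fixed reference distribution would ignore the fact that the weights were tuned to the same phenotype, which is the source of the inflation noted on this slide.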
38 Generalized Linear Mixed Model (GLMM) Weighted Sum Test (WST)
39 GLMM WST
- Y: quantitative trait or logit(binary trait)
- a: intercept
- β: regression coefficient of weighted sum
- m: number of RVs to be collapsed
- $w_i$: weight of variant i
- $g_i$: genotype (recoded) of variant i
- $\sum w_i g_i$: weighted sum (WS)
- X: covariate(s), such as population structure variable(s)
- τ: fixed effect(s) of X
- Z: design matrix corresponding to ?
- ?: random polygene effects for individual subjects, ~ $N(0, G)$, $G = 2\sigma^2 K$, where K is the kinship matrix and $\sigma^2$ the additive polygene genetic variance
- e: residual
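Putting the definitions on this slide together, the model can plausibly be written as below. The equation itself is not legible in the slide text, so this is a reconstruction from the listed terms; the symbol $u$ for the random polygene effects and $\tau$ for the fixed effects of X are stand-ins chosen for the example.

```latex
\begin{aligned}
Y &= a + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Zu + e, \\
u &\sim N(0, G), \qquad G = 2\sigma^2 K
\end{aligned}
```

Testing $\beta = 0$ then assesses the weighted sum of rare variants while the term $Zu$ absorbs correlation between relatives through the kinship matrix $K$.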
40 Weight
- Based on allele frequency: binary (0, 1) or continuous, fixed or variable threshold
- Based on functional annotation/prediction: SIFT, PolyPhen, etc.
- Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.)
- Data-driven: using both genotype and phenotype data, learning weights from data or adaptive selection, permutation test
- Any combination
41 Application 1: Family Data
- Adjusting for relatedness in family data for non-data-driven test of rare variants.
Unadjusted
Adjusted
? ~ $N(0, 2\sigma^2 K)$
42 Q-Q Plots of log10(P) under the Null
- Li & Leal's collapsing test, ignoring family structure: inflation of type-1 error
- Li & Leal's collapsing test, modeling family structure via GLMM: inflation is corrected
(From Zhang et al, 2011, BMC Proc.)
43 Application 2: Permuting Family Data
MMPT: Mixed Model-based Permutation Test
- Adjusting for relatedness in family data for data-driven permutation test of rare variants.
? ~ $N(0, 2\sigma^2 K)$
44 Permutation test, ignoring family structure: inflation of type-1 error
[Figure panels: WSS, aSum, SPWST, PWST]
(From Zhang et al, 2011, IGES Meeting)
45 Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected
[Figure panels: WSS, aSum, SPWST, PWST]
(From Zhang et al, 2011, IGES Meeting)
46 Burden Test vs. Non-burden Test
Burden test
Non-burden test
T-test, Likelihood Ratio Test, F-test, score
test,
SKAT: sequence kernel association test
47 SKAT: sequence kernel association test
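The contrast between a burden statistic and a SKAT-type statistic can be illustrated on the ten-subject example with bidirectional effects. The slide's formula is not legible, so the sketch assumes the commonly cited score form in which per-variant scores are squared before summing, with flat weights and an intercept-only null model. It does not compute a p-value, and the names are illustrative.

```python
genotypes = [
    [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]

def variant_scores(genotypes, trait):
    """Per-variant score: sum over subjects of genotype * (trait - mean)."""
    # Assumption: the null model contains only an intercept (the trait mean).
    mean = sum(trait) / len(trait)
    return [
        sum(row[j] * (y - mean) for row, y in zip(genotypes, trait))
        for j in range(len(genotypes[0]))
    ]

scores = variant_scores(genotypes, trait)

# Burden-type statistic: sum the scores first, then square.
burden_stat = sum(scores) ** 2

# SKAT-type statistic (assumed form, flat weights): square first, then sum.
kernel_stat = sum(s ** 2 for s in scores)

print(f"burden-type statistic: {burden_stat:.4f}")
print(f"SKAT-type statistic:   {kernel_stat:.4f}")
```

Because positive and negative scores cancel before squaring in the burden form, its statistic collapses toward zero here, while the squared-then-summed form keeps the signal from both directions.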
48 Extension of SKAT to Family Data
kinship matrix
Polygenic heritability of the trait
Residual
Han Chen et al., 2012, Genetic Epidemiology
49 Other problems
- Missing genotypes: imputation
- Genotyping errors: QC (family consistency, sequence review)
- Population stratification
- Inherited variants and de novo mutation
- Family data: linkage information
- Variant validation and association validation
- Public databases
- And more