Title: Bayesian Inference of Epistatic Interactions in Case-control Studies
1Bayesian Inference of Epistatic Interactions in
Case-control Studies
- Yu Zhang, Jun S. Liu
- Nature Genetics 39, 1167 - 1173 (2007)
- Presented by Yixuan Chen
- 2/29/08
2Outline
- Epistasis Background
- Methods
- The Bayesian marker partition model
- MCMC sampling
- B statistic
- Results
- Epistasis models and simulations
- Comparisons
- Genome-wide association study of AMD
- Discussions
3Epistasis
- Epistasis is a phenomenon whereby the effects of
a given gene on a biological trait are masked or
enhanced by one or more other genes.
- For complex traits such as diabetes, asthma,
hypertension, etc., the presence of epistasis is
a particular cause for concern.
- An increasing number of reports have indicated
the presence of multilocus interactions in many
human complex traits.
4Genome-wide Epistasis
- The number of possible interaction combinations
is astronomical.
- It is a daunting task to catch one or a very
few disease-related interactions.
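To give a sense of scale, the number of k-way marker combinations among L markers is the binomial coefficient C(L, k). The short sketch below only illustrates that count; the function name is hypothetical and the marker total of 100,000 is an assumed round figure, not a value taken from this slide.

```python
from math import comb

def n_interactions(n_markers: int, order: int) -> int:
    """Number of distinct marker subsets of size `order` among `n_markers`."""
    return comb(n_markers, order)

# Assumed round figure for a genome-wide SNP panel.
n_snps = 100_000
pairs = n_interactions(n_snps, 2)    # all 2-way combinations
triples = n_interactions(n_snps, 3)  # all 3-way combinations
```

With these assumed numbers there are roughly 5 billion pairs, and the count of triples is larger again by a factor of tens of thousands.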
5BEAM Algorithm
- Bayesian Epistasis Association Mapping
- A Bayesian partitioning model
- B statistic and conditional B statistic
- BEAM is significantly more powerful than the
existing approaches it is compared with.
- A genome-scale epistasis mapping is both feasible
and desirable.
6Notations
- Suppose $N_d$ cases and $N_u$ controls were genotyped
at $L$ SNP markers.
- $D = (d_1, \dots, d_{N_d})$: case genotypes
- $U = (u_1, \dots, u_{N_u})$: control genotypes
- $d_i = (d_{i1}, \dots, d_{iL})$
- $u_i = (u_{i1}, \dots, u_{iL})$
7Marker Partitioning
- The L markers are partitioned into 3 groups:
- group 0 contains markers unlinked to the disease
- group 1 contains markers contributing
independently to the disease risk
- group 2 contains markers that jointly influence
the disease risk (interactions).
8Group Membership
- Let $I = (I_1, \dots, I_L)$ indicate the membership of the
markers, with $I_j = 0$, 1, or 2 for groups 0, 1 and 2,
respectively.
- The goal is to infer the set $\{j : I_j > 0\}$.
- Let $l_0$, $l_1$, $l_2$ denote the number of markers in
each group ($l_0 + l_1 + l_2 = L$)
- Let $D_0$, $D_1$ and $D_2$ denote case genotypes of
markers in group 0, 1 and 2, respectively.
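As a minimal sketch of this bookkeeping, the membership vector can be held as a list of 0/1/2 labels, from which the group sizes, the set of associated markers, and the per-group genotype columns follow directly. The function names and the toy data are hypothetical, and markers are indexed from 0 here.

```python
from collections import Counter

def group_sizes(membership):
    """Return (l0, l1, l2), the number of markers in groups 0, 1 and 2."""
    counts = Counter(membership)
    return counts[0], counts[1], counts[2]

def associated_markers(membership):
    """Indices j with I_j > 0, i.e. the set the inference targets."""
    return {j for j, group in enumerate(membership) if group > 0}

def split_by_group(genotypes, membership):
    """Split each individual's genotype row into its group 0, 1 and 2 columns."""
    parts = []
    for g in (0, 1, 2):
        cols = [j for j, group in enumerate(membership) if group == g]
        parts.append([[row[j] for j in cols] for row in genotypes])
    return tuple(parts)

# Toy example: 5 markers, 2 cases, genotypes coded 0/1/2.
I = [0, 1, 2, 2, 0]
D = [[0, 1, 2, 0, 1],
     [2, 2, 1, 0, 0]]
D0, D1, D2 = split_by_group(D, I)
```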
9The Bayesian marker partition model
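- The likelihood model assumes independence between
markers in the control population.
- $\Theta_1 = \{(\theta_{j1}, \theta_{j2}, \theta_{j3}) : I_j = 1\}$: the genotype
frequencies of each biallelic marker in group 1 in
the disease population
- The likelihood of $D_1$ is
$P(D_1 \mid \Theta_1) = \prod_{j : I_j = 1} \prod_{k=1}^{3} \theta_{jk}^{n_{jk}}$
- $n_{j1}, n_{j2}, n_{j3}$ are the genotype counts of marker
$j$ among the cases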
10Dirichlet Distribution
- A family of continuous multivariate probability
distributions parameterized by the vector $\alpha$ of
positive reals
- The probabilities of K rival events are $x_i$.
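For reference, the standard Dirichlet density over the probabilities $x_1, \dots, x_K$ (with $x_i \ge 0$ and $\sum_i x_i = 1$) can be written as follows; the notation is the usual textbook one and is assumed here, since the slide's own formula did not survive.

```latex
f(x_1, \dots, x_K; \alpha)
  = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
    \prod_{i=1}^{K} x_i^{\alpha_i - 1}
```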
11Dirichlet Prior for Group 1
- Assume a Dirichlet($\alpha$) prior for $\{\theta_{j1}, \theta_{j2}, \theta_{j3}\}$,
where $\alpha = (\alpha_1, \alpha_2, \alpha_3)$
- Integrate out $\Theta_1$ and obtain the marginal
probability
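Integrating a multinomial likelihood against a Dirichlet prior gives the standard Dirichlet-multinomial result: for one marker with genotype counts $n_1, n_2, n_3$, the marginal is $\frac{\Gamma(|\alpha|)}{\Gamma(N + |\alpha|)} \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}$, with $N = \sum_k n_k$ and $|\alpha| = \sum_k \alpha_k$. The sketch below evaluates that expression on the log scale and sums it over group 1 markers. It is an illustration under that assumption, not the authors' code, and the function names are hypothetical.

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alpha):
    """Log marginal probability of genotype counts at one marker,
    with the genotype frequencies integrated out under Dirichlet(alpha)."""
    n_total = sum(counts)
    a_total = sum(alpha)
    value = lgamma(a_total) - lgamma(n_total + a_total)
    for n_k, a_k in zip(counts, alpha):
        value += lgamma(n_k + a_k) - lgamma(a_k)
    return value

def log_marginal_group1(count_table, alpha):
    """Sum of per-marker log marginals; markers are treated as independent."""
    return sum(log_dirichlet_multinomial(counts, alpha) for counts in count_table)

# One case with genotype 1 under a uniform prior: marginal probability is 1/3.
single = log_dirichlet_multinomial((1, 0, 0), (1.0, 1.0, 1.0))
```

Working with `lgamma` avoids overflow, since the Gamma function itself grows too quickly for realistic sample sizes.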
12Dirichlet Prior for Group 2
- $3^{l_2}$ possible genotype combinations, with
frequencies $\Theta_2$
- $n_k$: the number of cases with genotype combination $k$
- Assume a Dirichlet($\beta$) prior distribution for $\Theta_2$
- Integrate out $\Theta_2$
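The same Dirichlet-multinomial calculation applies to group 2, except that the categories are the $3^{l_2}$ joint genotype combinations rather than the 3 genotypes of a single marker. The sketch below assumes a symmetric prior in which every combination gets the same parameter `beta`; that choice and the function name are assumptions made for illustration.

```python
from collections import Counter
from math import lgamma

def log_marginal_joint(case_rows, beta):
    """Log marginal probability of joint genotypes at the group 2 markers.

    case_rows: one tuple of genotypes (0/1/2) per case, over the l2 markers.
    beta: symmetric Dirichlet parameter shared by all 3**l2 combinations.
    """
    l2 = len(case_rows[0])
    n_cells = 3 ** l2
    n_cases = len(case_rows)
    counts = Counter(tuple(row) for row in case_rows)
    value = lgamma(n_cells * beta) - lgamma(n_cases + n_cells * beta)
    # Combinations that never occur contribute lgamma(beta) - lgamma(beta) = 0.
    for n_k in counts.values():
        value += lgamma(n_k + beta) - lgamma(beta)
    return value

# One case at two interacting markers under a uniform prior: probability 1/9.
one_case = log_marginal_joint([(0, 1)], 1.0)
```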
13Dirichlet Prior for Group 0
- Same distributions as controls
- $\Theta$: the genotype frequencies of the L markers in the
control population, with $\theta_j$ the frequencies at
marker $j$
- $n_{jk}$ and $m_{jk}$ are the numbers of individuals with
genotype $k$ at marker $j$ in $D$ and $U$
14Dirichlet Prior for Group 0 (c1)
- Assume Dirichlet priors, with given parameters,
for $\theta_j$
- Integrate out $\Theta$
15Posterior Distribution
- The posterior distribution of I
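By Bayes' rule, the posterior of the membership vector combines the three marginal probabilities from the preceding slides with a prior on $I$. A form consistent with those slides would be the following, where the symbol $P(D_0, U \mid I)$ for the group 0 and control term and the prior $P(I)$ are notation assumed here.

```latex
P(I \mid D, U) \;\propto\; P(D_1 \mid I)\, P(D_2 \mid I)\, P(D_0, U \mid I)\, P(I)
```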
16Markov chain Monte Carlo
- Markov chain Monte Carlo (MCMC) methods are a
class of algorithms for sampling from probability
distributions
- Based on constructing a Markov chain that has the
desired distribution as its stationary
distribution.
- The state of the chain after a large number of
steps is then used as a sample from the desired
distribution.
- The quality of the sample improves as a function
of the number of steps.
17Metropolis-Hastings algorithm
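The slide body did not survive, so here is a minimal sketch of the general idea: a random-walk Metropolis sampler, which is the special case of Metropolis-Hastings with a symmetric proposal. The target (a standard normal known only up to a constant), step size, and seed are illustrative assumptions and have nothing to do with BEAM's actual state space.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target returns the log of the target density up to a constant.
    """
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step_size)  # symmetric proposal
        log_q = log_target(y)
        # Accept with probability min(1, target(y) / target(x)).
        if rng.random() < math.exp(min(0.0, log_q - log_p)):
            x, log_p = y, log_q
        samples.append(x)  # a rejected move repeats the current state
    return samples

# Usage: sample from a standard normal and discard the burn-in.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=3.0, n_steps=20000)
kept = draws[2000:]
mean = sum(kept) / len(kept)
variance = sum((v - mean) ** 2 for v in kept) / len(kept)
```

Only the ratio of target densities is needed, which is why the normalizing constant can be ignored.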
18M-H in BEAM
- Two types of proposals are used
- Randomly change a marker's group membership
- Randomly exchange two markers between groups 0, 1
and 2.
- The proposed move is accepted according to the
M-H ratio, which is just a ratio of Gamma
functions.
- To improve the sampling efficiency
- Set a lower bound on the number of markers in
group 2
- Gradually reduce this bound to 0 during burn-in
- This forces the algorithm to explore the space of
high-order interactions.
- Also used an annealing strategy in burn-in
iterations
- a temperature set high initially and gradually
reduced to 1
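The two proposal types and the annealed acceptance step might be sketched as follows. This is an illustration only: the function names, the linear cooling schedule, and the way the exchange partner is drawn are assumptions, and the lower bound on the size of group 2 is not implemented. Both proposals as written are symmetric, so the acceptance ratio reduces to a ratio of posterior values.

```python
import math
import random

def propose_change(membership, rng):
    """Move one randomly chosen marker to a different group."""
    new = list(membership)
    j = rng.randrange(len(new))
    new[j] = rng.choice([g for g in (0, 1, 2) if g != new[j]])
    return new

def propose_exchange(membership, rng):
    """Swap the group labels of two markers that sit in different groups."""
    new = list(membership)
    j = rng.randrange(len(new))
    others = [i for i, g in enumerate(new) if g != new[j]]
    if others:  # no move is possible if all markers share one group
        i = rng.choice(others)
        new[i], new[j] = new[j], new[i]
    return new

def temperature(step, n_burn_in, t_start):
    """Assumed linear cooling from t_start down to 1 over the burn-in."""
    if step >= n_burn_in:
        return 1.0
    return t_start + (1.0 - t_start) * step / n_burn_in

def accept(log_post_new, log_post_old, temp, rng):
    """Metropolis acceptance with the log posterior ratio divided by temp."""
    log_ratio = (log_post_new - log_post_old) / temp
    return rng.random() < math.exp(min(0.0, log_ratio))
```

A high temperature flattens the posterior ratio, so unfavorable moves are accepted more often early in the burn-in.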
19The simulated data for Model 4 contain 1,000
markers genotyped in 1,000 cases and 1,000 controls,
with MAF 0.1 and marginal effect size 0.4. BEAM was
run for 150,000 burn-in iterations plus 200,000
sampling iterations in three chains, with priors
$p_1 = p_2 = 1/3$.
20Circles denote the overall posterior
probabilities of associations, with marginal and
joint associations combined. Plus signs denote
posterior probabilities of marginal associations.
Three circles on the top correspond to the three
simulated disease markers having interaction
effects.
21B Statistic
- A hypothesis-testing procedure to check each
marker or set of markers for significant
associations
- For each set M of k markers to be tested, the
null hypothesis is that markers in M are not
associated with the disease.
- Here, $k = 1, 2, 3, \dots$ represents single-marker,
two-way and three-way interactions, etc.
22B Statistic (c1)
- Define the B statistic for the marker set M
- $P_0(D_M, U_M)$ and $P_A(D_M, U_M)$ are the marginal
probabilities of the data, with parameters
integrated out of our Bayesian model, under the
null and the alternative models, respectively
- Their ratio is a Bayes factor
23Bayes Factor
- Given a model selection problem between two
models $M_1$ and $M_2$, on the basis of a data vector
$x$, the Bayes factor $K$ is given by
$K = p(x \mid M_1) / p(x \mid M_2)$
- $p(x \mid M_i)$ is the marginal likelihood for $M_i$.
- Similar to a likelihood-ratio test, but instead of
maximizing the likelihood, Bayesians average it
over the parameters
24B Statistic (c2)
- Choose both $P_0$ and $P_A$ as an equal mixture of two
distributions
- One that assumes independence among markers in M,
$P_{ind}(X)$, of the form of $P(D_1 \mid I)$
- The other a saturated joint distribution of
genotype combinations across all markers in M,
$P_{join}(X)$, of the form of $P(D_2 \mid I)$
- Under the null hypothesis, the B statistic is
asymptotically distributed as a shifted $\chi^2$ with
$3^k - 1$ degrees of freedom
25Conditional B Statistic
- A set of $k$ ($= 2, 3, \dots$) markers may include $t$ ($< k$)
markers that are significant through either
marginal or partial interaction associations.
- The asymptotic null distribution of $B_{M|T}$ is a
shifted $\chi^2$ with $3^k - 3^t$ degrees of freedom.
26Simulated Epistasis Models
27Simulated Epistasis Models (c1)
28Simulated Epistasis Models (c2)
- Model 6 is a 6-way interaction model
- Denote the genotypes of each SNP by 0, 1, and 2
- Code each genotype combination over 6 disease
loci by integers between 0 and 728
- Assign a disease effect of 50 to genotype
combinations 4, 5, 7, 111, 114, 253, 254, 360,
387, 603, and 630.
- The effect is set to 50 so that these genotype
combinations can explain a non-trivial portion
(>10%) of cases.
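Six loci with three genotypes each give $3^6 = 729$ combinations, which matches the integer range 0 to 728. One natural coding is to read the six genotypes as a base-3 number, sketched below. The slide does not say which locus is the most significant digit, so the digit order used here is an assumption, as are the function names.

```python
def encode(genotypes):
    """Map a tuple of genotypes (each 0, 1 or 2) to a base-3 integer.
    The first locus is treated as the most significant digit (assumed)."""
    code = 0
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError("genotypes must be coded 0, 1 or 2")
        code = code * 3 + g
    return code

def decode(code, n_loci=6):
    """Inverse of encode: recover the genotype tuple from its integer code."""
    digits = []
    for _ in range(n_loci):
        code, g = divmod(code, 3)
        digits.append(g)
    return tuple(reversed(digits))

# Under this assumed digit order, code 4 has genotype 1 at the last two loci.
example = decode(4)
```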
29Simulated Epistasis Models (c3)
- 50 data sets for each disease model were
simulated under each setting
- Marker minor allele frequencies (MAF) are
uniformly distributed in [0.05, 0.5].
- Each untyped disease locus is linked to one
genotyped marker
- The remaining markers are unlinked
30Comparison Algorithms
- The stepwise logistic regression approach
- All markers are individually tested and ranked
for marginal associations.
- The top 10% of markers are selected, among which
all k-way ($k = 2$ or 3) interactions are tested and
ranked
- A $\chi^2$ test with two degrees of freedom to test for
single-marker associations.
- A stepwise B-stat
- the same search strategy as stepwise logistic
regression
- use the B statistic for testing significance
31Power Calculation
- Define the power of each method as the proportion
of 50 data sets in which all truly associated
markers are identified and show statistically
significant associations (adjusted P values below
0.1) with the disease.
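Read literally, this definition is a proportion over data sets, which can be sketched as below. The input layout (one dictionary of adjusted P values per data set) and the function name are hypothetical.

```python
def estimate_power(adjusted_p_by_dataset, true_markers, threshold=0.1):
    """Fraction of data sets in which every truly associated marker
    has an adjusted P value below the threshold."""
    hits = 0
    for adjusted_p in adjusted_p_by_dataset:
        # A marker that was never reported counts as not significant.
        if all(adjusted_p.get(m, 1.0) < threshold for m in true_markers):
            hits += 1
    return hits / len(adjusted_p_by_dataset)

# Toy example with three data sets and two true markers.
results = [
    {"a": 0.01, "b": 0.05},  # both markers found
    {"a": 0.01, "b": 0.50},  # marker b not significant
    {"a": 0.01},             # marker b not reported at all
]
power = estimate_power(results, {"a", "b"})
```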
32A Hierarchical Procedure to Declare Significance
- Marginal associations
- report all markers with significant marginal
associations after a Bonferroni correction for
the number of markers, L.
- 2-way interactions
- after the Bonferroni correction for L(L-1)/2
tests
- report all significant novel 2-way interactions
(neither marker has been reported earlier)
- if one marker has been reported earlier
- compute its conditional B-statistic (or the
conditional log likelihood ratio (LLR) for
logistic regression)
- report the interaction if significant after a
Bonferroni correction.
33A Hierarchical Procedure to Declare Significance
(c1)
- 3-way interactions
- report all novel 3-way interactions that are
significant after a Bonferroni correction for
L(L-1)(L-2)/6 tests
- if $t = 1$ or 2 markers were already found
significant
- calculate the conditional B-statistic (or the
conditional LLR)
- report the interaction if it is still significant
after a Bonferroni correction.
- All p-values were estimated by a chi-square
distribution with $d = 3^k - 3^t$ degrees of freedom, for
$k = 1, 2, 3$, $t = 0, 1, 2$, $t < k$, and adjusted by
Bonferroni corrections.
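The degrees of freedom and the test counts on these two slides are simple arithmetic, sketched below. The Bonferroni helper covers only the unconditional case of all k-way subsets among L markers; how many tests the conditional step corrects for is not stated on the slides, so it is left out. Function names are hypothetical.

```python
from math import comb

def degrees_of_freedom(k, t=0):
    """d = 3**k - 3**t for a k-marker test, conditional on t of those
    markers already being significant (t = 0 for an unconditional test)."""
    if not 0 <= t < k:
        raise ValueError("require 0 <= t < k")
    return 3 ** k - 3 ** t

def bonferroni_adjust(p_value, n_markers, k):
    """Multiply by the number of k-way marker subsets, capped at 1."""
    return min(1.0, p_value * comb(n_markers, k))

# With 1,000 markers there are 499,500 pairs to correct for.
adjusted = bonferroni_adjust(1e-8, 1000, 2)
```

For $k = 1$ and $t = 0$ this gives 2 degrees of freedom, in line with the 2-d.f. single-marker test used as a comparison method.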
34Results
- The power for detecting marginal associations was
not compromised by using the more complex models.
Figure legend: BEAM (B), the stepwise B-stat (S),
stepwise logistic regression (L), and the 2-d.f.
$\chi^2$ test (C). Each data set contains 1,000 markers.
Black bars represent the power for 1,000 cases and
1,000 controls. Gray bars represent the power for
2,000 cases and 2,000 controls.
35BEAM (B), the stepwise B-stat (S), stepwise
logistic regression (L), and the 2-d.f. $\chi^2$ test (C)
36BEAM (B), the stepwise B-stat (S), stepwise
logistic regression (L), and the 2-d.f. $\chi^2$ test (C).
Each data set contains 1,000 markers. Black bars
represent the power for 1,000 cases and 1,000
controls. Gray bars represent the power for 2,000
cases and 2,000 controls.
37Results (c1)
- BEAM performs better, especially when either
disease allele frequencies or marginal effects
are small.
- The power of all methods decreases with the decay
of the LD (measured in $r^2$) between disease loci
and associated markers.
38Type I Errors
- All three epistasis mapping methods made similar
numbers of type I errors.
- At the 0.1 significance level, they all made 10%
type I errors (after Bonferroni correction) when
searching only for marginally significant
markers.
- All methods made far fewer than 10% type I
errors when searching for interactions.
40Impact of mismatch in allele frequencies and LD
- The power of association mapping can be greatly
hampered by the discrepancy of allele frequencies
between unobserved disease loci and associated
genotyped markers.
- Investigated the impact based on model 2
- MAFs at two interacting disease loci were both
0.1
- The marginal effect size per disease locus was
0.5
- One linked marker had the matched MAF, whereas
the other had an MAF ranging from 0.05 to 0.5.
- The LD between disease loci and associated
markers was controlled to range from D' = 0.7 to
D' = 1.
42Genome-wide association study of AMD
- The AMD (age-related macular degeneration) data
set contains 116,204 SNPs genotyped for 96
affected individuals and 50 controls.
- Remove nonpolymorphic SNPs and those that
significantly deviated from Hardy-Weinberg
Equilibrium (HWE)
- Remove additional SNPs containing more than five
missing genotypes.
- After filtering, 96,932 SNPs remained.
44Prior Calibration
- With only 146 individuals and 100,000 SNPs, the
posterior probability of associations for each
marker is strongly influenced by the choice of
priors, although the order of these probabilities
is nearly invariant.
46Simulation Based on AMD Data
47Comparison with other epistasis mapping approaches
- MDR identifies k-way interactions through an
exhaustive search and evaluates the association
between each interaction and the disease by
cross-validation.
- Logic regression infers a tree-based relationship
between the disease status and a set of
markers. It evaluates the detected associations by
permutation tests.
- BGTA uses a bootstrap-type resampling screening
procedure to select markers, and those markers
with return frequencies greater than the third
quartile plus 1.8 times the interquartile range
are deemed disease-associated markers.
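The BGTA selection rule quoted above (third quartile plus 1.8 times the interquartile range) can be sketched with the standard library. Quartile conventions differ between implementations; this sketch uses the default method of Python's `statistics.quantiles`, which may not match the one BGTA uses. The function names and the toy return frequencies are hypothetical.

```python
from statistics import quantiles

def return_frequency_cutoff(frequencies, multiplier=1.8):
    """Third quartile plus `multiplier` times the interquartile range."""
    q1, _, q3 = quantiles(frequencies, n=4)
    return q3 + multiplier * (q3 - q1)

def select_markers(return_frequency, multiplier=1.8):
    """Markers whose return frequency exceeds the cutoff."""
    cutoff = return_frequency_cutoff(list(return_frequency.values()), multiplier)
    return {m for m, f in return_frequency.items() if f > cutoff}

# Toy return frequencies: only marker "g" stands out.
toy = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 100}
selected = select_markers(toy)
```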
49Discussions
- The BEAM algorithm has two essential components
- a Bayesian epistasis inference tool implemented
via MCMC
- a novel test statistic for evaluating statistical
significance
- A natural advantage of the Bayesian approach
- incorporate prior knowledge about each marker
- quantify all information and uncertainties in the
form of posteriors
- Evaluating the statistical significance of a
candidate finding via P values
- more robust to model choice and prior assumptions
- can give the scientist peace of mind
50Discussions (c1)
- The power of epistasis mapping depends
critically on
- sample size
- effects of disease mutations
- any discrepancy in allele frequencies between
disease loci and associated markers
- There are several issues that may affect the
accuracy
- population substructures
- genotyping errors
- disease heterogeneities
51THANK YOU