Gene Mapping with Bayesian Variable Selection and MCMC

Slides: 51

Provided by: michaelda3

1
Gene Mapping with Bayesian Variable Selection and
MCMC

Michael Swartz
mswartz_at_stat.tamu.edu

2
Outline

Intro to Genetics
Intro to Gene mapping, Association studies
The Conditional logistic regression model for
Gene mapping
Bayesian Model Selection
Stochastic Search Variable Selection
Stochastic Search Gene Suggestion (SSGS)
Performance on Simulated Data
SSGS vs the MLE.

3
Intro to Genetics
4
Picture Book of Genetics
Gene: a specific coding region of DNA
Chromosomes: line up genes
Locus: a gene's position
Alleles
Haplotype: one
Genotype: both
Molecular marker: a polymorphic locus with a known position on the chromosome
5
Linkage

Violates Mendel's second law (genes segregate independently)

Allows us to measure genetic distance

Biological source of linkage: meiosis, the process of cell division that produces haploid gametes.

Genes that co-segregate in the recombinant gametes are linked.

6
Linkage Disequilibrium

Association of alleles in a population

7
Gene Mapping: Association Studies
8
Data: The Case-Parent Triad
Collect haplotype information on the parents (G) as well as the case (g), so that we have information about both the transmitted and the non-transmitted haplotypes. Model the probability of transmission.
9
Gene Mapping by Association

Transmission Disequilibrium Test (TDT)
Uses transmitted and non-transmitted alleles in case-parent triads to jointly test for linkage and linkage disequilibrium
Based on McNemar's test for case-control data
Tests for association between two loci at a time
Log-linear models
Also used for case-control data
TDT triads can be modeled with conditional logistic regression for case-control data (Self et al., 1991; Thomas et al., 1995)
Extends the TDT to multiple loci
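
As a minimal sketch of the McNemar-type statistic behind the TDT for a single diallelic marker: among heterozygous parents, count how often the allele of interest is transmitted (b) versus not transmitted (c) and compare the two. The function name and the counts are hypothetical, chosen only for illustration.

```python
def tdt_statistic(b: int, c: int) -> float:
    """McNemar-type TDT chi-square statistic (1 degree of freedom).

    b: heterozygous parents who transmitted the allele of interest
    c: heterozygous parents who transmitted the other allele
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


# Hypothetical counts, for illustration only.
stat = tdt_statistic(b=60, c=40)
print(stat)
```

Under the null hypothesis of no linkage or no association, the statistic is compared against a chi-square distribution with 1 degree of freedom.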

10
Advantages of a Log-Linear Model

Using a Bayesian model, we can incorporate genetic association between the markers.
Easy to analyze multiple loci
Easy to consider gene × gene interactions
Easy to consider haplotypes
Easy to consider environmental effects
Easy to consider gene × environment effects

12
Coding the Triads (Thomas et al., 1995; Schaid, 1996)

Example: 3 diallelic loci.
Recall gip and GTip from the case-parent triad.
For the logistic regression model we use $Z_i = g_{im} + g_{if}$.
This is known as the GTDT coding scheme (Schaid, 1996).
Using haplotypes in conditional logistic regression is one way to examine complex diseases using triads.

13
Sampling Distribution for Triads
14
The Sampling Distribution: a Conditional Logistic Function (Thomas et al., 1995; Self et al., 1991)
where G is the set of all possible transmitted genotypes given the parents' genotypes (pseudo-controls)
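
For orientation, a conditional logistic sampling distribution for one triad can be sketched as follows. This assumes the usual matched-set form, in which every genotype in the pseudo-control set contributes a weight $\exp(Z^\top \beta)$. The symbols $Z_{g^*}$ (the coded covariate vector of a candidate genotype $g^*$) and $\mathcal{G}_i$ (the set G for triad $i$) are notation introduced here for illustration.

```latex
P(g_i \mid G_{im}, G_{if}, \beta)
  = \frac{\exp\left(Z_i^\top \beta\right)}
         {\sum_{g^* \in \mathcal{G}_i} \exp\left(Z_{g^*}^\top \beta\right)}
```

With $\beta = 0$, each of the four possible children is equally likely, which corresponds to Mendelian transmission.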
15
Identifiability for Conditional Logistic
Regression Parameters

Gene mapping with conditional logistic regression (CLR) uses categorical covariates (genotype or haplotype).
For identifiability, we must define a reference category for each locus.
Choose the most prevalent allele at each locus as its reference allele.

16
Calculating Prevalence from Triads (Thomas et al., 1995)

Let $C_{la}$ denote the number of haplotypes in the cases that carry allele $a$ at locus $l$.
Likewise, let $P_{la}$ denote the number of haplotypes in the parents that carry allele $a$ at locus $l$.
If $N$ denotes the total number of triads, then the prevalence of allele $a$ at locus $l$ can be calculated by $(P_{la} - C_{la})/2N$, the frequency of allele $a$ among the $2N$ non-transmitted parental haplotypes.
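
The prevalence calculation and the reference-allele choice from the previous slide can be sketched as below. The sketch assumes the operator between the parental and case counts is a subtraction, so that the estimate is the allele frequency among the 2N non-transmitted haplotypes. The function names and counts are hypothetical.

```python
def allele_prevalence(parent_counts, case_counts, n_triads):
    """Prevalence of each allele at one locus: (P_la - C_la) / (2N).

    Assumption: the counts are subtracted, i.e. the estimate uses the
    2N parental haplotypes that were not transmitted to the case.
    """
    return {
        allele: (parent_counts[allele] - case_counts.get(allele, 0)) / (2 * n_triads)
        for allele in parent_counts
    }


def reference_allele(prevalence):
    """Pick the most prevalent allele as the reference category."""
    return max(prevalence, key=prevalence.get)


# Hypothetical counts for one locus in N = 50 triads.
parents = {"a1": 120, "a2": 60, "a3": 20}  # 4N = 200 parental haplotypes
cases = {"a1": 55, "a2": 35, "a3": 10}     # 2N = 100 case haplotypes
prev = allele_prevalence(parents, cases, n_triads=50)
ref = reference_allele(prev)
print(prev, ref)
```

Under this reading the prevalences at a locus sum to one, which is a quick sanity check on the counts.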

17
Using CLR to infer genes

Frequentist
Make inference on the maximum likelihood estimates of the $\beta$ parameters in the CLR model.
Requires numerical optimization
Prepackaged in the STATA clogit command.
Bayesian
Calculate the posterior distribution and make inference from the appropriate summaries
Requires Markov chain Monte Carlo posterior simulation
Implemented in Stochastic Search Gene Suggestion (SSGS)

18
Bayesian Model Selection
19
Hierarchical Bayesian Setup for Variable Selection

Use a hierarchical Bayesian method

$\gamma$ is an indicator vector of the variables, and $\beta(\gamma)$ is the vector of coefficients for model $\gamma$.

Make inferences from the variable posterior

20
Advantages to Bayesian Hierarchical Modeling

Account for prior information
Allow for Bayesian variable selection techniques
Make inference from the model posterior
No multiple-testing adjustment, because inference is stated directly as posterior probabilities

21
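Stochastic Search Variable Selection (George and McCulloch, 1993)

Linear regression: introduce a latent variable to indicate each covariate's importance.
The hierarchy allows prior information to enter the model and be updated by the data
Likelihood: $Y \mid \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I)$
Model prior: $\gamma \sim \text{Binomial}(p)$
Parameter priors:
$\beta \mid \gamma \sim N_p(0, D_\gamma R D_\gamma)$
$\sigma^2 \mid \gamma \sim IG(\nu_\gamma/2, \nu_\gamma \lambda_\gamma/2) \iff \nu_\gamma \lambda_\gamma / \sigma^2 \sim \chi^2_{\nu_\gamma}$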

22
Stochastic Search Variable Selection (Continued)

Full conditionals for $\beta$, $\gamma$, and $\sigma^2$ are recognizable $\Rightarrow$ Gibbs sampling
Generalized to various GLMs (George, McCulloch, and Tsay, 1996; Ntzoufras, Forster, and Dellaportas, 2000; and a few others).
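
The Gibbs scheme for the linear regression case can be sketched as follows. This is a simplified illustration and not the presenter's code: it assumes $R = I$, a common prior inclusion probability p, fixed hyperparameters nu and lam for the inverse gamma prior, and no burn-in. All function and variable names are hypothetical.

```python
import numpy as np


def ssvs_gibbs(X, y, n_iter=2000, tau=0.1, c=10.0, p=0.5, nu=1.0, lam=1.0, seed=0):
    """SSVS Gibbs sampler sketch for linear regression (assumes R = I).

    Returns the fraction of iterations in which each covariate was included.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    gamma = np.ones(k, dtype=int)
    sigma2 = 1.0
    draws = np.zeros((n_iter, k), dtype=int)
    for it in range(n_iter):
        # beta | gamma, sigma2, y ~ N(A X'y / sigma2, A)
        prior_sd = np.where(gamma == 1, c * tau, tau)
        A = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / prior_sd**2))
        A = (A + A.T) / 2.0  # guard against round-off asymmetry
        beta = rng.multivariate_normal(A @ Xty / sigma2, A)
        # sigma2 | beta, y ~ IG((n + nu) / 2, (RSS + nu * lam) / 2)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma((n + nu) / 2.0, 2.0 / (rss + nu * lam))
        # gamma_i | beta_i ~ Bernoulli(a / (a + b)): slab vs spike density
        a = p * np.exp(-0.5 * (beta / (c * tau)) ** 2) / (c * tau)
        b = (1.0 - p) * np.exp(-0.5 * (beta / tau) ** 2) / tau
        gamma = (rng.random(k) < a / (a + b)).astype(int)
        draws[it] = gamma
    return draws.mean(axis=0)


# Simulated data: only the first covariate has a real effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(size=100)
incl = ssvs_gibbs(X, y)
print(incl)
```

Each entry of the returned vector estimates a marginal posterior inclusion probability. A covariate with a strong effect should be included in nearly every iteration.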

23
Stochastic Search Gene Suggestion

Extends Stochastic Search Variable Selection (George and McCulloch, 1993)
Introduces two latent variables to indicate a gene's importance in the model: one for loci and one for alleles.
Induces a hierarchy that allows prior information about genes to enter the model
Genetic structure
Genetic correlation
The hierarchical nature allows the data to update the probability of including a particular gene

24
Priors for Gene Suggestion

Use two priors for gene suggestion
One indicator vector for locus selection, $\gamma = (\gamma_1, \ldots, \gamma_L)$,

where $p_l$ = P(locus $l$ is associated with the disease)

One indicator vector for allele selection given each locus, $\delta$. Each element $\delta_{la}$ pertains to a particular allele at locus $l$.

where $q_{la}$ = P(allele $a$ at locus $l$ causes disease)
25
Prior for Allele Main Effects $\pi(\beta \mid \gamma, \delta)$: Allelic Dependence in Model Selection

The prior for main effects models the genetic dependencies between loci and alleles

where
with each $k_{la}$ defined as
26
How SSGS works

Exploits the MVN covariance matrix $D_\gamma R D_\gamma$ (George and McCulloch, 1993)
If the indicator is 0, then $\tau_{la}$ concentrates the probability of $\beta_{la}$ around 0
If the indicator is 1, then $\tau_{la} c_{la}$ expands the probability of $\beta_{la}$ to cover reasonable values
Automatic methods for choosing $\tau$ and $c$ are in the paper
Subjectively:
choose $\tau_{la}$ such that $-3\tau_{la} < \beta_{la} < 3\tau_{la}$ implies $\beta_{la} \approx 0$
choose $c_{la}$ such that $3\tau_{la} c_{la}$ covers reasonable values for $\beta_{la}$
Model information is contained in $P(\gamma \mid \text{Data})$
An $R$ based on linkage disequilibrium can be helpful for gene mapping
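
The subjective rule for the two scale parameters can be turned into a small calculation. This is a sketch under the assumption that the analyst supplies two numbers: the largest effect size still regarded as negligible, and the largest effect size regarded as plausible. Both argument names are hypothetical.

```python
def spike_slab_scales(negligible: float, plausible_max: float):
    """Choose tau and c from two subjective bounds on an effect size.

    tau: 3 * tau equals the largest |beta| treated as effectively zero
    c:   3 * tau * c equals the largest |beta| considered reasonable
    """
    tau = negligible / 3.0
    c = plausible_max / (3.0 * tau)
    return tau, c


# Hypothetical bounds, for illustration only.
tau, c = spike_slab_scales(negligible=0.6, plausible_max=6.0)
print(tau, c)
```

A larger c widens the gap between the "excluded" and "included" prior distributions, which makes the two states easier for the sampler to tell apart.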

27
The Prior Covariance Matrix

Define the diagonal blocks $(l_i, l_i)$ using the covariance of a multinomial distribution based on the allele frequencies, assuming they are constant across generations.
Determine the off-diagonal blocks $(l_i, l_j)$, $\forall i \neq j$, using the allelic disequilibrium between the alleles at locus $i$ and locus $j$.
Define $R = L^{-1}$
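
A two-locus version of this construction can be sketched as below. The sketch makes several assumptions that the slide does not state: haplotype frequencies are available as a table, the reference (most prevalent) allele at each locus is dropped so the matrix is invertible, the disequilibrium coefficient is the usual haplotype frequency minus the product of allele frequencies, and R is read as the inverse of the assembled block matrix. All names and frequencies are hypothetical.

```python
import numpy as np


def multinomial_cov(freq):
    """Covariance of a single multinomial draw with the given frequencies."""
    return np.diag(freq) - np.outer(freq, freq)


def prior_covariance(hap_freq):
    """Assemble the block matrix for two loci from a haplotype frequency table.

    hap_freq[a, b] is the frequency of the haplotype carrying allele a at
    locus 1 and allele b at locus 2 (assumed input format).
    """
    p = hap_freq.sum(axis=1)  # allele frequencies at locus 1
    q = hap_freq.sum(axis=0)  # allele frequencies at locus 2
    keep1 = np.arange(len(p)) != p.argmax()  # drop reference allele
    keep2 = np.arange(len(q)) != q.argmax()
    block11 = multinomial_cov(p)[np.ix_(keep1, keep1)]
    block22 = multinomial_cov(q)[np.ix_(keep2, keep2)]
    diseq = hap_freq - np.outer(p, q)  # D_ab = h_ab - p_a * q_b
    block12 = diseq[np.ix_(keep1, keep2)]
    return np.block([[block11, block12], [block12.T, block22]])


# Hypothetical haplotype frequencies: 3 alleles at locus 1, 2 at locus 2.
hap = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.10, 0.20]])
Lmat = prior_covariance(hap)
R = np.linalg.inv(Lmat)
print(Lmat)
```

When the loci are in linkage equilibrium, the off-diagonal block is zero and the matrix is block diagonal.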

28
Sampling from the Posterior

No full conditional for updating $\beta$
Use a hybrid of Gibbs sampling and the Metropolis-Hastings algorithm to construct a Markov chain.
Full conditionals for updating $\gamma$ and $\delta$
Metropolis-Hastings acceptance ratio for updating $\beta$ by locus
For a given model, sample repeatedly from Metropolis-Hastings before proposing a new model
Even model iterations generated by the independence MLE proposal
Odd model iterations generated by the random walk proposal

29
Gibbs Sampling Component

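$P(\gamma_i = 1 \mid \gamma_{(-i)}, \delta, \beta, g, G_m, G_f) = P(\gamma_i = 1 \mid \gamma_{(-i)}, \delta, \beta) = a_1/(a_0 + a_1)$
$a_1 = f(\beta \mid \gamma_i = 1, \gamma_{(-i)}, \delta)\, f(\gamma_{(-i)}, \gamma_i = 1)$
$a_0 = f(\beta \mid \gamma_i = 0, \gamma_{(-i)}, \delta)\, f(\gamma_{(-i)}, \gamma_i = 0)$
$P(\delta_i = 1 \mid \delta_{(-i)}, \gamma, \beta, g, G_m, G_f) = P(\delta_i = 1 \mid \delta_{(-i)}, \gamma, \beta) = b_1/(b_0 + b_1)$
$b_1 = f(\beta \mid \delta_i = 1, \delta_{(-i)}, \gamma)\, f(\delta_{(-i)}, \delta_i = 1)$
$b_0 = f(\beta \mid \delta_i = 0, \delta_{(-i)}, \gamma)\, f(\delta_{(-i)}, \delta_i = 0)$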

30
Metropolis-Hastings Component (by locus)
MH ratio

Two different proposal distributions
MLE independence proposal, conditional on other loci
Random walk symmetric proposal, conditional on other loci
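
The alternation between the two proposals can be illustrated on a toy one-dimensional target. This is only a sketch: the real update is by locus and conditional on the other loci, whereas here a normal density stands in for the posterior and a fixed normal stands in for the MLE-based independence proposal. All names and settings are hypothetical.

```python
import numpy as np


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2), up to an additive constant."""
    return -0.5 * ((x - mean) / sd) ** 2 - np.log(sd)


def hybrid_mh(log_target, mle, mle_sd, rw_sd, n_iter, seed=0):
    """Alternate an independence proposal (even) and a random walk (odd)."""
    rng = np.random.default_rng(seed)
    beta = mle
    chain = np.empty(n_iter)
    for it in range(n_iter):
        if it % 2 == 0:
            # Independence proposal centred at the MLE; the proposal
            # densities do not cancel in the acceptance ratio.
            prop = rng.normal(mle, mle_sd)
            log_ratio = (log_target(prop) - log_target(beta)
                         + log_normal_pdf(beta, mle, mle_sd)
                         - log_normal_pdf(prop, mle, mle_sd))
        else:
            # Symmetric random walk; the proposal densities cancel.
            prop = rng.normal(beta, rw_sd)
            log_ratio = log_target(prop) - log_target(beta)
        if np.log(rng.random()) < log_ratio:
            beta = prop
        chain[it] = beta
    return chain


# Toy target N(1, 0.5^2) standing in for a posterior.
chain = hybrid_mh(lambda b: log_normal_pdf(b, 1.0, 0.5),
                  mle=0.9, mle_sd=0.7, rw_sd=0.3, n_iter=20000)
print(chain.mean(), chain.std())
```

The independence proposal makes large moves when the MLE approximation is good, while the random walk explores locally when it is not.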

31
SSGS Flow Chart
32
Finding Genes

Using a Bayesian model, we simply summarize the posterior in a meaningful way
The MCMC sample is a large sample from our posterior
Thus we can summarize each gene's importance by the marginal posterior probability of inclusion for that gene
Use the median model threshold: marginal posterior inclusion probability > 0.5 for allele $a$ at locus $l$
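
Given the sampled indicator vectors, the summary described above reduces to column means and a threshold. The sketch below assumes the draws are stored as a 0/1 array with one row per MCMC iteration and one column per gene. The array contents are made up for illustration.

```python
import numpy as np


def median_model(indicator_draws, threshold=0.5):
    """Marginal inclusion probabilities and the genes that exceed the threshold."""
    incl = indicator_draws.mean(axis=0)
    return incl, np.flatnonzero(incl > threshold)


# Hypothetical indicator draws: 4 iterations, 3 genes.
draws = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [1, 1, 1],
                  [1, 0, 1]])
incl, selected = median_model(draws)
print(incl, selected)
```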

33
Simulating Data
34
Simulated Data

Used genetic data simulated for Genetic Analysis Workshop 12 (GAW12)
Used chromosome 1 from the isolated population
Microsatellite markers simulated 1 cM apart, with 4-16 alleles
Simulated without influence from selection
Reference: Wijsman, E.M., Almasy, L., Amos, C.I., Borecki, I., Falk, C.T., King, T.M., Martinez, M.M., Meyers, D., Neuman, R., Olson, J.M., Rich, S., Spence, M.A., Thomas, D.C., Vieland, V.J., Witte, J.S., MacCluer, J.W. (2001) Genetic Analysis Workshop 12: Analysis of Complex Genetic Traits: Applications to Asthma and Simulated Data. Genet Epidemiol 21 (Suppl 1): S1-S853

35
Using GAW 12 Data: Model Simulation

Simulate directly from the model
Use the conditional logistic regression function to determine the probability of transmission of the genes
The parents determine the 4 possible children
Treat each child as a category in a multinomial distribution
Calculate the probability of each child using a conditional logistic regression function with specified $\beta$s
Draw 1 sample from the corresponding multinomial distribution to determine the affected genotype for the triad.
Know the right answers for $\beta$
Analyze the data twice
Independent: $R = I$
Dependent: $R$ based on HWE and LD
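
The simulation steps above can be sketched for a single triad as follows. The sketch assumes the four possible children are already coded as rows of a matrix and that each child's probability is proportional to the exponential of its linear predictor. The coding and the effect sizes are hypothetical.

```python
import numpy as np


def simulate_case(Z_children, beta, rng):
    """Draw the affected child for one triad.

    Z_children: (4, p) array, one coded genotype per possible child
    beta:       (p,) vector of specified effects
    Returns the index of the drawn child and the four probabilities.
    """
    eta = Z_children @ beta
    weights = np.exp(eta - eta.max())  # subtract max for numerical stability
    probs = weights / weights.sum()
    draw = rng.multinomial(1, probs)   # one sample from 4 categories
    return int(draw.argmax()), probs


# Hypothetical triad: 4 possible children coded on 3 alleles.
rng = np.random.default_rng(0)
Z_children = np.array([[1, 1, 1],
                       [2, 1, 1],
                       [0, 1, 1],
                       [1, 1, 1]], dtype=float)
beta = np.array([1.0, 0.0, 0.0])
idx, probs = simulate_case(Z_children, beta, rng)
print(idx, probs)
```

Repeating the draw over many parent pairs yields a data set for which the true $\beta$ is known.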

36
Simulation 1: Model Simulation

3 loci with a total of 20 alleles, close together
$A_1 = 4$, $A_2 = 11$, $A_3 = 5$
GAW 12 chromosome 1, loci 9, 11, and 12
Genetic covariance present
Average D for the 3 loci spans from 0.133 to 0.256
90 of ? ?0.005,0.386 median 0.012
True model: g2, g14, g16
True betas: $\beta_2 = 2.74$, $\beta_{14} = 3.63$, $\beta_{16} = 4.39$
$\beta_{-(2,14,16)} = 0$
200,000 iterations

37
Running STATA

Data were collected as triads
STATA needs the pseudo-controls enumerated
Assuming no recombination, construct each Z vector (sum of the haplotypes) of the possible children given the parents
Obtain the MLE and confidence intervals: run clogit on the data stratified by family (only the 4 children are present in each stratum)
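
Enumerating the pseudo-controls amounts to pairing each maternal haplotype with each paternal haplotype and summing the coded vectors. The sketch below assumes no recombination and a 0/1 coding of each haplotype (1 where it carries the non-reference allele at a diallelic locus). The parental haplotypes are made up for illustration.

```python
from itertools import product

import numpy as np


def pseudo_control_Z(mother, father):
    """Z vectors of the 4 possible children (sum of one haplotype per parent).

    mother, father: each a pair of coded haplotype vectors.
    """
    return np.array([np.add(m, f) for m, f in product(mother, father)])


# Hypothetical parents, 3 diallelic loci.
mother = ([1, 0, 1], [0, 0, 1])
father = ([0, 1, 0], [1, 1, 0])
Z = pseudo_control_Z(mother, father)
print(Z)
```

One of the four rows corresponds to the observed case. The other three serve as its matched pseudo-controls within the family stratum.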

38
Preparing for SSGS

Label the haplotypes in the parents as transmitted or non-transmitted
Calculate the MLEs and Fisher's information using STATA to define the proposal distribution for even iterations
Define the initial values for
$\beta$ (= MLE)
$\gamma$ (= 1)
$\delta$ (= 1)

39
Simulation 1: Model Simulation

Independent Prior
$p = q = 0.5$
$\tau = 0.2$, $c = 10$
None of the $\beta$s failed the Heidelberger and Welch test for stationarity
Total models visited: 302

Dependent Prior
$p = q = 0.5$
$\tau = 0.2$, $c = 10$
None of the $\beta$s failed the Heidelberger and Welch test for stationarity
Total models visited: 6046

40
Simulation 1: Suggested Genes
41
Simulation 1: Estimation Intervals
42
Using GAW 12 Data: Disease Simulation

Simulate a disease
Pick alleles at a marker that cause the disease
Simulate disease based on a determined penetrance P(D | genes), sporadic risk P(D | normal), and dominance
Know which alleles should be suggested by SSGS, but not the true $\beta$
Analyze the data twice
Dependent: $R$ based on HWE and LD
Independent: $R = I$

43
Simulation 2: Simulated Disease

3 loci from GAW 12 chromosome 1: Locus 1, $A_1 = 6$; Locus 2, $A_2 = 8$; Locus 8, $A_8 = 4$
Genetic correlation
Average D values span from 0.084 to 0.29
90 of ? ?0.0003,0.259 median 0.005
Penetrances
P(D | L1a3, L1a3) = 0.4
P(D | L8a2, L8a2) = 0.6
P(D | L8a4, L8a4) = 0.4
P(D | L8a2, L8a4) = 0.5
P(D | any other genes) = 0.05
True model: g3, g14, g15
200,000 iterations

44
Simulation 2: Suggested Genes
45
Sensitivity Analysis
48
What We Learned Today

Extending the TDT to a conditional logistic regression model has many advantages
analyze multiple loci
Bayesian setting can incorporate genetic association
and more!
We can find genes using maximum likelihood estimation and inference for the parameters of the CLR model using STATA
We can improve on the MLE estimates by using SSGS with a prior that accounts for genetic association
SSGS has some sensitivity to the prior: a lower prior yields fewer genes

49
References

Barbieri, M.M., and Berger, J.O. (2004) Optimal Predictive Model Selection. Annals of Statistics 32, to appear.
Schaid, D. (1996) General Score Tests for Associations of Genetic Markers with Disease Using Cases and Their Parents. Genetic Epidemiology, pp. 423-449.
Self, S.G., et al. (1991) On Estimating HLA/Disease Association with Applications to a Study of Aplastic Anemia. Biometrics, pp. 53-61.
Thomas, D.C., et al. (1995) Variation in HLA-Associated Risks of Childhood Insulin-Dependent Diabetes in the Finnish Population: II. Haplotype Effects. Genetic Epidemiology, pp. 455-466.
SSGS dissertation: https://epi.mdanderson.org/mswartz/

50
Papers Extending SSVS

Chipman, H. (1996) Bayesian Variable Selection with Related Predictors. The Canadian Journal of Statistics, pp. 17-36.
George, E.I., McCulloch, R.E., and Tsay, R.S. (1996) Two Approaches to Bayesian Model Selection with Applications. In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Arnold Zellner (Eds. D.A. Berry, K.A. Chaloner, and J.K. Geweke). New York: Wiley, pp. 339-348.
Ntzoufras, I., Forster, J.J., and Dellaportas, P. (2000) Stochastic Search Variable Selection for Log-Linear Models. Journal of Statistical Computation and Simulation, pp. 23-37.