Gene Mapping with Bayesian Variable Selection and MCMC

1
Gene Mapping with Bayesian Variable Selection and MCMC
  • Michael Swartz
  • mswartz_at_stat.tamu.edu

2
Outline
  • Intro to Genetics
  • Intro to Gene mapping, Association studies
  • The Conditional logistic regression model for
    Gene mapping
  • Bayesian Model Selection
  • Stochastic Search Variable Selection
  • Stochastic Search Gene Suggestion (SSGS)
  • Performance on Simulated Data
  • SSGS vs the MLE.

3
Intro to Genetics
4
Picture book of Genetics
Gene: a specific coding region of DNA
Chromosomes: line up genes
Locus: a gene's position
Alleles
Haplotype: one
Genotype: both
Molecular marker: a polymorphic locus with a
known position on the chromosome
5
Linkage
  • Violates Mendel's Second Law: genes segregate
    independently
  • Allows us to measure genetic distance
  • Biological source of linkage: meiosis, the
    process of cell division that produces haploid
    gametes.
  • Genes that co-segregate in the recombinant
    gametes are linked.

6
Linkage Disequilibrium
  • Association of alleles in a population

7
Gene Mapping: Association Studies
8
Data: The Case-Parent Triad
Collect haplotype information on the parents (G)
as well as the case (g), so that we have information
about the transmitted and non-transmitted
haplotypes. Model the probability of transmission.
9
Gene Mapping By Association
  • Transmission Disequilibrium Test (TDT)
  • Uses transmitted and non-transmitted alleles in
    case-parent triads to jointly test for linkage
    and linkage disequilibrium
  • Based on McNemar's test for case-control data
  • Tests for association between two loci at a time
  • Log-linear models
  • Also used for case-control data
  • TDT triads can be modeled with conditional
    logistic regression for case-control data (Self
    et al., 1991; Thomas et al., 1995)
  • Extends the TDT to multiple loci
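
The McNemar-type statistic behind the TDT can be sketched in a few lines. This is a minimal illustration of the standard single-locus form, (b - c)^2 / (b + c), and not code from the presentation; the function name and the counts are hypothetical.

```python
def tdt_statistic(b, c):
    """McNemar-type TDT chi-square statistic (1 degree of freedom).

    b: heterozygous parents who transmitted the candidate allele
    c: heterozygous parents who transmitted the other allele
    Homozygous parents are uninformative and are not counted.
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


# Hypothetical counts, for illustration only.
stat = tdt_statistic(b=60, c=40)
print(stat)
```

Under the null hypothesis of no linkage or no association, the statistic is compared with a chi-square distribution on one degree of freedom.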

10
Advantages to a log-linear model
  • Using a Bayesian model we can incorporate
    genetic association between the markers.
  • Easy to analyze multiple loci
  • Easy to consider Gene X Gene interactions
  • Easy to consider haplotypes
  • Easy to consider environmental effects
  • Easy to consider Gene X Environment effects

12
Coding the Triads (Thomas et al., 1995; Schaid,
1996)
  • Example: 3 diallelic loci.
  • Recall $g_{ip}$ and $G^{T}_{ip}$ from the case-parent triad.
  • For the logistic regression model we use
    $Z_i = g_{im} + g_{if}$.
  • This is known as the GTDT coding scheme (Schaid, 1996)
  • Using haplotypes in conditional logistic
    regression is one way to examine complex diseases
    using triads

13
Sampling Distribution for Triads
14
The Sampling Distribution: a Conditional
Logistic Function (Thomas et al., 1995; Self et
al., 1991)
where G is the set of all possible transmitted
genotypes given the parents' genotypes
(pseudo-controls)
and
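
For orientation, the conditional logistic likelihood for triads is usually written in the form below. This is a sketch of the standard expression and not necessarily the slide's exact notation: $Z(g^*)$ (the covariate coding of a pseudo-control genotype $g^*$) and $\mathcal{G}_i$ (the set of genotypes family $i$'s parents can transmit) are assumed symbols.

```latex
P(g_i \mid G_i, \beta)
  = \frac{\exp\!\left(Z_i^{\top}\beta\right)}
         {\sum_{g^{*} \in \mathcal{G}_i} \exp\!\left(Z(g^{*})^{\top}\beta\right)},
\qquad
L(\beta) = \prod_{i=1}^{N} P(g_i \mid G_i, \beta)
```

Each family contributes one stratum: the observed case genotype is compared only with the genotypes that its own parents could have transmitted.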
15
Identifiability for Conditional Logistic
Regression Parameters
  • Gene mapping with Conditional Logistic Regression
    (CLR) uses categorical covariates (genotype or
    haplotype)
  • For identifiability, we must define a reference
    category for each locus
  • Choose the most prevalent allele at each locus as
    its reference allele.

16
Calculating Prevalence from Triads (Thomas et al., 1995)
  • Let $C_{la}$ denote the number of haplotypes in the
    case that carry allele a at locus l.
  • Likewise, let $P_{la}$ denote the number of haplotypes
    in the parents that carry allele a at locus l.
  • If N denotes the total number of triads, then the
    prevalence of allele a at locus l can be
    calculated by $(P_{la} - C_{la})/2N$
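
A small sketch of the prevalence calculation follows. The operator between the two counts did not survive in the transcript; the code assumes a subtraction, which gives the allele frequency among the 2N non-transmitted parental haplotypes and makes the frequencies at a locus sum to 1. The function name and the toy data are hypothetical.

```python
from collections import Counter


def allele_prevalence(parent_haplotypes, case_haplotypes, n_triads):
    """Estimate allele prevalence at one locus as (P_la - C_la) / (2N).

    Assumption: the counts are combined by subtraction, i.e. the
    estimate is the frequency among non-transmitted haplotypes.
    """
    p_counts = Counter(parent_haplotypes)  # P_la over the 4N parental haplotypes
    c_counts = Counter(case_haplotypes)    # C_la over the 2N case haplotypes
    return {a: (p_counts[a] - c_counts[a]) / (2 * n_triads) for a in p_counts}


# Two hypothetical triads at one locus with alleles "A" and "B".
parents = ["A", "B", "A", "A",   # triad 1: mother (A, B), father (A, A)
           "B", "B", "A", "B"]   # triad 2: mother (B, B), father (A, B)
cases = ["A", "A",               # triad 1 case
         "B", "A"]               # triad 2 case
prev = allele_prevalence(parents, cases, n_triads=2)
reference = max(prev, key=prev.get)  # most prevalent allele becomes the reference
```

Picking the most prevalent allele as the reference mirrors the identifiability rule stated on the previous slide.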

17
Using CLR to infer genes
  • Frequentist
  • Make inference on the maximum likelihood
    estimates of the $\beta$ parameters in the CLR model.
  • Requires numerical optimization
  • Prepackaged in the STATA clogit command.
  • Bayesian
  • Calculate Posterior Distribution and make
    inference from the appropriate summaries
  • Requires Markov Chain Monte Carlo posterior
    simulation
  • Implemented in Stochastic Search Gene Suggestion
    (SSGS)

18
Bayesian Model Selection
19
Hierarchical Bayesian setup for Variable Selection
  • Use a Hierarchical Bayesian method
  • $\gamma$ is an indicator vector of the variables, and
    $\beta(\gamma)$ is the vector of coefficients for model $\gamma$.
  • Make inferences from the variable posterior

20
Advantages to Bayesian Hierarchical Modeling
  • Account for prior information
  • Allow for Bayesian Variable Selection Techniques
  • Make inference from model posterior
  • No multiple-testing correction, because inference
    is stated directly as posterior probabilities

21
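Stochastic Search Variable Selection (George and
McCulloch, 1993)
  • Linear regression: introduce a latent variable to
    indicate each covariate's importance.
  • Hierarchy allows prior information to enter the
    model and be updated by the data
  • Likelihood: $Y \mid \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I)$
  • Model prior: $\gamma \sim \text{Binomial}(p)$
  • Parameter priors:
  • $\beta \mid \gamma \sim N_p(0, D_\gamma R D_\gamma)$
  • $\sigma^2 \mid \gamma \sim IG(\nu_\gamma/2, \nu_\gamma\lambda_\gamma/2) \iff \nu_\gamma\lambda_\gamma/\sigma^2 \sim \chi^2_{\nu_\gamma}$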
22
Stochastic Search Variable Selection (Continued)
  • Full conditionals for $\beta$, $\gamma$, and $\sigma^2$ are recognizable
    $\Rightarrow$ Gibbs sampling
  • Generalized to various GLMs (George, McCulloch,
    and Tsay, 1996; Ntzoufras, Forster, and
    Dellaportas, 2000; and a few others).

23
Stochastic Search Gene Suggestion
  • Extends Stochastic Search Variable Selection
    (George and McCulloch, 1993)
  • Introduces two latent variables to indicate a
    gene's importance in the model: one for loci and
    one for alleles.
  • Induces a hierarchy that allows prior information
    about genes to enter the model
  • Genetic structure
  • Genetic correlation
  • The hierarchical nature allows the data to update
    the probability of including a particular gene

24
Priors for Gene Suggestion
  • Use two priors for gene suggestion
  • One indicator vector for locus selection,
    $\gamma = (\gamma_1, \ldots, \gamma_L)$,

where $p_l = P(\text{locus } l \text{ is associated with the
disease})$
  • One indicator vector for allele selection given
    each locus, $\delta$. Each element $\delta_{la}$ pertains to a
    particular allele a at locus l.

where $q_{la} = P(\text{allele } a \text{ at locus } l \text{ causes disease})$
25
Prior for allele main effects $\pi(\beta \mid \gamma, \delta)$: allelic
dependence in model selection
  • Prior for main effects models the genetic
    dependencies between loci and alleles

where
with each $k_{la}$ defined as
26
How SSGS works
  • Exploits the MVN covariance matrix $D_\gamma R D_\gamma$ (George
    and McCulloch, 1993)
  • If $\gamma = 0$, then $\tau_{la}$ focuses the probability of $\beta_{la}$
    around 0
  • If $\gamma = 1$, then $\tau_{la} c_{la}$ expands the probability of
    $\beta_{la}$ to cover reasonable values
  • Automatic methods for choosing $\tau$ and $c$ are given in the paper
  • Subjectively:
  • choose $\tau_{la}$ such that $-3\tau_{la} < \beta_{la} < 3\tau_{la}$ implies
    $\beta_{la} \approx 0$
  • choose $c_{la}$ such that $3\tau_{la} c_{la}$ covers reasonable
    values for $\beta_{la}$
  • Model information is contained in $P(\gamma \mid \text{Data})$
  • $R$ based on linkage disequilibrium can be helpful
    for gene mapping
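
The subjective rule above can be turned into a two-line calculation. This is a sketch under the assumption that the analyst supplies two thresholds, neither of which appears in the slides: the largest effect size still regarded as negligible, and the largest effect size regarded as plausible.

```python
def choose_tau_and_c(negligible, plausible_max):
    """Pick tau and c so that 3*tau = negligible and 3*tau*c = plausible_max.

    negligible:    |beta| below this is treated as "effectively zero"
    plausible_max: largest |beta| considered reasonable for a real effect
    Both thresholds are hypothetical inputs chosen by the analyst.
    """
    if not 0 < negligible < plausible_max:
        raise ValueError("need 0 < negligible < plausible_max")
    tau = negligible / 3.0
    c = plausible_max / negligible
    return tau, c


# Thresholds chosen here so that the result is close to tau = 0.2, c = 10.
tau, c = choose_tau_and_c(negligible=0.6, plausible_max=6.0)
```

Because the slab has standard deviation $c\tau$, three slab standard deviations reach exactly the largest plausible effect.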

27
The Prior Covariance Matrix
  • Define the diagonal blocks $L_{l_i l_i}$ using the
    covariance of a multinomial distribution based on
    allele frequencies, assuming they are constant
    across generations.
  • Determine the off-diagonal blocks $L_{l_i l_j}$, $i \neq j$,
    using the allelic disequilibrium between the
    alleles at locus i and locus j.
  • Define $R = L^{-1}$
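
A diagonal block of this kind can be sketched from allele frequencies alone. The code below uses the covariance of a single multinomial draw, diag(p) - p p^T; whether the presentation scales the block differently is not recoverable from the transcript, so treat the scaling as an assumption. The frequencies are hypothetical.

```python
def multinomial_block(freqs):
    """Covariance of one multinomial draw: diag(p) - p p^T."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    k = len(freqs)
    return [[freqs[i] * (1 - freqs[i]) if i == j else -freqs[i] * freqs[j]
             for j in range(k)]
            for i in range(k)]


# Hypothetical locus with three alleles.
block = multinomial_block([0.5, 0.3, 0.2])
```

Each row of the full block sums to zero, so the block is singular. Dropping the row and column of the reference allele is one way to obtain an invertible matrix; that step is an assumption here, not something the slides state.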

28
Sampling from the Posterior
  • No full conditional for updating $\beta$
  • Use a hybrid of Gibbs sampling and the Metropolis-Hastings
    algorithm to construct a Markov chain.
  • Full conditionals for updating $\gamma$ and $\delta$
  • Metropolis-Hastings acceptance ratio for updating
    $\beta$ by locus
  • For a given model, sample repeatedly from
    Metropolis-Hastings before proposing a new model
  • Even model iterations are generated by an independence
    MLE proposal
  • Odd model iterations are generated by a random walk
    proposal

29
Gibbs Sampling Component
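  • $P(\gamma_i = 1 \mid \gamma_{(-i)}, \delta, \beta, g, G_m, G_f) = P(\gamma_i = 1 \mid \gamma_{(-i)}, \beta) = a_1/(a_0 + a_1)$
  • $a_1 = f(\beta \mid \gamma_i = 1, \gamma_{(-i)}, \delta)\, f(\gamma_{(-i)}, \gamma_i = 1)$
  • $a_0 = f(\beta \mid \gamma_i = 0, \gamma_{(-i)}, \delta)\, f(\gamma_{(-i)}, \gamma_i = 0)$
  • $P(\delta_i = 1 \mid \delta_{(-i)}, \gamma, \beta, g, G_m, G_f) = P(\delta_i = 1 \mid \delta_{(-i)}, \beta) = b_1/(b_0 + b_1)$
  • $b_1 = f(\beta \mid \delta_i = 1, \delta_{(-i)}, \gamma)\, f(\delta_{(-i)}, \delta_i = 1)$
  • $b_0 = f(\beta \mid \delta_i = 0, \delta_{(-i)}, \gamma)\, f(\delta_{(-i)}, \delta_i = 0)$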
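
The ratio a1 / (a0 + a1) is easiest to see in the simplest setting. The sketch below assumes a single indicator, an independent prior (R = I), and one coefficient, so the joint densities reduce to a univariate normal density times a prior weight. It is an illustration of the update, not the SSGS sampler itself, and all names are hypothetical.

```python
import math
import random


def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def prob_include(beta_i, tau, c, p):
    """P(indicator = 1 | beta_i) = a1 / (a0 + a1) under an independent prior."""
    a1 = normal_pdf(beta_i, c * tau) * p      # slab density times prior weight
    a0 = normal_pdf(beta_i, tau) * (1 - p)    # spike density times prior weight
    return a1 / (a0 + a1)


def gibbs_update(beta_i, tau, c, p, rng=random):
    """Draw the indicator from its full conditional."""
    return 1 if rng.random() < prob_include(beta_i, tau, c, p) else 0


# A coefficient at zero favours exclusion; a large one forces inclusion.
near_zero = prob_include(0.0, tau=0.2, c=10, p=0.5)
large = prob_include(3.0, tau=0.2, c=10, p=0.5)
```

With a dependent prior the two densities become multivariate normal densities evaluated at the whole coefficient vector, but the ratio has the same shape.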

30
Metropolis-Hastings Component (by locus)
MH Ratio
  • Two different proposal Distributions
  • MLE independence proposal conditional on other
    loci
  • Random Walk symmetric proposal conditional on
    other loci
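
A generic Metropolis-Hastings step that accepts either kind of proposal can be sketched as below. The target density, the proposal centre and the proposal spreads are hypothetical stand-ins: in SSGS the target would be the conditional posterior of one locus's coefficients and the independence proposal would be centred at the MLE.

```python
import math
import random


def mh_step(current, log_target, propose, log_q=None, rng=random):
    """One Metropolis-Hastings step.

    log_q(x_to, x_from) is the log proposal density. Pass None for a
    symmetric proposal (random walk), where the proposal terms cancel.
    """
    candidate = propose(current)
    log_ratio = log_target(candidate) - log_target(current)
    if log_q is not None:
        log_ratio += log_q(current, candidate) - log_q(candidate, current)
    accept_prob = math.exp(min(0.0, log_ratio))
    if rng.random() < accept_prob:
        return candidate, True
    return current, False


def log_target(x):
    """Hypothetical target: N(1, 1) up to an additive constant."""
    return -0.5 * (x - 1.0) ** 2


rng = random.Random(1)
centre, spread = 1.0, 1.5   # stand-ins for the MLE and its standard error
x, accepted = 0.0, 0
for it in range(2000):
    if it % 2 == 0:   # even iterations: independence proposal
        x, ok = mh_step(
            x, log_target,
            propose=lambda _cur: rng.gauss(centre, spread),
            log_q=lambda to, frm: -0.5 * ((to - centre) / spread) ** 2,
            rng=rng)
    else:             # odd iterations: symmetric random walk
        x, ok = mh_step(x, log_target,
                        propose=lambda cur: cur + rng.gauss(0.0, 0.5),
                        rng=rng)
    accepted += ok
```

Alternating the two proposals lets the independence step make large moves towards the MLE while the random walk explores locally.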

31
SSGS Flow Chart
32
Finding Genes
  • Using a Bayesian Model, we simply summarize the
    posterior in a meaningful way
  • The MCMC sample is a large sample from our
    posterior
  • Thus we can summarize each gene's importance by using
    the marginal posterior probability of inclusion
    for each gene
  • Use the median model threshold: $P(la) > 0.5$
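
Summarising the MCMC output in this way takes only a few lines. The sketch assumes the sampled indicators are stored as one 0/1 vector per retained iteration; the draws shown are hypothetical.

```python
def inclusion_probabilities(samples):
    """Marginal posterior inclusion probability for each indicator."""
    n = len(samples)
    k = len(samples[0])
    return [sum(s[j] for s in samples) / n for j in range(k)]


def median_model(samples, threshold=0.5):
    """Indices whose inclusion probability exceeds the threshold."""
    probs = inclusion_probabilities(samples)
    return [j for j, pr in enumerate(probs) if pr > threshold]


# Four hypothetical MCMC draws of a three-element indicator vector.
draws = [[1, 0, 1],
         [1, 0, 0],
         [1, 1, 1],
         [1, 0, 1]]
probs = inclusion_probabilities(draws)
suggested = median_model(draws)
```

Here the first and third indicators pass the 0.5 threshold and would be the suggested genes.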

33
Simulating Data
34
Simulated Data
  • Used genetic data simulated for Genetic Analysis
    Workshop 12 (GAW12)
  • Used Chromosome 1 from isolated population
  • Microsatellite markers simulated 1 cM apart, with
    4-16 alleles
  • Simulated without influence from selection
  • Reference: Wijsman, E.M., Almasy, L., Amos, C.I.,
    Borecki, I., Falk, C.T., King, T.M., Martinez, M.M.,
    Meyers, D., Neuman, R., Olson, J.M., Rich,
    S., Spence, M.A., Thomas, D.C., Vieland, V.J.,
    Witte, J.S., MacCluer, J.W. (2001) Genetic
    Analysis Workshop 12: Analysis of Complex Genetic
    Traits: Applications to Asthma and Simulated
    Data. Genet Epidemiol 21(Suppl 1): S1-S853

35
Using GAW 12 Data: Model Simulation
  • Simulate directly from model
  • Use the conditional logistic regression function
    to determine probability of transmission of the
    genes
  • The parents determine the 4 possible children
  • Treat each child as a category in a multinomial
    distribution
  • Calculate the probability of each child using a
    conditional logistic regression function with
    specified $\beta$s
  • Draw 1 sample from the corresponding multinomial
    distribution to determine the affected genotype
    for the triad.
  • Know the right answers for $\beta$
  • Analyze the data twice
  • Independent: $R = I$
  • Dependent: $R$ based on HWE + LD
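
The simulation steps above can be sketched as follows. The sketch assumes each haplotype is coded as a tuple of 0/1 indicators for the non-reference alleles, that Z is the sum of the two transmitted haplotype codings, and that there is no recombination. The coding and all names are illustrative assumptions.

```python
import math
import random
from itertools import product


def possible_children(mother, father):
    """The 4 children formed by one haplotype from each parent."""
    return list(product(mother, father))


def transmission_probs(children, beta):
    """Conditional logistic probability of each possible child."""
    scores = [sum(b * (m_k + f_k) for b, m_k, f_k in zip(beta, m, f))
              for m, f in children]
    top = max(scores)                       # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_case(mother, father, beta, rng=random):
    """Draw the affected child's genotype from the multinomial over 4 children."""
    children = possible_children(mother, father)
    probs = transmission_probs(children, beta)
    return rng.choices(children, weights=probs, k=1)[0]


# One locus, one non-reference allele, hypothetical effect beta = log(3).
mother = ((1,), (0,))   # carries the allele on one haplotype
father = ((0,), (0,))   # carries only the reference allele
kids = possible_children(mother, father)
probs = transmission_probs(kids, [math.log(3)])
case = simulate_case(mother, father, [math.log(3)], random.Random(0))
```

In this toy family the two children who inherit the risk haplotype are each three times as likely to be the case as the two who do not.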

36
Simulation 1: Model Simulation
  • 3 loci with a total of 20 alleles, close together
  • $A_1 = 4$, $A_2 = 11$, $A_3 = 5$
  • GAW 12 Chromosome 1, Loci 9, 11, and 12
  • Genetic covariance present
  • Average D for the 3 loci spans from 0.133 to 0.256
  • 90 of ? ?0.005,0.386 median 0.012
  • True model: g2, g14, g16
  • True betas: $\beta_2 = 2.74$, $\beta_{14} = 3.63$, $\beta_{16} = 4.39$,
    $\beta_{-(2,14,16)} = 0$
  • 200,000 iterations

37
Running STATA
  • Data were collected as triads
  • STATA needs pseudo-controls enumerated
  • Assuming no recombination, construct each Z
    vector (sum of the haplotypes) of the possible
    children given the parents
  • Obtain MLEs and confidence intervals: run clogit
    on the data stratified by family (only the 4
    possible children are present in each stratum)

38
Preparing for SSGS
  • Label the haplotypes in the parents as
    transmitted or non transmitted
  • Calculate the MLEs and Fisher's information
    using STATA to define the proposal distribution
    for even iterations
  • Define the initial values for
  • $\beta$ (= MLE)
  • $\gamma$ (= 1)
  • $\delta$ (= 1)

39
Simulation 1: Model Simulation
  • Independent prior
  • $p = q = 0.5$
  • $\tau = 0.2$, $c = 10$
  • None of the $\beta$s failed the Heidelberger and Welch
    test for stationarity
  • Total models visited: 302
  • Dependent prior
  • $p = q = 0.5$
  • $\tau = 0.2$, $c = 10$
  • None of the $\beta$s failed the Heidelberger and Welch
    test for stationarity
  • Total models visited: 6046

40
Simulation 1: Suggested Genes
41
Simulation 1: Estimation Intervals
42
Using GAW 12 Data: Disease Simulation
  • Simulate a disease
  • Pick alleles at a marker that cause the disease
  • Simulate disease based on a determined
    penetrance $P(D \mid \text{genes})$, sporadic risk
    $P(D \mid \text{normal})$, and dominance
  • Know which alleles should be suggested by SSGS,
    but not the true $\beta$
  • Analyze the data twice
  • Dependent: $R$ based on HWE + LD
  • Independent: $R = I$

43
Simulation 2: Simulated Disease
  • 3 loci from GAW 12 chromosome 1: Locus 1, $A_1 = 6$;
    Locus 2, $A_2 = 8$; Locus 8, $A_8 = 4$
  • Genetic correlation
  • Average D values span from 0.084 to 0.29
  • 90 of ? ?0.0003,0.259 median 0.005
  • Penetrances
  • $P(D \mid L1a3, L1a3) = 0.4$
  • $P(D \mid L8a2, L8a2) = 0.6$
  • $P(D \mid L8a4, L8a4) = 0.4$
  • $P(D \mid L8a2, L8a4) = 0.5$
  • $P(D \mid \text{any other genes}) = 0.05$
  • True model: g3, g14, g15
  • 200,000 iterations

44
Simulation 2: Suggested Genes
45
Sensitivity Analysis
48
What We Learned Today
  • Extending the TDT to a conditional logistic
    regression model has many advantages
  • analyze multiple loci
  • Bayesian setting can incorporate genetic
    association
  • and more!
  • We can find genes using maximum likelihood
    estimation and inference for the parameters of
    the CLR model using STATA
  • We can improve on the MLE estimates by using SSGS
    with a prior that accounts for genetic
    association
  • SSGS has some sensitivity to the prior: a lower
    prior inclusion probability suggests fewer genes

49
References
  • Barbieri, M.M., and Berger, J.O. (2004) Optimal
    Predictive Model Selection. Annals of Statistics
    32, to appear.
  • Schaid, D. (1996) General Score Tests for
    Associations of Genetic Markers with Disease
    Using Cases and Their Parents. Genetic
    Epidemiology, pp. 423-449.
  • Self, S.G., et al. (1991) On Estimating
    HLA/Disease Association with Applications to a
    Study of Aplastic Anemia. Biometrics, pp. 53-61.
  • Thomas, D.C., et al. (1995) Variation in
    HLA-Associated Risks of Childhood Insulin
    Dependent Diabetes in the Finnish Population: II.
    Haplotype Effects. Genetic Epidemiology, pp.
    455-466.
  • SSGS dissertation: https://epi.mdanderson.org/mswartz/

50
Papers Extending SSVS
  • Chipman, H. (1996) Bayesian Variable Selection
    with Related Predictors. The Canadian Journal
    of Statistics, pp. 17-36.
  • George, E.I., McCulloch, R.E., and Tsay, R.S.
    (1996) Two Approaches to Bayesian Model
    Selection with Applications. In Bayesian Analysis
    in Econometrics and Statistics: Essays in Honor of
    Arnold Zellner (Eds. D.A. Berry, K.A. Chaloner,
    and J.K. Geweke). New York: Wiley, pp. 339-348.
  • Ntzoufras, I., Forster, J.J., and Dellaportas, P.
    (2000) Stochastic Search Variable Selection for
    Log-Linear Models. Journal of Statistical
    Computation and Simulation, pp. 23-37.