QTL mapping in mice, completed. - PowerPoint PPT Presentation

1 / 32
About This Presentation
Title:

QTL mapping in mice, completed.

Description:

n backcross mice; M markers. xij = genotype (0/1) of mouse i at marker j ... Mouse i has phenotype yi and marker data mi. ... and phenotypes for all mice ... – PowerPoint PPT presentation

Number of Views:60
Avg rating:3.0/5.0
Slides: 33
Provided by: WEHI1
Category:

less

Transcript and Presenter's Notes

Title: QTL mapping in mice, completed.


1
QTL mapping in mice, completed.
  • Lecture 12, Statistics 246
  • March 2, 2004

2
REVIEW
  • n backcross mice M markers in all, with at most
    a handful expected to be near QTL
  • xij genotype (0/1) of mouse i at marker j
  • yi phenotype (trait value) of mouse i
  • Yi ? ?j1M ?jxij ?j
    Which ?j ? 0 ?
  • ?Variable selection in linear models (regression)

3
Variable selection in linear models
a few generalities
  • There is a huge literature on the selection
    of variables in linear models. Why? First, if we
    leave out variables which should be in our model,
    we risk biasing the parameter estimates of those
    we include, and our error has systematic
    components, reducing our ability to identify
    those which should be there. Second, when we
    includes variables in a model which do not need
    to be there, we reduce the precision of the
    estimates of the ones which are there.
  • We need to include all the relevant (perhaps
    important is a better word), and no irrelevant
    variables in our model. This is an instance of
    classic statistical trade-off of bias and
    variance too few variables, higher bias too
    many, higher variance. The literature contains a
    host of methods of addressing this problem, known
    as model or variable selection methods. Some are
    general, i.e. not just for linear models, e.g.
    the Akaike information criterion (AIC), the Bayes
    information criterion (BIC), or cross-validation
    (CV). Others are specifically designed for linear
    models, e.g. ridge regression, Mallows Cp, and
    the lasso. We dont have the time to offer a
    thorough review of this topic, but well comment
    on and compare a few approaches.

4
Special features of our problem
  • We have a known joint distribution of the
    covariates (marker genotypes), and these are
    (approximately) Markovian
  • We may have lots of missing covariate
    information, but it can be dealt with.that is,
    we can effectively use all our data
  • Our aim is to identify the key players (markers),
    not to minimize prediction error or find the
    true model.
  • Conclusion Our problem is not a generic linear
    model selection problem, and this offers the hope
    that we can do better than we might in general.
  • END REVIEW

5
Finding QTL as model selection(the loci are part
of the model)
  • Select class of models
  • Additive models
  • Additive plus pairwise interactions
  • Regression trees
  • Compare models (?)
  • BIC?(?) logRSS(?) ?(?log n/n)
  • Sequential permutation tests
  • Search model space
  • Forward selection (FS)
  • Backward elimination (BE)
  • FS followed by BE
  • MCMC
  • Assess performance
  • Maximize no QTL found
  • control false positive rate

6
Model class additive (linear) models
  • n backcross mice M markers
  • xij genotype (0/1) of mouse i at marker j
  • yi phenotype (trait value) of mouse i
  • Yi ? ?j1M ?jxij ?j
  • The model is characterized by its variable subset
    ? j ?j ? 0.
  • We could also include some pairwise interactions
    of variables (loci) in the model.

7
Another model class regression trees
Tree model here ? would include splits on
genotypes at specified loci (q1 etc) and
average QT values at tips.
Regression trees are good for capturing
nonlinearities.
8
Model comparison criteria
  • Within a model class, we want to compare models.
    One obvious criterion is the residual sum of
    squares, RSS(?).
  • Exercise. Prove that for a normal linear model,
    with unknown regression coefficients and error
    variance, the maximized log-likelihood lmax is
    essentially -(n/2)logRSS(?).
  • However, we know that including more variables
    reduces the RSS, without necessarily improving
    the quality of our model we need a penalty for
    model size or model complexity. There are many
    such, and we choose to minimize criteria of the
    form
  • ?(?) logRSS(?) ?n-1D(n)
  • for suitable D(n). The case D(n) n is called
    AIC, while D(n) logn is BIC, and D(n) loglogn
    is due to Hannan and Quinn.
  • Under certain assumptions, the last two are
    consistent estimators of a true model, while AIC
    tends to overfit under the same assumptions. It
    has different optimality properties. For finding
    QTL, we prefer BIC, but use D(n) ?logn, which
    we call BIC?, with ? gt 1.

9
Model comparison a Bayesian approach
  • We could place prior probabilities on each of
    the possible models, as well as the model
    parameters, and use Bayes theorem to calculate
    the posterior probability of the models given the
    data. It would make good sense (for a
    non-Bayesian!) to choose the model with the
    maximum posterior probability.
  • Assume y ?, ?? ,?2 N(X? ?? , ?2 ), and
    use the priors ?? N(0, c?2(X? X? )-1 ), and
    p(?2 ?) ? 1/?2 , while for the model ? itself,
    assume p(?) ? (c/d)?/2 . Now let c ?? (a
    diffuse improper prior), and integrate out ?? and
    ?2. The resulting posterior distribution for ?
    satisfies
  • -(2/n) log p(?y)
    logRSS(?) ?n-1 logd.
  • This is just our general criterion with D(n)
    d, which provides further support for it.
    However, the only real justification for it is
    its performance.
  • Exercise. Derive the expression involving p(?y)
    just given.

10
Sequential permutation tests
  • Lets begin by seeing how permutation tests
    are applied in this context, an idea first put
    forward by Churchill Doerge.
  • Mouse i has phenotype yi and marker data mi.
    To test the null hypothesis of no association
    between phenotype and marker data, we can simply
    de-couple the two create many sets of
    pseudo-data by choosing a random permutation ? of
    the labels 1, .., n, and forming pairs (yi ,
    m?(i)), i 1,n.
  • To assign p-values to LOD scores, say, we
    would compare the value for the observed data to
    the set of values obtained for (say) 1,000 sets
    of pseudo-data obtained in this way. As always,
    the question of significance levels arises, as
    does the issue of multiple testing. At least for
    the genome-wide calculation of LOD scores,
    permutation testing gives a neat solution simply
    repeat on the pseudo-data whatever multiple
    testing one does with the actual data (e.g.
    taking the genome wide maximum LOD).
  • For model selection, permutation tests are
    carried out to assess the significance associate
    with adding or dropping a variable. Fixed
    significance levels are typically used, e.g.
    0.05, with no corrections for multiple testing.

11
Some methods for model search
  • Forward selection (FS) selects the variable not
    in the model which provides the greatest
    reduction in the penalty function (with a rule
    for making no change). Rejecting at a fixed
    significance level is usually adopted. Called
    greedy by computer scientists.
  • Backwards elimination (BE) eliminates the
    variable whose removal makes the smallest
    increase in the penalty function (with a rule for
    making no change). Again, the use of a fixed
    significance level is standard.
  • FS followed by BE deliberately overfit, then
    prune back. We need suitable parameters to
    implement this approach.
  • Markov chain Monte Carlo (MCMC) provides a way
    of moving around the class of all models,
    selecting new or eliminating existing variables
    according to a stochastic rule. Eventually, we
    can choose as the preferred model, the one which
    minimizes the criterion function among all those
    models visited. A Bayesian would compute
    empirical posterior probabilities using the
    series of models visited.

12
We like BIC? why?
  • For a fixed number of markers, letting n ??,
    BIC? is consistent
  • There exists a prior (on models coefficients)
    for which BIC? is the -log posterior
  • BIC? is essentially equivalent to use of a
    threshold on the conditional LOD score
  • It performs well

13
Choice of ?
  • Larger ? include more loci, higher false
    positive rate
  • Smaller ? include fewer loci lower false
    positive rate
  • Let L 95 genome-wide LOD threshold
  • (comparing single QTL models to the null
    model)
  • Choose ? 2 L/log10n .
  • With this choice of ?, in the absence of QTL,
    well include at least one extraneous locus, 5
    of the time.

14
Simulations comparing strategies
  • Backcross with n 250
  • No crossover interference
  • 9 chromosomes, each 100cM
  • Markers at 10cM spacing complete genotype
    data
  • 7 QTL, equal effects 0.76
  • -One pair in coupling (same sign)
  • -One pair in repulsion (opp. sign)
  • -Three unlinked QTL
  • Heritability 50 (env var 1.0)
  • 2,000 simulated replicates

15
Methods compared
  • ANOVA at marker loci
  • Composite interval mapping (CIM)
  • Forward selection with permutation tests
  • Forward selection with BIC?
  • Backward elimination with BIC?
  • FS followed by BE with BIC?
  • MCMC with BIC?

16
Methods, some detail
  • ANOVA at marker loci essentially 2 sample
    t-tests, with marker genotype defined groups,
    used simulated genome-wide threshold
  • Composite interval mapping (CIM) described on
    next slide
  • Forward selection with permutation tests (1,000
    perms, ?0.05)
  • Forward selection to 25 markers, with BIC?
  • Backward elimination, starting from the full
    model, with BIC?
  • FS to 25 markers, followed by BE, with BIC?
  • MCMC for 1,000 moves with BIC?
  • More details and results for other sample sizes
    can be found in Broman Speed, JRSSB, 2002.

17
Composite interval mapping
  • Here an initial subset S of markers is
    chosen to control for background variation.
  • Then, as in conventional interval mapping, a
    genome-wide scan is carried out, at each locus
    testing the hypothesis that a QTL at that locus
    need not be added to the existing model, against
    the alternative that it should be added.
    Markers in S in close proximity to the current
    locus (e.g. 10cM) are excluded.
  • As in standard interval mapping, LOD scores
    are computed, and the result plotted. Regions
    where the LOD exceeds a suitable genome-wide
    threshold C are thought to contain QTL. Here we
    simulated to get C.
  • A key question is how many markers should
    there be in the initial subset S, and how should
    they be chosen. There is no sharp answer to this
    question. We used 3, 5, 7, 9, and 11.

18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
Conclusions from the simulation
  • The BIC? methods performed pretty well, best with
    MCMC, though only slightly better than FS. FS
    with permutation tests had fewer extraneous than
    FS with BIC.
  • The CIM ones didnt do so well, except when the
    of variables was 7, though their rate of
    extraneous loci was quite low.
  • ANOVA was quite poor, in relative terms.

25
Selective genotyping
  • The idea here is to genotype only animlas in
    the high and low extremes of the phenotypic
    distribution, but including the phenotypes
    (without genotypes) of the remainder in the
    analysis. This involves some loss of power,
    though not much if the not genotyped isnt
    large. Two issues why would we include
    phenotypes of animals not genotyped, and how
    would we do that.
  • To make the point, we compare
  • Strategy A Analyse all genotypes and all
    phenotypes
  • Strategy B Analyse all phenotypes, but
    include genotypes only for individuals in the
    extremes of the phenotypic distribution
  • Strategy C Analyse genotypes and phenotypes
    only for individuals in the extremes of the
    phenotypic distribution
  • Exercise. Write down the likelihoods in all
    three cases.

26
Selective genotyping in a BC
27
Some simulation results
  • The slides that follow are for a hypothetical F2
    intercross with 150 individual mice, a set of 150
    markers quite similar to those of our NOD B6
    cross, comparing 3 analysis strategies
  • A analyse both genotypes and phenotypes for all
    mice
  • B analyse phenotypes for all mice, but only
    genotypes for the phenotypically extreme mice,
    correctly incorporating the other genotypes as
    missing
  • C analyse genotypes and phenotypes for
    phenotypically extreme mice only, and not saying
    so.

28
Type I error for given genome-wide LOD
thresholds
Conclusion strategy C needs a different
genome-wide threshold.
29
Power for a given QTL effect ?
Solid strategy A Dashed strategy B, with 50
genotyped
30
Power for a given proportion not genotyped
Proportion not genotyped
Solid strategy A, Dashed strategy B with ?
0.25.
31
Summary
  • QTL mapping is a model selection problem
  • Key issue the comparison of models
  • Large scale simulations are important
  • More refined procedures do not necessarily give
    improved results
  • BIC? with forward selection followed by backward
    elimination works quite well
  • Real savings to be made by genotyping only
    extremes phenotypically

32
Acknowledgements
  • Karl Broman, Johns Hopkins
  • Nusrat Rabbee, UCB
Write a Comment
User Comments (0)
About PowerShow.com