Title: QTL mapping in mice, completed.
1QTL mapping in mice, completed.
- Lecture 12, Statistics 246
- March 2, 2004
2REVIEW
- n backcross mice M markers in all, with at most
a handful expected to be near QTL - xij genotype (0/1) of mouse i at marker j
- yi phenotype (trait value) of mouse i
- Yi ? ?j1M ?jxij ?j
Which ?j ? 0 ? - ?Variable selection in linear models (regression)
3Variable selection in linear models
a few generalities
- There is a huge literature on the selection
of variables in linear models. Why? First, if we
leave out variables which should be in our model,
we risk biasing the parameter estimates of those
we include, and our error has systematic
components, reducing our ability to identify
those which should be there. Second, when we
includes variables in a model which do not need
to be there, we reduce the precision of the
estimates of the ones which are there. -
- We need to include all the relevant (perhaps
important is a better word), and no irrelevant
variables in our model. This is an instance of
classic statistical trade-off of bias and
variance too few variables, higher bias too
many, higher variance. The literature contains a
host of methods of addressing this problem, known
as model or variable selection methods. Some are
general, i.e. not just for linear models, e.g.
the Akaike information criterion (AIC), the Bayes
information criterion (BIC), or cross-validation
(CV). Others are specifically designed for linear
models, e.g. ridge regression, Mallows Cp, and
the lasso. We dont have the time to offer a
thorough review of this topic, but well comment
on and compare a few approaches.
4Special features of our problem
- We have a known joint distribution of the
covariates (marker genotypes), and these are
(approximately) Markovian - We may have lots of missing covariate
information, but it can be dealt with.that is,
we can effectively use all our data - Our aim is to identify the key players (markers),
not to minimize prediction error or find the
true model. - Conclusion Our problem is not a generic linear
model selection problem, and this offers the hope
that we can do better than we might in general. - END REVIEW
5Finding QTL as model selection(the loci are part
of the model)
- Select class of models
- Additive models
- Additive plus pairwise interactions
- Regression trees
- Compare models (?)
- BIC?(?) logRSS(?) ?(?log n/n)
- Sequential permutation tests
- Search model space
- Forward selection (FS)
- Backward elimination (BE)
- FS followed by BE
- MCMC
- Assess performance
- Maximize no QTL found
- control false positive rate
6Model class additive (linear) models
- n backcross mice M markers
- xij genotype (0/1) of mouse i at marker j
- yi phenotype (trait value) of mouse i
- Yi ? ?j1M ?jxij ?j
- The model is characterized by its variable subset
? j ?j ? 0. - We could also include some pairwise interactions
of variables (loci) in the model.
7Another model class regression trees
Tree model here ? would include splits on
genotypes at specified loci (q1 etc) and
average QT values at tips.
Regression trees are good for capturing
nonlinearities.
8Model comparison criteria
- Within a model class, we want to compare models.
One obvious criterion is the residual sum of
squares, RSS(?). - Exercise. Prove that for a normal linear model,
with unknown regression coefficients and error
variance, the maximized log-likelihood lmax is
essentially -(n/2)logRSS(?). - However, we know that including more variables
reduces the RSS, without necessarily improving
the quality of our model we need a penalty for
model size or model complexity. There are many
such, and we choose to minimize criteria of the
form - ?(?) logRSS(?) ?n-1D(n)
- for suitable D(n). The case D(n) n is called
AIC, while D(n) logn is BIC, and D(n) loglogn
is due to Hannan and Quinn. - Under certain assumptions, the last two are
consistent estimators of a true model, while AIC
tends to overfit under the same assumptions. It
has different optimality properties. For finding
QTL, we prefer BIC, but use D(n) ?logn, which
we call BIC?, with ? gt 1.
9Model comparison a Bayesian approach
- We could place prior probabilities on each of
the possible models, as well as the model
parameters, and use Bayes theorem to calculate
the posterior probability of the models given the
data. It would make good sense (for a
non-Bayesian!) to choose the model with the
maximum posterior probability. -
- Assume y ?, ?? ,?2 N(X? ?? , ?2 ), and
use the priors ?? N(0, c?2(X? X? )-1 ), and
p(?2 ?) ? 1/?2 , while for the model ? itself,
assume p(?) ? (c/d)?/2 . Now let c ?? (a
diffuse improper prior), and integrate out ?? and
?2. The resulting posterior distribution for ?
satisfies - -(2/n) log p(?y)
logRSS(?) ?n-1 logd. - This is just our general criterion with D(n)
d, which provides further support for it.
However, the only real justification for it is
its performance. - Exercise. Derive the expression involving p(?y)
just given.
10Sequential permutation tests
- Lets begin by seeing how permutation tests
are applied in this context, an idea first put
forward by Churchill Doerge. - Mouse i has phenotype yi and marker data mi.
To test the null hypothesis of no association
between phenotype and marker data, we can simply
de-couple the two create many sets of
pseudo-data by choosing a random permutation ? of
the labels 1, .., n, and forming pairs (yi ,
m?(i)), i 1,n. - To assign p-values to LOD scores, say, we
would compare the value for the observed data to
the set of values obtained for (say) 1,000 sets
of pseudo-data obtained in this way. As always,
the question of significance levels arises, as
does the issue of multiple testing. At least for
the genome-wide calculation of LOD scores,
permutation testing gives a neat solution simply
repeat on the pseudo-data whatever multiple
testing one does with the actual data (e.g.
taking the genome wide maximum LOD). -
- For model selection, permutation tests are
carried out to assess the significance associate
with adding or dropping a variable. Fixed
significance levels are typically used, e.g.
0.05, with no corrections for multiple testing.
11Some methods for model search
- Forward selection (FS) selects the variable not
in the model which provides the greatest
reduction in the penalty function (with a rule
for making no change). Rejecting at a fixed
significance level is usually adopted. Called
greedy by computer scientists. - Backwards elimination (BE) eliminates the
variable whose removal makes the smallest
increase in the penalty function (with a rule for
making no change). Again, the use of a fixed
significance level is standard. - FS followed by BE deliberately overfit, then
prune back. We need suitable parameters to
implement this approach. - Markov chain Monte Carlo (MCMC) provides a way
of moving around the class of all models,
selecting new or eliminating existing variables
according to a stochastic rule. Eventually, we
can choose as the preferred model, the one which
minimizes the criterion function among all those
models visited. A Bayesian would compute
empirical posterior probabilities using the
series of models visited.
12We like BIC? why?
- For a fixed number of markers, letting n ??,
BIC? is consistent - There exists a prior (on models coefficients)
for which BIC? is the -log posterior - BIC? is essentially equivalent to use of a
threshold on the conditional LOD score - It performs well
13Choice of ?
- Larger ? include more loci, higher false
positive rate - Smaller ? include fewer loci lower false
positive rate - Let L 95 genome-wide LOD threshold
- (comparing single QTL models to the null
model) - Choose ? 2 L/log10n .
- With this choice of ?, in the absence of QTL,
well include at least one extraneous locus, 5
of the time.
14Simulations comparing strategies
- Backcross with n 250
- No crossover interference
- 9 chromosomes, each 100cM
- Markers at 10cM spacing complete genotype
data - 7 QTL, equal effects 0.76
- -One pair in coupling (same sign)
- -One pair in repulsion (opp. sign)
- -Three unlinked QTL
- Heritability 50 (env var 1.0)
- 2,000 simulated replicates
15Methods compared
- ANOVA at marker loci
- Composite interval mapping (CIM)
- Forward selection with permutation tests
- Forward selection with BIC?
- Backward elimination with BIC?
- FS followed by BE with BIC?
- MCMC with BIC?
16Methods, some detail
- ANOVA at marker loci essentially 2 sample
t-tests, with marker genotype defined groups,
used simulated genome-wide threshold - Composite interval mapping (CIM) described on
next slide - Forward selection with permutation tests (1,000
perms, ?0.05) - Forward selection to 25 markers, with BIC?
- Backward elimination, starting from the full
model, with BIC? - FS to 25 markers, followed by BE, with BIC?
- MCMC for 1,000 moves with BIC?
- More details and results for other sample sizes
can be found in Broman Speed, JRSSB, 2002.
17Composite interval mapping
- Here an initial subset S of markers is
chosen to control for background variation. -
- Then, as in conventional interval mapping, a
genome-wide scan is carried out, at each locus
testing the hypothesis that a QTL at that locus
need not be added to the existing model, against
the alternative that it should be added.
Markers in S in close proximity to the current
locus (e.g. 10cM) are excluded. - As in standard interval mapping, LOD scores
are computed, and the result plotted. Regions
where the LOD exceeds a suitable genome-wide
threshold C are thought to contain QTL. Here we
simulated to get C. - A key question is how many markers should
there be in the initial subset S, and how should
they be chosen. There is no sharp answer to this
question. We used 3, 5, 7, 9, and 11.
18(No Transcript)
19(No Transcript)
20(No Transcript)
21(No Transcript)
22(No Transcript)
23(No Transcript)
24Conclusions from the simulation
- The BIC? methods performed pretty well, best with
MCMC, though only slightly better than FS. FS
with permutation tests had fewer extraneous than
FS with BIC. - The CIM ones didnt do so well, except when the
of variables was 7, though their rate of
extraneous loci was quite low. - ANOVA was quite poor, in relative terms.
25Selective genotyping
- The idea here is to genotype only animlas in
the high and low extremes of the phenotypic
distribution, but including the phenotypes
(without genotypes) of the remainder in the
analysis. This involves some loss of power,
though not much if the not genotyped isnt
large. Two issues why would we include
phenotypes of animals not genotyped, and how
would we do that. - To make the point, we compare
-
- Strategy A Analyse all genotypes and all
phenotypes -
- Strategy B Analyse all phenotypes, but
include genotypes only for individuals in the
extremes of the phenotypic distribution -
- Strategy C Analyse genotypes and phenotypes
only for individuals in the extremes of the
phenotypic distribution - Exercise. Write down the likelihoods in all
three cases. -
26Selective genotyping in a BC
27Some simulation results
- The slides that follow are for a hypothetical F2
intercross with 150 individual mice, a set of 150
markers quite similar to those of our NOD B6
cross, comparing 3 analysis strategies - A analyse both genotypes and phenotypes for all
mice - B analyse phenotypes for all mice, but only
genotypes for the phenotypically extreme mice,
correctly incorporating the other genotypes as
missing - C analyse genotypes and phenotypes for
phenotypically extreme mice only, and not saying
so.
28Type I error for given genome-wide LOD
thresholds
Conclusion strategy C needs a different
genome-wide threshold.
29Power for a given QTL effect ?
Solid strategy A Dashed strategy B, with 50
genotyped
30Power for a given proportion not genotyped
Proportion not genotyped
Solid strategy A, Dashed strategy B with ?
0.25.
31Summary
- QTL mapping is a model selection problem
- Key issue the comparison of models
- Large scale simulations are important
- More refined procedures do not necessarily give
improved results - BIC? with forward selection followed by backward
elimination works quite well - Real savings to be made by genotyping only
extremes phenotypically
32Acknowledgements
- Karl Broman, Johns Hopkins
- Nusrat Rabbee, UCB