Title: Model Selection and Related Topics
Model Selection and Related Topics
A mostly Bayesian perspective
David Madigan, Rutgers
Bayesian Basics
- The Bayesian approach to statistical inference computes probability distributions for all unknowns (model parameters, future observables, etc.) conditional on the observed data
- Thus, denoting by θ the unknowns, we compute
  p(θ | data) ∝ p(data | θ) × p(θ)
- To play this game you need a prior, p(θ), and a likelihood, p(data | θ)
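As a concrete illustration of the identity above, here is a minimal grid-approximation sketch (the Beta prior and binomial likelihood are illustrative choices, not from the slides):

```python
import numpy as np
from scipy import stats

# Grid approximation of p(theta | data) ∝ p(data | theta) x p(theta)
theta = np.linspace(0.001, 0.999, 999)    # grid over the unknown
prior = stats.beta.pdf(theta, 2, 2)       # assumed prior p(theta)
like = stats.binom.pmf(7, 10, theta)      # likelihood: 7 successes in 10 trials
post = prior * like                       # unnormalized posterior
post /= np.trapz(post, theta)             # normalize numerically
print(theta[np.argmax(post)])             # posterior mode, ~0.667
```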
Bayesian Priors (per D.M. Titterington)
Bayesian Basics
- This presentation will focus mostly on the model p(data | θ)
- An idealization of the probabilistic process by which mother nature generates the data
- Frequently, a data analyst will entertain several models for the data: p(data | θ1, M1), p(data | θ2, M2), etc.
- This gives rise to the model selection problem (including model composition)
Bayes Factors: comparing two models/hypotheses
- Bayes factors compare the posterior-to-prior odds of one hypothesis to the posterior-to-prior odds of another:
  B01 = [p(M0 | data) / p(M1 | data)] / [p(M0) / p(M1)] = p(data | M0) / p(data | M1)
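A hedged sketch of computing this ratio directly for two simple hypotheses about a binomial proportion (the models and numbers are invented for illustration):

```python
from scipy import stats
from scipy.integrate import quad

# Hypothetical comparison: M0 fixes theta = 0.5; M1 puts a Beta(1,1) prior on theta
n, k = 20, 15
m0 = stats.binom.pmf(k, n, 0.5)                      # p(data | M0), no free parameter
m1, _ = quad(lambda t: stats.binom.pmf(k, n, t) *
             stats.beta.pdf(t, 1, 1), 0, 1)          # p(data | M1) = ∫ likelihood x prior
bf01 = m0 / m1
print(f"B01 = {bf01:.3f}")                           # < 1 favors M1, > 1 favors M0
```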
Interpretation of Bayes Factors
- Jeffreys suggested a scale for interpreting the numerical value of a Bayes factor
  (table of Jeffreys' scale omitted)
Interpretation of Bayes Factors
- Note that the Bayes factor involves model probabilities, both prior, p(M), and posterior, p(M | data)
- p(M) is the probability that model M generated the data
- What if we don't believe that any of the models generated the data?
Bayes Factors Example (Draper)
- Here is a density estimate for 100 univariate observations, y1, …, y100
- M0: yi ~ N(μ, τ)
- M1: yi ~ t(μ, τ, ν)
Bayes Factors Example (cont.)
- Need to specify priors for everything
Bayes Factors Example (Draper)
- M0: yi ~ N(μ, τ)
- M1: yi ~ t(μ, τ, ν)
- K is about 0.04
- Interesting tidbit:
- posterior standard deviation of μ given M0: 0.165
- posterior standard deviation of μ given M1: 0.153
- (so model averaging can reduce the posterior standard deviation)
Bayes Factors and Improper Priors
- Using improper parameter priors means the Bayes factor contains a ratio of unknown constants
- Lots of work trying to get around this: fractional Bayes factors, partial Bayes factors, intrinsic Bayes factors, etc.
- Simpler solution: use proper priors
Bayes Factors and Model Probabilities
- Note that posterior model probabilities can be derived from all the pairwise Bayes factors and pairwise prior odds
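A small sketch of that bookkeeping, assuming each model's Bayes factor against a reference model M0 is already in hand (numbers illustrative):

```python
import numpy as np

# Posterior model probabilities from Bayes factors against a reference M0
bf = np.array([1.0, 4.2, 0.3, 9.8])         # B_k0; B_00 = 1 by definition
prior = np.array([0.25, 0.25, 0.25, 0.25])  # prior model probabilities
w = bf * prior                              # proportional to p(M_k | D)
post = w / w.sum()
print(post)
```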
Bayesian Model Selection
- If we believe that one of the candidate models generated the data, then the predictively optimal model has the highest posterior probability
- This is also true for variable selection with standard linear models when XᵀX is diagonal, σ² is known, and suitable priors are used (Clyde and Parmigiani, 1996)
The Median Probability Model (Barbieri and
Berger, 2004)
Deviance Information Criterion (DIC)
- Deviance is a standard measure of model fit: D(θ) = −2 log p(data | θ)
- Can summarize it in two ways: at the posterior mean or mode,
  D(θ̄)   (1)
  or by averaging over the posterior,
  D̄ = E[D(θ) | data]   (2)
- (2) will be bigger (i.e., worse) than (1)
Deviance Information Criterion (DIC)
- pD = D̄ − D(θ̄) is a measure of model complexity
- In the normal linear model, pD equals the number of parameters
- More generally, pD equals the number of unconstrained parameters
- DIC = D(θ̄) + 2pD = D̄ + pD
- DIC approximately optimizes predictive log loss
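A minimal sketch of the DIC computation from posterior draws, assuming a normal likelihood with known variance (the draws here are simulated in place of real sampler output):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)
# stand-in for sampler output: approximate posterior draws of mu
draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=5000)

def deviance(mu):
    return -2.0 * stats.norm.logpdf(y, mu, 1.0).sum()

d_bar = np.mean([deviance(m) for m in draws])   # (2): posterior mean deviance
d_hat = deviance(draws.mean())                  # (1): deviance at the posterior mean
p_d = d_bar - d_hat                             # effective number of parameters
dic = d_hat + 2.0 * p_d                         # equivalently d_bar + p_d
print(p_d, dic)
```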
Other Selection Criteria
- The training error rate will usually be less than the true error rate
- Typically one works with error estimates of the form: training error + ω̂, where ω̂ is an estimate of the optimism
Specific Selection Criteria (squared error loss)
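The slide's formula table did not survive extraction; as a stand-in, these are the common textbook forms for a Gaussian linear model (conventions differ by constants across references):

```python
import numpy as np

# Common textbook criteria for a Gaussian linear model with p coefficients
def criteria(rss, n, p, sigma2_full):
    aic = n * np.log(rss / n) + 2 * p              # Akaike
    bic = n * np.log(rss / n) + p * np.log(n)      # Schwarz
    cp = rss / sigma2_full - n + 2 * p             # Mallows' Cp (sigma2 from full model)
    return aic, bic, cp

print(criteria(rss=42.0, n=100, p=5, sigma2_full=0.5))  # illustrative numbers
```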
Selection Criteria - Theory
- BIC is consistent when the true model is fixed
- AIC is consistent if the dimensionality of the true model increases with N at an appropriate rate
- For standard linear models with known variance, AIC and Cp are essentially equivalent
- Folklore is that AIC tends to overfit
Cross-validation
- Since we don't usually believe that one of the candidate models generated the data, and predictive accuracy on future data is key, many authors argue in favor of cross-validation
- For example (Bernardo and Smith, 1994, Section 6.1.6): select the model that maximizes
  Σj log p(xj | xn−1(j), M)
  where xn−1(j) represents the data with observation xj removed, and x1, …, xk is a random sample from the data
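A plug-in sketch of this criterion for a linear model, using maximum likelihood fits in place of the full posterior predictive (an assumption, not the Bernardo-and-Smith computation itself):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_log_score(X, y):
    """Sum of log predictive densities, each term fit without observation j."""
    total = 0.0
    for train, test in LeaveOneOut().split(X):
        fit = LinearRegression().fit(X[train], y[train])
        resid = y[train] - fit.predict(X[train])
        sigma = resid.std(ddof=X.shape[1] + 1)      # plug-in error scale
        total += stats.norm.logpdf(y[test], fit.predict(X[test]), sigma).sum()
    return total                                    # higher is better

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=40)
print(loo_log_score(X, y))
```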
Cross-validation
- Cross-validation gives a slightly biased estimate of future accuracy because it does not use all the data
- Burman (1989) provides a bias correction
Variable Selection
- Important special case of model selection
- Which subset of X1, …, Xd to use as predictors of a response variable Y?
- 2^d possible models. For d = 30, there are about 10^9 models; for d = 50, there are more than 10^15
Variable Selection for Linear Models
Here the Xs might be:
- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors (X4 = log X3)
- Basis expansions (X4 = X3², X5 = X3³, etc.)
- Interactions (X4 = X2·X3)
A popular choice for estimation is least squares
Variable Selection for Linear Models
- Standard all-subsets finds the subset of size k, k = 1, …, p, that minimizes RSS (a brute-force version is sketched below)
- Choice of subset size requires a tradeoff: AIC, BIC, marginal likelihood, cross-validation, etc.
- Leaps and bounds is an efficient algorithm that does all-subsets up to about 40 variables
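Leaps and bounds itself is intricate; this sketch brute-forces the same all-subsets search for small d:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

def best_subset_per_size(X, y):
    """Return, for each size k, the subset of columns minimizing RSS."""
    d, best = X.shape[1], {}
    for k in range(1, d + 1):
        for cols in combinations(range(d), k):
            fit = LinearRegression().fit(X[:, cols], y)
            rss = ((y - fit.predict(X[:, cols])) ** 2).sum()
            if k not in best or rss < best[k][0]:
                best[k] = (rss, cols)
    return best
```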
Bayesian Variable Selection
- Two key challenges:
- Exploring a space of 2^d models (more about this later)
- Choosing a p(M) ≡ p(γ), where γ indexes the models
- Many applications use p(M) ∝ 1, but this induces a binomial distribution over model size (see the sketch below)
- Denison et al. (1998) use a truncated Poisson prior distribution for model size
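The sketch promised above, comparing the model-size distributions implied by the two priors (d and the Poisson rate are illustrative):

```python
import numpy as np
from scipy import stats

d, lam = 30, 5                                     # illustrative settings
sizes = np.arange(d + 1)
binom = stats.binom.pmf(sizes, d, 0.5)             # implied by p(M) ∝ 1
pois = stats.poisson.pmf(sizes, lam)
pois /= pois.sum()                                 # truncate to 0..d and renormalize
print(sizes[np.argmax(binom)], sizes[np.argmax(pois)])  # modal sizes: 15 vs 4-5
```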
Selection Bias
- Selection bias is a significant unresolved issue
- Searching model space to find the best model tends to overfit the data
- This holds even when using the close-to-unbiased estimate of predictive performance that cross-validation provides
Vehtari and Lampinen example
Vehtari and Lampinen's Solution
- Select the simplest model that gives a predictive distribution close to the BMA predictive distribution
- Not obvious how to conduct this search in high-dimensional problems
Cross-validation and model complexity
One standard error rule: pick the simplest model within one standard error of the minimum
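A sketch of the rule, assuming models are indexed from simplest to most complex and per-model CV means and standard errors are already computed:

```python
import numpy as np

def one_se_rule(cv_means, cv_ses):
    """Index of the simplest model within one SE of the minimum CV error."""
    cv_means, cv_ses = np.asarray(cv_means), np.asarray(cv_ses)
    i_min = int(np.argmin(cv_means))
    threshold = cv_means[i_min] + cv_ses[i_min]
    return int(np.argmax(cv_means <= threshold))   # first index under the threshold

# models ordered simplest -> most complex
print(one_se_rule([2.0, 1.1, 1.0, 1.05], [0.1, 0.15, 0.15, 0.2]))  # -> 1
```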
Post-Model Selection Statistical Inference
- Conducting a data-driven model search and then proceeding as if the search never took place leads to biased and overconfident inferences
- Some non-Bayesian work on adjustment for model selection exists (e.g., the current issue of JASA)
Bayesian Model Averaging
- If we believe that one of the candidate models generated the data, then the predictively optimal strategy is to average over all the models
- If Δ is the inferential target, Bayesian Model Averaging (BMA) computes
  p(Δ | data) = Σk p(Δ | Mk, data) p(Mk | data)
- Substantial empirical evidence that BMA provides better prediction than model selection (a BIC-weighted sketch follows)
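The sketch promised above: BMA over predictions using BIC-approximated posterior model probabilities (a common shortcut, not the only way to obtain p(M | data)):

```python
import numpy as np

def bma_predict(bics, preds, prior=None):
    """Average predictions with weights ∝ exp(-BIC/2) x prior model prob."""
    bics = np.asarray(bics, dtype=float)
    w = np.exp(-(bics - bics.min()) / 2.0)         # shift by min for stability
    if prior is not None:
        w = w * np.asarray(prior)
    w /= w.sum()
    return w @ np.asarray(preds)                   # (K,) @ (K, n) -> (n,)

print(bma_predict([100.2, 101.5, 104.0],
                  [[1.0, 2.0], [1.2, 1.8], [0.9, 2.4]]))
```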
Laplace's Method for p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ
Write n·h(θ) = log[p(D | θ, M) p(θ | M)] (i.e., the log of the integrand, divided by n); then
p(D | M) ≈ (2π/n)^(d/2) |−h″(θ̃)|^(−1/2) p(D | θ̃, M) p(θ̃ | M)
where θ̃ is the posterior mode
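A rough numerical sketch of the approximation for a one-parameter model; BFGS's built-in approximate inverse Hessian stands in for the exact Hessian that careful use would compute:

```python
import numpy as np
from scipy import optimize, stats

y = np.random.default_rng(1).normal(2.0, 1.0, 40)

def log_post(theta):                # log integrand: log-likelihood + log-prior
    return stats.norm.logpdf(y, theta[0], 1.0).sum() + stats.norm.logpdf(theta[0], 0, 10)

res = optimize.minimize(lambda t: -log_post(t), x0=[0.0])   # posterior mode
d = res.x.size
# log m(D) ≈ log p(D|θ̃)p(θ̃) + (d/2) log 2π + (1/2) log|H⁻¹|, H = -∇² log post
log_m = (log_post(res.x) + 0.5 * d * np.log(2 * np.pi)
         + 0.5 * np.log(np.linalg.det(res.hess_inv)))       # BFGS inverse-Hessian estimate
print(log_m)
```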
- Tierney & Kadane (1986, JASA) show the approximation is O(n^−1)
- Using the MLE instead of the posterior mode is also O(n^−1)
- Using the expected information matrix in Σ is O(n^−1/2) but convenient, since it is often computed by standard software
- Raftery (1993) suggested approximating by a single Newton step starting at the MLE
- Note the prior is explicit in these approximations
Monte Carlo Estimates of p(D | M)
Draw iid θ1, …, θm from the prior p(θ) and use
p̂(D | M) = (1/m) Σi p(D | θi)
In practice this has large variance
Better Monte Carlo Estimates of p(D | M)
Draw iid θ1, …, θm from the posterior p(θ | D)
Importance sampling: draw from a convenient q(θ) and average the weights p(D | θi) p(θi) / q(θi)
- Newton and Raftery's harmonic mean estimator:
  p̂(D | M) = [ (1/m) Σi 1/p(D | θi) ]^−1, with θi drawn from the posterior
- Unstable in practice and needs modification
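To make the estimators on the last few slides concrete, here is a sketch for a conjugate normal model where the exact marginal likelihood is available as a check (all settings illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, 30)                       # data; sigma = 1 known
n, tau2, m = len(y), 4.0, 100_000                  # prior: theta ~ N(0, tau2)

def loglik(th):
    return stats.norm.logpdf(y[:, None], th, 1.0).sum(axis=0)

# (1) naive MC from the prior: large variance when prior and likelihood disagree
th = rng.normal(0, np.sqrt(tau2), m)
naive = np.exp(loglik(th)).mean()

# (2) importance sampling with a posterior-shaped proposal q
v_post = 1.0 / (n + 1.0 / tau2)                    # exact posterior variance
m_post = v_post * y.sum()                          # exact posterior mean
q = stats.norm(m_post, np.sqrt(v_post))
th_q = q.rvs(m, random_state=rng)
imp = np.mean(np.exp(loglik(th_q)
                     + stats.norm.logpdf(th_q, 0, np.sqrt(tau2)) - q.logpdf(th_q)))

# (3) harmonic mean of likelihoods over posterior draws: notoriously unstable
harm = 1.0 / np.mean(np.exp(-loglik(q.rvs(m, random_state=rng))))

# exact marginal likelihood: y ~ N(0, I + tau2 * J)
exact = np.exp(stats.multivariate_normal.logpdf(y, np.zeros(n), np.eye(n) + tau2))
print(naive, imp, harm, exact)
```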
p(D | M) from Gibbs sampler output (Chib's method)
First note the following identity (for any θ*):
p(D) = p(D | θ*) p(θ*) / p(θ* | D)
p(D | θ*) and p(θ*) are usually easy to evaluate. What about p(θ* | D)?
Suppose we decompose θ into (θ1, θ2) such that p(θ1 | D, θ2) and p(θ2 | D, θ1) are available in closed form
The Gibbs sampler gives (dependent) draws from p(θ1, θ2 | D), and hence marginally from p(θ2 | D)
Rao-Blackwellization: p̂(θ1* | D) = (1/G) Σg p(θ1* | D, θ2(g))
What about three parameter blocks? Write
p(θ* | D) = p(θ1* | D) · p(θ2* | D, θ1*) · p(θ3* | D, θ1*, θ2*)
The first factor is OK (Rao-Blackwellize the main run) and the third is OK (closed form); the middle factor is the problem. To get the draws it needs, continue the Gibbs sampler, sampling in turn from p(θ2 | D, θ1*, θ3) and p(θ3 | D, θ1*, θ2)
p(D | M) from Metropolis sampler output (Chib & Jeliazkov)
p(θ* | D) = E1[α(θ, θ*) q(θ, θ*)] / E2[α(θ*, θ)]
where E1 is with respect to the posterior p(θ | D) and E2 is with respect to the proposal q(θ*, ·)
Savage-Dickey Density Ratio
- Suppose M0 simplifies M1 by setting one parameter (say θ1) to some constant (typically zero)
- If p1(θ2 | θ1 = 0) = p0(θ2), then
  p(data | M0) / p(data | M1) = p(θ1 = 0 | M1, data) / p(θ1 = 0 | M1)
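A sketch for a normal mean with known variance, where both densities in the ratio are available in closed form:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, 25)                 # sigma = 1 known
n, tau2 = len(y), 1.0                        # M1 prior: theta ~ N(0, tau2)
v = 1.0 / (n + 1.0 / tau2)                   # posterior variance under M1
m = v * y.sum()                              # posterior mean under M1
bf01 = stats.norm.pdf(0, m, np.sqrt(v)) / stats.norm.pdf(0, 0, np.sqrt(tau2))
print(bf01)                                  # posterior/prior density at theta = 0
```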
Bayesian Information Criterion (BIC)
BIC = 2·SL + d·log n (SL is the negative log-likelihood at the MLE; d the number of parameters)
- BIC is an O(1) approximation to −2 log p(D | M)
- Circumvents an explicit prior
- The approximation improves to O(n^−1/2) for a class of priors called unit information priors
- No free lunch (Weakliem (1998) example)
Computing Variable Selection via Stepwise Methods
- Efroymson's 1960 algorithm is still the most widely used
Efroymson
- F-to-Enter
- F-to-Remove
- Guaranteed to converge
- Not guaranteed to converge to the right model
- The statistic's distribution is not even remotely like an F
Trouble
- Y = X1 + X2
- Y almost orthogonal to X1 and to X2 individually
- Forward selection and Efroymson pick X3 alone
More Trouble
- Berk example with 4 variables
- The forward and backward sequence is (X1; X1, X2; X1, X2, X4)
- The R² for {X1, X2} is 0.015
Even More Trouble
- Detroit example, N = 13, d = 11
- The first variable selected by forward selection is the first variable eliminated by backward elimination
- The best subset of size 3 gives RSS of 6.8
- Forward selection's best set of 3 has RSS 21.2; backward elimination's gets 23.5
Alternatives to all-subsets
- Simulated Annealing, Tabu Search, etc.
MCMC for Bayesian Variable Selection (Ioannis Ntzoufras)
http://www.ba.aegean.gr/~ntzoufras/courses/bugs2/handouts/modelsel/4_1_tutorial_handouts.pdf
Why MCMC?
Reversible Jump MCMC
Stochastic Search Variable Selection (George & McCulloch)
SSVS Procedure
Estimating Spina Bifida Numbers with Capture-Recapture Methods
(Regal & Hook, 1991; Madigan & York, 1996)
Model   Pr(Model)   N     95% HPD
B D R   0.37        731   (701, 767)
B D R   0.30        756   (714, 811)
B R D   0.28        712   (681, 751)
BMA     —           728   (682, 797)
Spina Bifida 95% HPDs (figure: interval plots for M1, M2, M3, and BMA)
Model Uncertainty
- Posterior Variance = Within-Model Variance + Between-Model Variance
- Data-driven model selection is risky: part of the evidence is spent to specify the model (Leamer, 1978)
- Model-based inferences can be over-precise
- Model-based predictions can be badly calibrated
- Draper (1995), Chatfield (1995)
Bayesian Model Averaging (BMA) can help
Bayesian Model Averaging
Out-of-Sample Predictive Performance
Data                        Model Class               Average Improvement in Predictive
                                                      Probability over MAP Model
1. Coronary Heart Disease   Decomposable UDGs         2.2
2. Women and Mathematics    Decomposable UDGs         0.6
3. Scrotal Swellings        Decomposable UDGs         5.1
4. Crime and Punishment     Linear Regression         61.3
5. Lung Cancer              Exponential Regression    1.8
6. Cirrhosis                Cox Survival Regression   1.8
7. Coronary Heart Disease   Essential Graphs          1.5
8. Women and Mathematics    Essential Graphs          3.0
9. Stroke                   Cox Survival Regression   15.0
BMA Computing
Occam's Window: find parsimonious models with large Pr(M | D) and average over those (Madigan and Raftery, 1991)
Importance Sampling (Clyde et al., JASA, 1996)
MC³: use MCMC to draw from Pr(M | D) (Madigan and York, 1992)
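A sketch of an MC³-style sampler over variable-inclusion vectors; it uses exp(−BIC/2) as a stand-in for the marginal likelihood, which is an assumption rather than the original algorithm's exact computation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bic_score(X, y, gamma):
    n, cols = len(y), np.flatnonzero(gamma)
    if cols.size == 0:
        rss = ((y - y.mean()) ** 2).sum()
    else:
        fit = LinearRegression().fit(X[:, cols], y)
        rss = ((y - fit.predict(X[:, cols])) ** 2).sum()
    return n * np.log(rss / n) + (cols.size + 1) * np.log(n)

def mc3(X, y, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    gamma = np.zeros(X.shape[1], dtype=int)
    score, visits = bic_score(X, y, gamma), {}
    for _ in range(n_iter):
        cand = gamma.copy()
        cand[rng.integers(len(gamma))] ^= 1            # flip one variable in/out
        cand_score = bic_score(X, y, cand)
        if np.log(rng.random()) < (score - cand_score) / 2.0:  # accept w.p. exp(-ΔBIC/2)
            gamma, score = cand, cand_score
        visits[tuple(gamma)] = visits.get(tuple(gamma), 0) + 1
    return visits                                      # visit counts ≈ model probabilities
```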
Gibbs MC³
e.g. undirected graphical models
SSVS (George and McCulloch, 1993)
- Choose vi, vj at random (or systematically) and draw the corresponding indicator from its full conditional
Metropolis MC³
e.g. SVO: Regression + Outliers (Hoeting, Raftery, Madigan, 1994-6)
Possible predictors: a, b, c, d. Possible outliers: 13, 20, 40
Current model: (b,c)(13,20)
Candidate models: (b)(13,20), (b,c)(13), (c)(13,20), (b,c)(20), (b,c,d)(13,20), (b,c)(13,20,40), (a,b,c)(13,20)
Accept the candidate model M′ with probability min{1, Pr(M′ | D) / Pr(M | D)}
Augmented MC³
e.g. Bayesian networks
- A total ordering T of V is said to be compatible with a directed graph M if the orientation of the arrows in M is consistent with T
- Draw from Pr(T, M | D) by alternating Pr(M | T, D) and Pr(T | M, D):
- For T: propose uniformly on compatible Ts; Metropolis accept/reject
- For M: generate M′ by adding or deleting an edge from M consistent with T; Metropolis accept/reject
More Augmented MC³
- Draw from Pr(Z, M | D) by alternating Pr(M | Z, D) and Pr(Z | M, D)
e.g. Double Sampling + Missing Data (York et al., 1994):
draw from Pr(Z, θM | M, D) via Pr(Z | θM, M, D) and Pr(θM | Z, M, D)
Reversible Jump MCMC (Green, 1995)
Linear Regression: SVT, SVO, SVOT
Hoeting, Raftery, Madigan (1994-6)
- Normal-gamma conjugate priors
- Box-Cox and ACE transformations
- Outliers (pre-process with LMS regression)
(figure: posterior inclusion probabilities Pr(βi ≠ 0 | D))
Generalized Linear Models
Raftery (1996)
- Idea: approximate the marginal likelihood by one Newton step starting from the MLE, with covariance taken as minus the inverse Hessian of the log posterior evaluated at that point
- Gives an approximation using only GLIM output
Prior Distribution on Models
Madigan, Gavrin, Raftery (1995)
- Start with a uniform pre-prior, PPr(M)
- Elicit imaginary data from the expert
- Update the pre-prior using the imaginary data to get the real prior, Pr(M)
- Provided improved predictive performance in a particular medical example
- See also Ibrahim and Laud (1994)
Predicting Strokes
(Volinsky, Madigan, Raftery, Kronmal, 1996)
- Stroke is the third leading cause of death among adults in the US
- Gorelick (1995) estimated that 80% of strokes are preventable
- Cardiovascular Health Study (CHS): ongoing, started 1989; 5,201 participants in four counties, aged 65+. Are risk factors different for older people?
- 172/4,501 strokes
Measured Covariates
Follow-up ranged from 3.5 to 4.5 years (average 4.1)
Cox Model
Estimation usually based on the partial likelihood (Taplin, 1993; Draper, 1995)
Finding the Models in Occam's Window
- Need the models with posterior probability within a factor of C of the MAP model
- Approximate leaps and bounds (Furnival and Wilson, 1974; Lawless and Singhal, 1978):
- Finds the top q models of each size
- Find models within a factor of C²
- Compute exact BIC for these models
Picture of Probs vs P-values
Predictive Performance
Assigned Risk Group   Model Averaging   Stepwise      Top PMP
                      n (strokes)       n (strokes)   n (strokes)
Low                   751 (7)           750 (8)       724 (10)
Medium                770 (24)          799 (27)      801 (28)
High                  645 (—)           617 (51)      641 (48)
Shrinkage Methods
- Subset selection is a discrete process: individual variables are either in or out. A combinatorial nightmare.
- It can also have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient
Ridge Regression
β̂ridge = argminβ Σi (yi − β0 − Σj xij βj)², subject to Σj βj² ≤ s
Equivalently: β̂ridge = argminβ { Σi (yi − β0 − Σj xij βj)² + λ Σj βj² }
This leads to β̂ridge = (XᵀX + λI)⁻¹ Xᵀy. Choose λ by cross-validation.
Works even when XᵀX is singular
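A minimal sketch of the closed-form solution (intercept handling and standardization omitted for brevity):

```python
import numpy as np

# Ridge: beta = (X'X + lambda I)^(-1) X'y; the added lambda*I makes the
# system well-posed even when X'X is singular (p > n or collinear columns)
def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))               # p > n: X'X is singular
y = rng.normal(size=20)
print(ridge(X, y, lam=1.0)[:5])             # still solvable with lambda > 0
```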
Ridge Regression = Bayesian MAP Regression (Gaussian prior on the coefficients)
Least Absolute Shrinkage and Selection Operator (LASSO)
β̂lasso = argminβ Σi (yi − β0 − Σj xij βj)², subject to Σj |βj| ≤ s
A quadratic programming algorithm is needed to solve for the parameter estimates
More generally, constrain Σj |βj|^q ≤ s: q = 0 is variable selection, q = 1 the lasso, q = 2 ridge. Learn q?
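Modern software solves the lasso by coordinate descent rather than generic quadratic programming; a sketch with sklearn (data simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# The L1 constraint sets some coefficients exactly to zero, doing
# selection and shrinkage at once
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
fit = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(fit.coef_))            # indices of the variables kept
```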
Ridge & LASSO - Theory
- Lasso estimates are consistent
- But the lasso does not have the oracle property: it does not deliver the correct model with probability tending to 1
- Fan & Li's SCAD penalty function has the oracle property
LARS
- New geometrical insights into the lasso and stagewise regression
- Leads to a highly efficient lasso algorithm for linear regression
LARS
- Start with all coefficients βj = 0
- Find the predictor xj most correlated with y
- Increase βj in the direction of the sign of its correlation with y. Take residuals r = y − ŷ along the way. Stop when some other predictor xk has as much correlation with r as xj has
- Increase (βj, βk) in their joint least squares direction until some other predictor xm has as much correlation with the residual r
- Continue until all predictors are in the model
Statistical Analysis of Textual Data
- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- The first (?) was Thomas C. Mendenhall in 1887
- Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the US Coast and Geodetic Survey, and later President of Worcester Polytechnic Institute
(photo: Mendenhall Glacier, Juneau, Alaska)
χ² = 127.2, df = 12
- Used naïve Bayes with Poisson and negative binomial models
- Out-of-sample predictive performance
Today
- Statistical methods routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
- Not reported in the statistical literature (no statisticians?)
Text Categorization
- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- Dominant technology is supervised machine learning: manually classify some documents, then learn a classification rule from them (possibly with manual intervention)
Document Representation
- Documents usually represented as a "bag of words"
- The xi's might be 0/1, counts, or weights (e.g. tf-idf, LSI)
- Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
Classifier Representation
- For instance, a linear classifier: score(x) = Σj βj xj
- The xi's are derived from the text of the document
- yi indicates whether to put document i in the category
- The βj are parameters chosen to give good classification effectiveness
Logistic Regression Model
- Linear model for the log odds of category membership: log[p(y = 1 | x) / p(y = 0 | x)] = Σj βj xj
- Equivalently, the conditional probability model p(y = 1 | x) = exp(Σj βj xj) / (1 + exp(Σj βj xj))
Logistic Regression as a Linear Classifier
- If the estimated probability of category membership is greater than p, assign the document to the category
- Choose p to optimize the expected value of your effectiveness measure
- Can change the measure without changing the model
Maximum Likelihood Training
- Choose the parameters (βj's) that maximize the probability (likelihood) of the class labels (yi's) given the documents (xi's)
- Maximizing the (log-)likelihood can be viewed as minimizing a loss function
(figure from Hastie, Tibshirani & Friedman)
Avoiding Overfitting
- Text is high dimensional
- Maximum likelihood gives infinite parameter values and poor effectiveness
- Solution: penalize large βj's, e.g. maximize log-likelihood − λ Σj βj²
- Called ridge logistic regression
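A sketch of ridge logistic regression on bag-of-words features (the four-document corpus is invented; C is the inverse penalty strength):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["rates rise on inflation data", "team wins the cup final",
        "bank cuts interest rates", "striker scores twice in final"]
labels = [1, 0, 1, 0]                       # 1 = finance, 0 = sports
X = TfidfVectorizer().fit_transform(docs)   # bag-of-words tf-idf features
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)
print(clf.predict_proba(X)[:, 1])           # estimated membership probabilities
```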
A Bayesian Interpretation of Ridge Logistic Regression
- Suppose we believe each βj is a small value near 0, and encode this belief as separate Gaussian probability distributions over the values of βj
- Bayes' rule specifies our new (posterior) belief about β after seeing the training data
- Choosing the maximum a posteriori value of β gives the same result as ridge logistic regression
Zhang & Oles Results
- Reuters-21578 collection
- Ridge logistic regression plus feature selection
Bayes!
- MAP logistic regression with a Gaussian prior gives state-of-the-art text classification effectiveness
- But the Bayesian framework is more flexible than SVMs for combining knowledge with data:
- Feature selection
- Stopwords, IDF
- Domain knowledge
- Number of classes
- (and kernels...)
Bayesian Supervised Feature Selection
- Results on ridge logistic regression for text classification use ad hoc feature selection
- Using feature selection amounts to a belief (before seeing the data) that many coefficients are 0
- Put that belief into our prior on the coefficients: a Laplace prior, i.e. lasso logistic regression (sketch below)
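The sketch promised above: L1-penalized (Laplace-prior MAP) logistic regression, where sparsity falls out of the penalty (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Most coefficients end up exactly zero, so feature selection
# falls out of the prior rather than being an ad hoc pre-step
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=5, random_state=0)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(np.count_nonzero(l1.coef_), "of", l1.coef_.size, "coefficients nonzero")
```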
Data Sets
- ModApte subset of Reuters-21578: 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2: 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories: 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TFxIDF weights
Dense vs. Sparse Models (Macroaveraged F1, Preliminary)
Bayesian Unsupervised Feature Selection and Weighting
- Stopwords: low-content words that typically are discarded. Give them a prior with mean 0 and low variance
- Inverse document frequency (IDF) weighting: rare words are more likely to be content indicators, so make the variance of the prior inversely proportional to frequency in the collection
- Experiments in progress
Bayesian Use of Domain Knowledge
- Often believe that certain words are positively or negatively associated with a category
- The prior mean can encode the strength of the positive or negative association; the prior variance encodes confidence
First Experiments
- 27 RCV1-v2 Region categories
- CIA World Factbook entry for each country
- Give content words higher mean and/or variance
- Only 10 training examples per category
- Shows off prior knowledge; limited data is often the case in applications
Results (Preliminary)
Polytomous Logistic Regression
- Logistic regression trivially generalizes to 1-of-k problems; cleaner than SVMs, error-correcting codes, etc.
- The Laplace prior is particularly cool here
- Suppose 99 classes and a word that predicts class 17
- The word gets used 100 times if you build 100 separate models, or if you use polytomous with a Gaussian prior
- With a Laplace prior and polytomous, it is used only once
- Experiments in progress, particularly on author identification
1-of-K Sample Results: brittany-l
89 authors with at least 50 postings; 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter
1-of-K Sample Results: brittany-l (cont.)
4.6 million parameters; same setup as above
Future
- Choose the exact number of features desired
- Faster training algorithm for polytomous (currently using cyclic coordinate descent)
- Hierarchical models: sharing strength among categories
- Hierarchical relationships among features: stemming, thesaurus classes, phrases, etc.
Text Categorization Summary
- Conditional probability models (logistic, probit, etc.) are as powerful as other discriminative models (SVM, boosting, etc.)
- The Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/~madigan/BBR
- Polytomous and domain-specific priors coming soon
For high-dimensional predictive modeling
- Regularized regression methods are better than ad hoc variable selection algorithms
- Regularized regression methods are more practical than discrete model averaging (and probably make more sense)
- L1 regularization is the best way to do variable selection