What You See May Not Be What You Get: A Primer on Regression Artifacts (PowerPoint presentation transcript)

Transcript and Presenter's Notes

1
What You See May Not Be What You Get: A Primer
on Regression Artifacts
  • Michael A. Babyak, PhD
  • Duke University Medical Center

2
(No Transcript)
3
Topics to Cover
  1. Models: what and why?
  2. Preliminaries: requirements for a good model
  3. Dichotomizing a graded or continuous variable is
    dumb
  4. Using degrees of freedom wisely
  5. Covariate selection
  6. Transformations and smoothing techniques for
    non-linear effects
  7. Resampling as a superior method of model
    validation

4
What is a model?
Y = f(x1, x2, x3, ..., xn)
Y = a + b1x1 + b2x2 + ... + bnxn
Y = a + b1x1 + b2x2 + ... + bnxn + e
5
Why Model? (instead of test)
  • Can capture theoretical/predictive system
  • Estimates of population parameters
  • Allows prediction as well as hypothesis testing
  • More information for replication

6
Preliminaries
  1. Correct model
  2. Measure well and don't throw information away
  3. Adequate Sample Size

7
Correct Model
  • Gaussian: General Linear Model
    • Multiple linear regression
  • Binary (or ordinal): Generalized Linear Model
    • Logistic regression
    • Proportional odds/ordinal logistic
  • Time to event
    • Cox regression
  • Distribution of predictors generally not important

8
Measure well and don't throw information away
  • Reliable, interpretable
  • Use all the information about the variables of
    interest
  • Don't create clinical cutpoints before modeling
  • Model with ALL the data first, then use
    prediction to make decisions about cutpoints

9
Dichotomizing for Convenience Can Destroy a Model
10
Implausible measurement assumption
(Figure: individuals A, B, and C placed along a continuous depression score, with a cutpoint splitting the scale into "not depressed" and "depressed")
11
Dichotomization, by definition, reduces power by
a minimum of about 30%
http://psych.colorado.edu/mcclella/MedianSplit/
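A quick way to see the power cost is simulation. The sketch below is my own illustration, not the simulation behind the figure above; the effect size, sample size, and number of replications are arbitrary choices.

## Sketch: power for detecting a real effect when the predictor is kept
## continuous vs. median-split. All settings below are illustrative.
set.seed(42)
power_sim <- function(n = 60, b = .4, reps = 2000) {
  hits <- replicate(reps, {
    x <- rnorm(n)
    y <- b * x + rnorm(n)
    x_cut <- as.numeric(x > median(x))                         # median split
    c(cont = summary(lm(y ~ x))$coefficients[2, 4] < .05,      # p-value, continuous
      dich = summary(lm(y ~ x_cut))$coefficients[2, 4] < .05)  # p-value, dichotomized
  })
  rowMeans(hits)  # proportion of replications rejecting H0: b = 0
}
power_sim()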
12
Dichotomization, by definition, reduces power by
a minimum of about 30%
Dear Project Officer,
In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about three or four hundred thousand dollars' worth of subject recruitment and testing money, we are confident that you will understand.
Sincerely,
Dick O. Tomi, PhD
Prof. Richard Obediah Tomi, PhD
13
Examples from the WCGS Study: Correlations with
CHD Mortality (n = 750)
14
Dichotomizing does not reduce measurement error
Gustafson, P. and Le, N.D. (2001). A comparison of continuous and discrete measurement error: is it wise to dichotomize imprecise covariates? Submitted. Available at http://www.stat.ubc.ca/people/gustaf.
15
Simulation: Dichotomizing makes matters worse
when the measure is unreliable
(Path diagram: X1 -> Y with b1 = .4; true model, X1 continuous)
16
Simulation: Dichotomizing makes matters worse
when the measure is unreliable
(Path diagram: X1 -> Y with b1 = .4; same model with X1 dichotomized)
17
Simulation: Dichotomizing makes matters worse
when the measure is unreliable
(Path diagrams: continuous and dichotomized X1 -> Y, each with b1 = .4;
reliability of X1 manipulated at .65, .75, .85, and 1.00)
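A rough version of this simulation can be written in a few lines of R; this is my own sketch, not the author's code, with illustrative n, effect size, and replication count. It reports the average t statistic for the X1 effect when the error-laden X1 is entered continuously versus median-split, at the reliabilities listed above.

## Sketch: effect of unreliable X1, kept continuous vs. dichotomized
set.seed(1)
sim_t <- function(rel, n = 200, b1 = .4, reps = 500) {
  rowMeans(replicate(reps, {
    x_true <- rnorm(n)
    x_obs  <- sqrt(rel) * x_true + sqrt(1 - rel) * rnorm(n)   # reliability = rel
    y      <- b1 * x_true + rnorm(n)
    c(cont = summary(lm(y ~ x_obs))$coefficients[2, 3],                    # t, continuous
      dich = summary(lm(y ~ (x_obs > median(x_obs))))$coefficients[2, 3])  # t, median split
  }))
}
sapply(c(.65, .75, .85, 1.00), sim_t)   # reliabilities from the slide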
18
Dichotomization of a variable measured with error
(y = .4x + e)
19
Dichotomization of a variable measured with error
(y = .4x + e)
20
Dichotomizing will obscure non-linearity
21
Dichotomizing will obscure non-linearity
22
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
(Path diagram: X1 -> Y with b1 = .4, X2 -> Y with b2 = .0; X1 and X2 continuous)
23
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
(Path diagram: X1 -> Y with b1 = .4, X2 -> Y with b2 = .0; X1 dichotomized)
24
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
(Path diagram: X1 -> Y with b1 = .4, X2 -> Y with b2 = .0; X1 dichotomized,
correlation r12 manipulated at .0, .4, and .7)
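The sketch below is my own toy version of Simulation 2, not the original code: it median-splits X1 and tracks how often the truly null X2 reaches p < .05 at the three correlations shown above.

## Sketch: spurious "significance" of X2 (true b2 = 0) after X1 is median-split
set.seed(2)
false_pos_x2 <- function(r12, n = 200, reps = 2000) {
  mean(replicate(reps, {
    x1 <- rnorm(n)
    x2 <- r12 * x1 + sqrt(1 - r12^2) * rnorm(n)     # corr(x1, x2) is about r12
    y  <- .4 * x1 + rnorm(n)
    x1_cut <- as.numeric(x1 > median(x1))
    summary(lm(y ~ x1_cut + x2))$coefficients["x2", 4] < .05
  }))
}
sapply(c(.0, .4, .7), false_pos_x2)   # Type I error rate for X2 at each r12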
25
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
26
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
27
Is it ever a good idea to categorize
quantitatively measured variables?
  • Yes
  • when the variable is truly categorical
  • for descriptive/presentational purposes
  • for hypothesis testing, if enough categories are
    made.
  • However, using many categories can lead to
    problems of multiple significance tests and still
    run the risk of misclassification

28
CONCLUSIONS
  • Cutting
  • Doesn't always make measurement sense
  • Almost always reduces power
  • Can fool you with too much power in some
    instances
  • Can completely miss important features of the
    underlying function
  • Modern computing/statistical packages can
    handle continuous variables
  • Want to make good clinical cutpoints? Model
    first, cut later.

29
Clinical Events and LVEF Change during Mental
Stress: 5-Year Follow-up
Model first, cut later
(Figure: probability of an event plotted against maximum change in LVEF (%))
30
Requirements: Sample Size
  • Linear regression
  • minimum of N = 50 + 8 per predictor (Green, 1990)
  • Logistic regression
  • minimum of N = 10-15 per predictor among the smallest
    group (Peduzzi et al., 1990a)
  • Survival analysis
  • minimum of N = 10-15 per predictor (Peduzzi et al.,
    1990b)
(A small helper sketching these rules follows.)
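These rules of thumb are easy to wrap in a helper; the sketch below is mine (the function name and arguments are not from the talk). For the logistic and survival rules it returns the required count in the smallest outcome group or the required number of events, not a total N.

## Sketch: minimum sizes implied by the rules of thumb above
min_size <- function(p, per_predictor = 10) {
  c(linear_total_N          = 50 + 8 * p,          # Green (1990)
    logistic_smallest_group = per_predictor * p,   # Peduzzi et al. (1990a)
    survival_events         = per_predictor * p)   # Peduzzi et al. (1990b)
}
min_size(p = 7, per_predictor = 15)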

31
Concept of Simulation
Y = bX + error
(Diagram: repeated samples each yield an estimate b_s1, b_s2, ..., b_sk)
32
Concept of Simulation
Y = bX + error
(Diagram: the estimates b_s1, ..., b_sk from the repeated samples are then evaluated)
33
Simulation Example
Y = .4X + error
(Diagram: repeated samples each yield an estimate b_s1, b_s2, ..., b_sk)
34
Simulation Example
Y = .4X + error
(Diagram: the estimates b_s1, ..., b_sk are then evaluated)
35
True Model: Y = .4x1 + e
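The simulation idea sketched on the previous slides is just a short loop: draw many samples from the true model, fit the regression in each, and look at the collection of estimates. A minimal version of my own (sample size and number of replications are arbitrary):

## Sketch: sampling distribution of b when the true model is Y = .4*x1 + e
set.seed(3)
b_hat <- replicate(5000, {
  x1 <- rnorm(100)
  y  <- .4 * x1 + rnorm(100)
  coef(lm(y ~ x1))["x1"]
})
c(mean = mean(b_hat), sd = sd(b_hat))   # estimates center near .4
# hist(b_hat) shows the kind of sampling distribution evaluated on these slides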
36
Sample Size
  • Linear regression
  • minimum of N = 50 + 8 per predictor (Green, 1990)
  • Logistic regression
  • minimum of N = 10-15 per predictor among the smallest
    group (Peduzzi et al., 1990a)
  • Survival analysis
  • minimum of N = 10-15 per predictor (Peduzzi et al.,
    1990b)

37
All-noise, but good fit
38
Simulation: number of events per predictor ratio
Y = .5x1 + 0x2 + .2x3 + 0x4, where r(x1, x4) = .4
N/p = 3, 5, 10, 20, 50
39
Parameter stability and n/p ratio
40
Peduzzi's simulation: number of events per predictor ratio
P(survival) = a + b1*NYHA + b2*CHF + b3*VES + b4*DM + b5*STD + b6*HTN + b7*LVC
Events/p = 2, 5, 10, 15, 20, 25
Relative bias = ((estimated b - true b) / true b) * 100
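A toy check of the idea, not Peduzzi's actual simulation: fit a logistic model with five candidate predictors at two sample sizes and compute the relative bias of the one non-zero coefficient using the formula above. All settings are illustrative.

## Sketch: relative bias of a logistic coefficient at small vs. large N
set.seed(5)
true_b <- 1
one_fit <- function(n) {
  d   <- data.frame(matrix(rnorm(n * 5), ncol = 5))        # predictors X1..X5
  d$y <- rbinom(n, 1, plogis(-1 + true_b * d$X1))          # only X1 matters
  coef(glm(y ~ ., data = d, family = binomial))["X1"]
}
rel_bias <- function(est) 100 * (est - true_b) / true_b    # ((est - true)/true) * 100
sapply(c(100, 1000), function(n) mean(rel_bias(replicate(300, one_fit(n)))))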
41
Simulation results: number of events per predictor ratio
42
Simulation results: number of events per predictor ratio
43
Predictor (covariate) selection
  • Theory, substantive knowledge, prior models
  • Testing for confounding
  • Univariate testing
  • Last (and least), automated methods, aka stepwise
    and best subset regression

44
Searching for Confounders
  • Fundamental tension between underfitting and
    overfitting
  • Underfitting: not adjusting for important
    confounders
  • Overfitting: capitalizing on chance relations
    (sample fluctuation)

45
Covariate selection
  • Overfitting has been studied extensively
  • The scariest study is by Faraway (1992), which
    showed that any pre-modeling strategy costs a df
    over and above the df used later in modeling.
  • Pre-modeling strategies included variable
    selection, outlier detection, linearity tests,
    and residual analysis.

46
Covariate selection
  • Therefore, if you transform, select, etc., you
    must include those DF in (i.e., penalize for) the
    final model

47
Covariate selection: univariate testing
  • Non-significant tests also cost a DF
  • Variables may not behave the same way in a
    multivariable model: a variable that is not
    significant in a univariate test may be very
    important in the presence of other variables

48
Covariate selection
  • Despite the convention, testing for confounding
    has not been systematically studied; it likely
    leads to overadjustment and an underestimate of
    the true effect of the variable of interest.
  • At the very least, pulling variables in and out
    of models inflates the Type I error rate,
    sometimes dramatically

49
SOME of the problems with stepwise variable
selection.
(Slides 49-58 build this list one point at a time; the complete list:)
1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Anderson, Stat in Med).
4. It yields P-values that do not have the proper meaning, and the proper correction for them is a very difficult problem.
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman).
9. It allows us to not think about the problem.
10. It uses a lot of paper.
59
  • "I now wish I had never written the stepwise
    selection code for SAS."
  • --Frank Harrell, author of the forward and backward
    selection algorithm for SAS PROC REG

60
Automated Selection: Derksen and Keselman (1992)
Simulation Study
  • Studied backward and forward selection
  • Some authentic variables and some noise variables
    among candidate variables
  • Manipulated correlation among candidate
    predictors
  • Manipulated sample size

61
Automated Selection: Derksen and Keselman (1992)
Simulation Study
  • The degree of correlation between candidate
    predictors affected the frequency with which the
    authentic predictors found their way into the
    model.
  • The greater the number of candidate predictors,
    the greater the number of noise variables
    included in the model.
  • Sample size was of little practical importance
    in determining the number of authentic variables
    contained in the final model. (A toy version of
    this experiment is sketched below.)
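Here is a toy version of that kind of experiment, written by me rather than taken from Derksen and Keselman, and using AIC-based backward elimination via R's step() rather than their F-to-enter/remove rules:

## Sketch: stepwise selection with 20 candidate predictors, only 3 authentic
set.seed(4)
n <- 100
d   <- as.data.frame(matrix(rnorm(n * 20), ncol = 20))    # candidates V1..V20
d$y <- .5 * d$V1 + .5 * d$V2 + .5 * d$V3 + rnorm(n)       # V4..V20 are pure noise
sel  <- step(lm(y ~ ., data = d), direction = "backward", trace = 0)
kept <- setdiff(names(coef(sel)), "(Intercept)")
list(authentic_kept = intersect(kept, c("V1", "V2", "V3")),
     noise_kept     = setdiff(kept, c("V1", "V2", "V3")))  # noise variables retained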

62
Simulation results: Number of Noise Variables Included
(Figure: number of noise variables included, by sample size; 20 candidate predictors, 100 samples)
63
Simulation results: R-Square From Noise Variables
(Figure: R-square from noise variables, by sample size; 20 candidate predictors, 100 samples)
64
Simulation results: R-Square From Noise Variables
(Figure: R-square from noise variables, by sample size; 20 candidate predictors, 100 samples)
65
Variable Selection
  • Pick variables a priori
  • Stick with them
  • Penalize appropriately for any data-driven
    decision about how to model a variable

66
Spending DF wisely
  • Select variables of most importance
  • Use DF to assess non-linearity using a flexible
    curve approach (more about this later)
  • If there is not enough N per predictor, combine
    covariates using techniques that do not look at Y
    in the sample: PCA, FA, conceptual clustering,
    collapsing, scoring, established indexes,
    propensity scores (a small PCA sketch follows)
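As one concrete example of combining covariates without looking at Y, the sketch below collapses several nuisance covariates into their first principal component; the variable names and data are made up for illustration.

## Sketch: combine nuisance covariates into one score (no peeking at Y)
set.seed(6)
d <- data.frame(sbp = rnorm(100), dbp = rnorm(100), chol = rnorm(100),
                bmi = rnorm(100), x1 = rnorm(100))                  # made-up covariates
d$y   <- rnorm(100)                                                 # made-up outcome
d$pc1 <- prcomp(scale(d[, c("sbp", "dbp", "chol", "bmi")]))$x[, 1]  # first PC
fit <- lm(y ~ x1 + pc1, data = d)                                   # spends 1 df instead of 4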

67
Can use data to determine where to spend DF
  • Use Spearman's rho to test importance (see the
    sketch below)
  • Not peeking, because we have chosen to include the
    term in the model regardless of its relation to Y
  • Use more DF for non-linearity
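One common way to do this is Harrell's generalized Spearman rho-squared plot. A sketch of the call, assuming the Hmisc package and that the Titanic data are fetched as titanic3 via Hmisc's getHdata():

## Sketch: rank predictors by generalized Spearman rho^2 before spending df
library(Hmisc)
getHdata(titanic3)   # Harrell's Titanic data (survived, fare, age, sex, ...)
plot(spearman2(survived ~ fare + age + sex, data = titanic3, p = 2))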

68
Example: Predict survival from age, gender, and
fare on the Titanic
69
If you have already decided to include them (and
promise to keep them in the model), you can peek
at the predictors to see where to add complexity
70
(No Transcript)
71
Non-linearity using splines
72
Linear Spline (piecewise regression)
Y = a + b1(x < 10) + b2(10 < x < 20) + b3(x > 20)
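The same kind of fit can be done by hand in plain R using "hinge" terms, an equivalent parameterization of the piecewise-linear model above; the knots at 10 and 20 come from the slide, and the data here are made up.

## Sketch: linear spline with knots at 10 and 20, built from hinge terms (x - k)+
lin_spline_fit <- function(x, y, knots = c(10, 20)) {
  basis <- sapply(knots, function(k) pmax(x - k, 0))
  colnames(basis) <- paste0("hinge", knots)
  lm(y ~ x + basis)
}
x <- runif(300, 0, 30)                                          # made-up data
y <- 1 + .5 * x - .8 * pmax(x - 10, 0) + .6 * pmax(x - 20, 0) + rnorm(300)
coef(lin_spline_fit(x, y))   # the slope changes at each knot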
73
Cubic Spline (non-linear piecewise regression)
(Figure: a cubic spline, with knot locations marked)
74
Logistic regression model
fitfare <- lrm(survived ~ (rcs(fare,3) + age + sex)^2, x=T, y=T)
anova(fitfare)
Spline with 3 knots
75
Wald Statistics          Response: survived

Factor                                        Chi-Square  d.f.   P
fare (Factor+Higher Order Factors)               55.1       6   <.0001
  All Interactions                               13.8       4   0.0079
  Nonlinear (Factor+Higher Order Factors)        21.9       3   0.0001
age (Factor+Higher Order Factors)                22.2       4   0.0002
  All Interactions                               16.7       3   0.0008
sex (Factor+Higher Order Factors)               208.7       4   <.0001
  All Interactions                               20.2       3   0.0002
fare * age (Factor+Higher Order Factors)          8.5       2   0.0142
  Nonlinear                                       8.5       1   0.0036
  Nonlinear Interaction: f(A,B) vs. AB            8.5       1   0.0036
fare * sex (Factor+Higher Order Factors)          6.4       2   0.0401
  Nonlinear                                       1.5       1   0.2153
  Nonlinear Interaction: f(A,B) vs. AB            1.5       1   0.2153
age * sex (Factor+Higher Order Factors)           9.9       1   0.0016
TOTAL NONLINEAR                                  21.9       3   0.0001
TOTAL INTERACTION                                24.9       5   0.0001
TOTAL NONLINEAR + INTERACTION                    38.3       6   <.0001
TOTAL                                           245.3       9   <.0001
76-79
(Slides 76-79 repeat the Wald statistics table above.)
80
(No Transcript)
81
(No Transcript)
82
(No Transcript)
83
Validation
  • Apparent
    • too optimistic
  • Internal
    • cross-validation, bootstrap
    • honest estimate of model performance
    • provides an upper limit to what would be found on
      external validation
  • External validation
    • replication with a new sample, different
      circumstances

84
Validation
  • Steyerberg et al. (1999) compared validation
    methods
  • Found that split-half was far too conservative
  • Bootstrap was equal or superior to all other
    techniques

85
Bootstrap
(Diagram: samples are drawn WITH REPLACEMENT from My Sample; each of the k resamples yields an estimate, and the estimates are then evaluated)
86
1, 3, 4, 5, 7, 10
7 1 1 4 5 10
10 3 2 2 2 1
3 5 1 4 2 7
2 1 1 7 2 7
4 4 1 4 2 10
87
Bootstrap Validation
Index        Training   Corrected
Dxy            0.6565       0.646
R2             0.4273       0.407
Intercept      0.0000      -0.011
Slope          1.0000       0.952
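Training versus corrected indices like these are what the validate() function in Harrell's Design/rms library reports. A sketch of the call for the fare model fit earlier, assuming the rms package (successor of Design) and the titanic3 data fetched with Hmisc's getHdata(); B = 200 is an arbitrary choice.

## Sketch: bootstrap validation of the Titanic fare model (Design/rms library)
library(rms)        # modern successor of the Design library; also attaches Hmisc
getHdata(titanic3)  # Harrell's Titanic data (survived, fare, age, sex, ...)
fitfare <- lrm(survived ~ (rcs(fare, 3) + age + sex)^2,
               data = titanic3, x = TRUE, y = TRUE)   # keep X and y for resampling
validate(fitfare, method = "boot", B = 200)
# reports training and optimism-corrected Dxy, R2, intercept, and slope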
88
Summary
  • Think about your model
  • Collect enough data

89
Summary
  • Measure well
  • Don't destroy what you've measured

90
Summary
  • Pick your variables ahead of time and collect
    enough data to test the model you want
  • Keep all your variables in the model unless
    extremely unimportant

91
Summary
  • Use more df on important variables, fewer df on
    nuisance variables
  • Don't peek at Y to combine, discard, or transform
    variables

92
Summary
  • Estimate validity and shrinkage with bootstrap

93
Summary
  • By all means, tinker with the model later, but be
    aware of the costs of tinkering
  • Don't forget to say you tinkered
  • Go collect more data

94
Web links for references, software, and more
  • Harrell's regression modeling text
  • http://hesweb1.med.virginia.edu/biostat/rms/
  • SAS macros for spline estimation
  • http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt
  • Some results comparing validation methods
  • http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf
  • SAS code for bootstrap
  • ftp://ftp.sas.com/pub/neural/jackboot.sas
  • S-Plus home page
  • insightful.com
  • Mike Babyak's e-mail
  • michael.babyak@duke.edu
  • This presentation
  • http://www.duke.edu/mbabyak