Title: Logistic Regression in STATA
1. Logistic Regression in STATA
- The logistic regression programs in STATA use maximum likelihood estimation to generate the logit (the logistic regression coefficient, which corresponds to the natural log of the OR for each one-unit increase in the level of the regressor variable).
- The resulting ORs are maximum-likelihood estimates (MLEs) of the uniform effect (OR) across strata of the model covariates. Thus they are pooled (uniform, common) estimates of the OR and in this sense are adjusted for all regressors included in the model.
2. STATA Logistic Regression Commands
- The logistic command in STATA yields odds ratios.

. logistic sil hpv2 age

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(2)      =     155.28
                                                  Prob > chi2     =     0.0000
Log likelihood = -332.23774                       Pseudo R2       =     0.1894

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hpv2 |   17.23939    5.10605     9.61   0.000     9.647363    30.80598
         age |   .9947965   .0146842    -0.35   0.724     .9664283    1.023997
------------------------------------------------------------------------------
3. STATA Logistic Regression Commands
- The logit command in STATA yields the actual beta coefficients.

. logit sil hpv2 age

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(2)      =     155.28
                                                  Prob > chi2     =     0.0000
Log likelihood = -332.23774                       Pseudo R2       =     0.1894

------------------------------------------------------------------------------
         sil |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hpv2 |   2.847197   .2961851     9.61   0.000     2.266685    3.427709
         age |  -.0052171    .014761    -0.35   0.724    -.0341482    .0237139
       _cons |  -.2873639   .4075825    -0.71   0.481    -1.086211    .5114832
------------------------------------------------------------------------------

- Note that 17.23939 = exp(2.847197), as in the previous slide.
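The correspondence between the two outputs can be checked directly: the logistic table is just the exponentiated logit table, and the same holds for the confidence bounds. A minimal Python sketch, taking the hpv2 estimates above as given (Stata uses the normal quantile 1.959964 rather than 1.96 for its 95% bounds):

```python
import math

# Coefficient and standard error for hpv2 from the logit output
beta, se = 2.847197, 0.2961851

or_hpv2 = math.exp(beta)             # odds ratio reported by the logistic command
lo = math.exp(beta - 1.959964 * se)  # lower 95% confidence bound on the OR scale
hi = math.exp(beta + 1.959964 * se)  # upper 95% confidence bound on the OR scale

print(round(or_hpv2, 5), round(lo, 4), round(hi, 4))
```

Note that exponentiating the logit CI endpoints (2.266685, 3.427709) reproduces the OR CI (9.647363, 30.80598) from the logistic output.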
4. STATA Commands for Multilevel Categorical Variables in Logistic Regression Models
- Categorized continuous variables should be entered in regression models as a series of indicator variables.
- For each category, a variable is created in which observations falling in that category are coded "1" and all other observations are coded "0"; thus the variable is represented in the model as a series of indicator terms, with the reference category left out of the model.
- Any categories of a variable that get left out of the model become part of the reference group (because those observations will be coded 0 for each indicator term left in the model).
5. STATA Commands for Multilevel Categorical Variables in Logistic Regression Models
- If categorized continuous variables are entered in models as if they were continuous, that is, as one term rather than a series of indicator variables, the program will treat the values as a continuous distribution, with each observation in a category having the same value. The resulting odds ratio will correspond to each one-unit increase in the category coding. This will not produce a meaningful result unless the coding can be interpreted as linear increments from one category to another.
- STATA has a convenient command that makes it unnecessary to create the indicator terms for multilevel categorical variables. The xi command creates a series of indicator variables for variables marked i.variablename by recognizing each value as a category. When used with the logistic or logit commands, STATA uses the lowest value as the reference category, which it drops out of the model. It is necessary to make sure that the variable coding reflects the desired categorization and reference level.
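The indicator coding that xi performs can be mimicked by hand. A minimal Python sketch of the idea (the variable name partner and its category codes here are illustrative, not from the data set above):

```python
# Expand a categorical variable into 0/1 indicator terms,
# dropping the lowest value as the reference category (as xi does).
def make_indicators(values):
    levels = sorted(set(values))
    kept = levels[1:]  # lowest level becomes the reference and is omitted
    # One indicator column per non-reference level
    return {f"_I_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in kept}

partner = [1, 2, 3, 1, 3, 2]  # hypothetical category codes
dummies = make_indicators(partner)
print(dummies)
# Observations with partner == 1 are coded 0 on every indicator,
# so category 1 is the reference group.
```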
6. xi: logistic sil i.partner

. xi: logistic sil i.partner

Logit estimates                                   Number of obs   =        580
                                                  LR chi2(12)     =      51.48
                                                  Prob > chi2     =     0.0000
Log likelihood = -374.7638                        Pseudo R2       =     0.0643

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Ipartner_2 |   2.008163   .5687573     2.46   0.014     1.152708    3.498473
 _Ipartner_3 |   2.231293   .6428015     2.79   0.005      1.26864    3.924413
 _Ipartner_4 |    2.40158   .7180631     2.93   0.003     1.336568    4.315221
 _Ipartner_5 |   3.322094   1.050245     3.80   0.000     1.787778    6.173199
 _Ipartner_6 |   2.693878   1.338111     2.00   0.046     1.017575     7.13164
 _Ipartner_7 |   7.836735   6.346424     2.54   0.011     1.602531    38.32338
 _Ipartner_9 |   13.71429   14.85759     2.42   0.016      1.64063    114.6399
_Ipartner_10 |   7.183673   2.981345     4.75   0.000     3.184811    16.20353
_Ipartner_12 |   5.877551   6.864814     1.52   0.129     .5956855    57.99303
------------------------------------------------------------------------------
7. Note
- The constant from the logit model = ln(odds of the reference category).
- Each logit (coefficient) = ln(odds) - constant, where constant = ln(odds of the reference category).
- OR = exp(logit).
8. Interpreting logistic regression model coefficients for continuous variables
- When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how it is entered into the model and the particular units of the variable.
- To interpret the coefficient, we assume that the logit is linear in the variable.
- The slope coefficient gives the change in the log odds for an increase of 1 unit in x.
9. Interpreting logistic regression model coefficients for continuous variables
- Sometimes a one-unit increase may not be meaningful or considered important.
- If we are interested instead in estimating the increased odds for every 5-year increase, we can use the formulas:
- OR(c) = exp(c × β1)
- 95% CI = exp(c × β1 ± 1.96 × c × SE(β1))
10. P-values for Trend
- The term trend generally refers to a monotonic, though not necessarily linear, association between increasing levels of exposure and the probability of the outcome.
- Although examining effect estimates and confidence intervals over levels of exposure is the most informative manner of evaluating a dose-response trend, it has been conventional to report p-values for the null hypothesis of no monotonic association between exposure and disease, that is, a test for trend.
- The p-value corresponding to the coefficient of a variable entered on an appropriate continuous scale in a regression model can be interpreted as a p-value for trend. For statistical hypothesis testing of trend in stratified analysis, see Rothman and Greenland.
11. logit sil lifepart

. logit sil lifepart

Iteration 0:  log likelihood = -407.27031
Iteration 1:  log likelihood = -398.61876
Iteration 2:  log likelihood = -397.39561
Iteration 3:  log likelihood = -397.33389
Iteration 4:  log likelihood = -397.33374

Logit estimates                                   Number of obs   =        591
                                                  LR chi2(1)      =      19.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -397.33374                       Pseudo R2       =     0.0244

------------------------------------------------------------------------------
         sil |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lifepart |   .0493962   .0139827     3.53   0.000     .0219906    .0768017
       _cons |  -.0989464   .1091532    -0.91   0.365    -.3128828      .11499
------------------------------------------------------------------------------
12. Interpretation
- For every increase of one partner, the odds of SIL increase by about 5% (exp(0.0493962) ≈ 1.05).
- If we are interested instead in estimating the increased odds for every 5 partners, we can use the formula:
- OR(5) = exp(5 × 0.0493962) ≈ 1.28
- For every increase of 5 additional partners, the odds of SIL increase by a factor of about 1.28.
- Be careful: the validity of such statements may be questionable, since the additional risk of SIL at 10 partners may be quite different from that at 5 partners, but this is an unavoidable dilemma when continuous variables are modeled linearly in the logit.
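The 5-partner odds ratio and its confidence interval can be computed from the lifepart output above with the OR(c) formula from the earlier slide; a quick Python check:

```python
import math

beta, se, c = 0.0493962, 0.0139827, 5  # lifepart estimates; 5-partner increment

or_c = math.exp(c * beta)  # ≈ 1.28
ci = (math.exp(c * beta - 1.96 * c * se),   # lower 95% bound
      math.exp(c * beta + 1.96 * c * se))   # upper 95% bound

print(round(or_c, 2), tuple(round(x, 2) for x in ci))
```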
13. Significance testing
- Does the model that includes the variable in question tell more about the outcome variable than a model that does not include that variable?
- In general you are comparing observed values of the response variable to the predicted values obtained from models with and without the variable in question.
- The log likelihood ratio test statistic is obtained by multiplying the difference between the two log likelihoods by -2.
- It follows a chi-square distribution with degrees of freedom equal to the difference between the number of parameters in the two models.
14. LR Test

. xi: logistic sil i.hpv2
i.hpv2            _Ihpv2_0-1          (naturally coded; _Ihpv2_0 omitted)

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(1)      =     155.15
                                                  Prob > chi2     =     0.0000
Log likelihood = -332.30027                       Pseudo R2       =     0.1893

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Ihpv2_1 |   17.30026   5.121828     9.63   0.000     9.683894    30.90687
------------------------------------------------------------------------------

. lrtest, saving(0)
15. LR Test

. xi: logistic sil i.hpv2 income
i.hpv2            _Ihpv2_0-1          (naturally coded; _Ihpv2_0 omitted)

Logit estimates                                   Number of obs   =        592
                                                  LR chi2(2)      =     156.07
                                                  Prob > chi2     =     0.0000
Log likelihood = -329.84288                       Pseudo R2       =     0.1913

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Ihpv2_1 |    17.3712    5.14748     9.63   0.000     9.718504    31.04989
      income |   .9664517   .0766365    -0.43   0.667     .8273373    1.128958
------------------------------------------------------------------------------

. lrtest, using(0)
Warning: observations differ: 595 vs. 592

Logistic: likelihood-ratio test                   chi2(-1)    =      -4.91
                                                  Prob > chi2 =          .
16. LR Test: adding income to the model
- -2 × (-332.30027 - (-329.84288)) = -2 × (-2.45739) = 4.91
- P ≈ 0.03 (chi-square with 1 df)
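The arithmetic above can be reproduced directly. A small Python sketch (the chi-square upper tail with 1 df has the closed form erfc(sqrt(x/2)); note that, as Stata warned, the two models were fit on slightly different samples, 595 vs. 592 observations, so the comparison is only approximate):

```python
import math

ll_reduced = -332.30027  # log likelihood, model with hpv2 only
ll_full    = -329.84288  # log likelihood, model with hpv2 + income

lr = -2 * (ll_reduced - ll_full)  # ≈ 4.91
# Upper-tail chi-square probability with 1 df: P(X > x) = erfc(sqrt(x/2))
p = math.erfc(math.sqrt(lr / 2))

print(round(lr, 2), round(p, 3))
```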
17. Wald test
- Obtained by comparing the maximum likelihood estimate of the slope parameter, β1, to an estimate of its standard error.
- W = β1 / SE(β1)
- A z test; the distribution is approximately standard normal.
- Gives approximately the same answer as the LR test in large samples, but the two may differ in smaller samples.
- The LR test seems to perform better in most situations.
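The Wald z statistics and p-values in the logit output from slide 3 can be reproduced from the coefficients and standard errors; a minimal Python sketch:

```python
import math

def wald(beta, se):
    """Wald z statistic and its two-sided p-value under the standard normal."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return z, p

z_hpv2, p_hpv2 = wald(2.847197, 0.2961851)  # hpv2: z ≈ 9.61
z_age, p_age = wald(-0.0052171, 0.014761)   # age:  z ≈ -0.35, p ≈ 0.724

print(round(z_hpv2, 2), round(z_age, 2), round(p_age, 3))
```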
18. Variable selection
- Once we have obtained a model that contains the essential variables, examine the variables in the model more closely.
- Then check for interactions. For example, an interaction between HPV status and race would indicate that the slope coefficient for HPV differs by race/ethnicity.
19. Interaction assessment
- Can take many forms.
- When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of the covariate.
- The covariate modifies the effect of the risk factor.
- Consider a model with a dichotomous risk factor and the covariate age. If the association between the covariate (age) and the outcome variable is the same within each level of the risk factor, then there is no interaction between the covariate and the risk factor.
- Graphically, no interaction yields a model with parallel lines.
20. Interaction assessment
- Models need to be hierarchically well formulated (HWF).
- Decide whether the effect of each variable varies importantly across race/ethnic categories.
- To make this decision, first note the magnitude of the difference between the ORs across strata.
- The tests of homogeneity have low power for detecting moderate effect-measure modification.
- The p-value on this test indicates whether there is adequate statistical power in the data to detect a difference, but a high p-value does not mean there is no effect-measure modification.
- In general, the significance level for heterogeneity worth exploring further should be set around 0.20-0.25.
21. Observe stratum-specific estimates using stratified analysis.

. cc sil hpv2, by(race)

                 |        OR    [95% Conf. Interval]    M-H Weight
-----------------+------------------------------------------------
           white |  28.88889     8.69751    148.2404     .8790698  (exact)
        african- |  7.520408    2.891786    21.71961     1.973154  (exact)
        hispanic |      20.5    6.885033    81.30018     1.073593  (exact)
-----------------+------------------------------------------------
           Crude |  17.30026    9.556989    33.34827               (exact)
    M-H combined |  15.85477    8.815859    28.51382
-----------------+------------------------------------------------
Test of homogeneity (M-H)    chi2(2) =    3.80    Pr>chi2 = 0.1494

Test that combined OR = 1:
    Mantel-Haenszel chi2(1) = 123.18
    Pr>chi2 = 0.0000
22. Interaction assessment
- Use product terms in a multivariable logistic regression model in order to identify potential effect-measure modification (interaction) while adjusting for confounders.
- The p-value on the product term can be interpreted as a test of homogeneity.
- To model a product term for two continuous variables, a term must be created for the product of the two variables. The product term is entered into the model, along with each of the two variables.
- The xi command in STATA creates all of the required product terms for modeling interaction if at least one of the two variables is categorical. With this command the two variables do not have to be entered separately in the model because STATA does it for you.
- When entering a product term between a categorical and a continuous variable in a logistic regression model, we evaluate whether the entire dose-response of the continuous variable differs across strata of the categorized variable.
23.

. xi: logistic sil i.hpv2 i.race

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(3)      =     166.35
                                                  Prob > chi2     =     0.0000
Log likelihood = -326.70089                       Pseudo R2       =     0.2029

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Ihpv2_1 |   15.86857   4.726791     9.28   0.000     8.850939    28.45026
    _Irace_2 |   .5410994   .1348136    -2.47   0.014     .3320491    .8817628
    _Irace_3 |   .4955735   .1097656    -3.17   0.002     .3210507    .7649667
------------------------------------------------------------------------------
24.

. xi: logistic sil i.hpv2 i.race i.race*hpv2

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(5)      =     170.04
                                                  Prob > chi2     =     0.0000
Log likelihood = -324.85603                       Pseudo R2       =     0.2074

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Ihpv2_1 |   28.88889   17.72576     5.48   0.000     8.678548    96.16446
    _Irace_2 |   .6467662   .1711091    -1.65   0.100     .3850814    1.086281
    _Irace_3 |    .515873   .1214782    -2.81   0.005     .3251631    .8184354
_IracXhpv2_2 |   .2603218   .1996824    -1.75   0.079      .057888    1.170666
_IracXhpv2_3 |   .7096154   .5830384    -0.42   0.676     .1417926    3.551341
------------------------------------------------------------------------------
25. Interaction assessment
- Use a chunk test for the entire collection of interaction terms.
- Use an LR test comparing the main-effects model with the fuller model.

. lrtest, using(0)

Logistic: likelihood-ratio test                   chi2(-2)    =      -3.69
                                                  Prob > chi2 =       0.15

- If the interaction terms are not significant, drop them from the model.
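The chunk test above can be reproduced from the two log likelihoods (slides 23 and 24); with 2 df, the chi-square upper tail has the simple closed form exp(-x/2). A quick Python check:

```python
import math

ll_main = -326.70089  # main-effects model (i.hpv2 i.race)
ll_full = -324.85603  # model adding the two race x hpv2 product terms

lr = -2 * (ll_main - ll_full)  # ≈ 3.69, on 2 df
p = math.exp(-lr / 2)          # chi-square upper tail probability, df = 2

print(round(lr, 2), round(p, 3))
```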
26. Interaction Assessment
- If the interaction term is retained in the model, the estimated ORs for other variables confounded by race (or another modifier) should not be obtained from a model that enters race (or the other modifier) in suboptimal form for the purpose of obtaining stratum-specific estimates.
- In an analysis that aims to estimate the effects of several variables, we may use several different models to estimate the effects of interest. In this case, our goal is not the elaboration of a final model.
27. Confounding Assessment
- Confounding describes a covariate that is associated both with the outcome variable of interest and with the primary independent variable or risk factor of interest, but is not an intermediate variable in the causal pathway.
- When both variables are present, the relationship is said to be confounded.
- Adjustment for confounding is only appropriate when there is no interaction.
- Decisions regarding whether or not to adjust for potential confounding variables will depend on a combined assessment of prior knowledge, observed associations in the data, and sample size considerations.
28. Confounding Assessment
- In practice we look for empirical evidence of confounding in data obtained from study populations; however, we must keep in mind that what we observe in such data may reflect selection and information bias affecting observed confounder-disease-exposure associations in a similar way to how these biases affect observed exposure-disease associations.
- Therefore, it is necessary to rely on prior knowledge of relevant associations in source populations. For example, if a variable is a known confounder but does not appear to be one in the data, this should create uncertainty regarding the validity of the data. Adjusting for this factor may not change the relative risk point estimate, but it may influence the standard error for this estimate, thus appropriately reflecting our uncertainty.
- If prior knowledge suggests a variable should not be a confounder but it appears to be one in the data, the confounding may have been introduced by the study methods (e.g., as a result of matching in a case-control design). In this case it would be appropriate to adjust for this factor, if it is not an intervening (intermediate) variable.
29. Confounding Assessment
- When prior knowledge regarding exposure-covariate associations is insufficient and the number of covariates to consider is small, it may be desirable to adjust for all variables that appear to be important risk factors for the outcome, as long as they are not intervening variables.
- When a large number of potential confounders must be considered, the change-in-estimate variable selection strategy has been shown in simulation studies to produce more valid results for confounder detection than strategies that rely on p-values, unless the significance level for the p-value is raised to 0.2 or higher. There is more than one reasonable approach to variable selection in such situations; the important thing is that the criterion for selection should be explicit and consistently applied.
- When examining continuous variables as potential confounders, careful assessment of the dose-response for the purpose of choosing the optimal scale or categorization should be carried out prior to the assessment of confounding. Inappropriate modeling of continuous variables may lead to incorrect decisions about whether or not to adjust for these variables and/or to inadequate control of confounding.
30. Confounding Assessment
- Begin with a model of the exposure-disease relationship. For example, if we are interested in the association between number of lifetime partners and SIL:

. xi: logistic sil i.part1_4
i.part1_4         _Ipart1_4_0-3       (naturally coded; _Ipart1_4_0 omitted)

Logit estimates                                   Number of obs   =        593
                                                  LR chi2(3)      =      47.91
                                                  Prob > chi2     =     0.0000
Log likelihood = -384.71                          Pseudo R2       =     0.0586

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Ipart1_4_1 |   2.113856   .5010233     3.16   0.002     1.328387    3.363768
 _Ipart1_4_2 |   2.793651   .6974642     4.11   0.000     1.712619    4.557047
 _Ipart1_4_3 |   5.120594   1.277563     6.55   0.000     3.140146    8.350084
------------------------------------------------------------------------------
31. Confounding Assessment

. xi: logistic sil i.curr_smk
i.curr_smk        _Icurr_smk_0-1      (naturally coded; _Icurr_smk_0 omitted)

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(1)      =      21.26
                                                  Prob > chi2     =     0.0000
Log likelihood = -399.24786                       Pseudo R2       =     0.0259

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Icurr_smk_1 |   2.349306   .4454758     4.50   0.000     1.620073    3.406784
------------------------------------------------------------------------------
32. Confounding Assessment

. xi: logistic sil i.curr_smk i.hpv2

Logit estimates                                   Number of obs   =        595
                                                  LR chi2(2)      =     161.58
                                                  Prob > chi2     =     0.0000
Log likelihood = -329.08559                       Pseudo R2       =     0.1971

------------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Icurr_smk_1 |   1.724126   .3709956     2.53   0.011     1.130859    2.628631
    _Ihpv2_1 |   15.96255   4.749559     9.31   0.000     8.909072     28.6004
------------------------------------------------------------------------------
33. Confounding Assessment
- Examine the OR for the main risk factor to determine whether it is meaningfully different from the OR obtained when controlling for the potential confounder.
- For the previous examples the ORs are 2.34 vs. 1.72.
- Can examine using the LR test.
- May want to include the confounder if others would not trust the results without the adjustment.
- May want to use a decision rule, depending on the subject matter (e.g., a 10% change in the OR), and report the criterion.
- Several have suggested a backward deletion strategy whereby you enter all main effects and confounders in the model and eliminate, one by one, the confounder that makes the smallest difference in the exposure-effect estimate.
- Biases resulting from multiple confounders may cancel each other out or produce results that may not be easy to disentangle. Consider the following.
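The 10% change-in-estimate rule from the bullets above, applied to the smoking example (crude OR 2.35 from slide 31 vs. HPV-adjusted OR 1.72 from slide 32), can be sketched as follows; note that conventions differ on whether to divide by the crude or the adjusted estimate, and the adjusted estimate is used here:

```python
crude_or = 2.349306  # OR for current smoking, unadjusted
adj_or = 1.724126    # OR for current smoking, adjusted for hpv2

# Relative change in the exposure-effect estimate after adjustment
change = abs(crude_or - adj_or) / adj_or
print(round(100 * change, 1))  # percent change, well above 10%

# A common decision rule: retain the confounder if the change exceeds 10%
if change > 0.10:
    print("retain hpv2 when estimating the smoking-SIL association")
```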
34. Model Checking
- When to run diagnostics, and why?
- This depends on what we want the model to do.
- If our goal is to obtain approximately valid summary estimates of effect for a few key relationships, less rigorous checking is required.
- If our goal is to predict outcomes for a given set of factors, more detailed checking is required.
35. Model Checking
- Assuming the model contains those variables that should be in the model, we want to know how effectively the model we have describes the data (goodness of fit).
- A good-fitting model is not the same as a correct model.
- Regression diagnostics can detect discrepancies between a model and data only within the range of the data and only if there are enough observations for adequate diagnostic power.
- A model may appear to fit well in the central range of the data, but produce poor predictions for covariate values that are not well represented in the data.
36. Model Checking
- Computation and evaluation of overall summary measures.
- Examination of individual components of the summary statistics, often graphically.
- Examination of other measures of the difference between components of the observed values versus the predicted values.
37. Model Checking
- Check model results against results from stratified analysis.
- Log-likelihood ratio test (also called the deviance test).
- Tests of regression: test the hypothesis that all the regression coefficients (except the intercept) are zero.
- The R² is not recommended for use; it may give a distorted impression.
38. Model Checking
- Tests of fit (see RG, pp. 409-10 for test details):
  - Test for nonrandom incompatibilities between a model and data.
  - Compare the fit of an index model to a more elaborate reference model that contains it.
  - A small p-value suggests that the index model fits poorly relative to the reference model, that is, that the additional terms in the reference model improve the fit.
  - A large p-value does not mean that the index model fits well, only that the test did not detect an improvement in fit from the additional terms in the reference model.
- Pearson residual (examining (observed - predicted)/SE).
- Deviance residual.
- Hosmer-Lemeshow goodness-of-fit statistic: a grouping-based method based on values of the estimated probabilities.
- Classification tables: the table is the result of cross-classifying the outcome variable with a dichotomous variable whose values are derived from the estimated logistic probabilities.
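The Pearson residual named above can be sketched for grouped (covariate-pattern) data; a minimal Python illustration with hypothetical counts (the numbers below are not from the SIL data):

```python
import math

def pearson_residual(y, n, p_hat):
    """Pearson residual for one covariate pattern: (observed - predicted) / SE.

    y     -- observed number of events in the pattern
    n     -- number of observations in the pattern
    p_hat -- model-estimated event probability for the pattern
    """
    expected = n * p_hat
    se = math.sqrt(n * p_hat * (1 - p_hat))  # binomial standard error
    return (y - expected) / se

# Hypothetical pattern: 30 events among 100 subjects, model predicts 0.25
r = pearson_residual(30, 100, 0.25)
print(round(r, 3))
```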
39. Multicollinearity
- Occurs when one or more independent variables can be approximately determined by some of the other variables in the model.
- When there is multicollinearity, the estimated regression coefficients of the fitted model can be highly unreliable.
40. Multiple testing
- Arises from the many tests of significance that are typically carried out when selecting or eliminating variables from the model.
- Makes it more likely to obtain statistical significance even if no real association exists.
41. Influential observations
- Refers to data on individuals that may have a large influence on the estimated regression coefficients.
- Methods for assessing the possibility of influential observations should be considered.