Logistic Regression in STATA - PowerPoint PPT Presentation
Provided by: stort

Transcript and Presenter's Notes
1
Logistic Regression in STATA
  • The logistic regression programs in STATA use
    maximum likelihood estimation to generate the
    logit (the logistic regression coefficient, which
    corresponds to the natural log of the OR for each
    one-unit increase in the level of the regressor
    variable).
  • The resulting ORs are maximum-likelihood
    estimates (MLEs) of the uniform effect (OR)
    across strata of the model covariates. Thus they
    are pooled (uniform, common) estimates of the OR
    and in this sense are adjusted for all regressors
    included in the model.

2
STATA Logistic Regression Commands
  • The logistic command in STATA yields odds
    ratios.
  • . logistic sil hpv2 age

    Logit estimates                           Number of obs =    595
                                              LR chi2(2)    = 155.28
                                              Prob > chi2   = 0.0000
    Log likelihood = -332.23774               Pseudo R2     = 0.1894

    ------------------------------------------------------------------------
         sil | Odds Ratio   Std. Err.      z    P>|z|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
        hpv2 |   17.23939     5.10605    9.61   0.000    9.647363   30.80598
         age |   .9947965    .0146842   -0.35   0.724    .9664283   1.023997
    ------------------------------------------------------------------------

3
STATA Logistic Regression Commands
  • The logit command in STATA yields the actual
    beta coefficients.
  • . logit sil hpv2 age

    Logit estimates                           Number of obs =    595
                                              LR chi2(2)    = 155.28
                                              Prob > chi2   = 0.0000
    Log likelihood = -332.23774               Pseudo R2     = 0.1894

    ------------------------------------------------------------------------
         sil |      Coef.   Std. Err.      z    P>|z|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
        hpv2 |   2.847197    .2961851    9.61   0.000    2.266685   3.427709
         age |  -.0052171     .014761   -0.35   0.724   -.0341482   .0237139
       _cons |  -.2873639    .4075825   -0.71   0.481   -1.086211   .5114832
    ------------------------------------------------------------------------
  • Note that 17.23939 = exp(2.847197), as in the
    previous slide
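This relationship between the two commands can be checked directly; a quick stdlib-only sketch using the coefficient and OR reported above:

```python
import math

# Coefficient for hpv2 from the logit command
beta_hpv2 = 2.847197

# Exponentiating the logit coefficient recovers the odds ratio
# reported by the logistic command (17.23939)
print(round(math.exp(beta_hpv2), 4))  # 17.2394
```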

4
STATA Commands for Multilevel Categorical
Variables in Logistic Regression Models
  • Categorized continuous variables should be
    entered in regression models as a series of
    indicator variables:
  • for each category, a variable is created in which
    observations falling in that category are coded
    "1" and all other observations are coded "0";
    thus the variable is represented in the model as
    a series of indicator terms, with the reference
    category left out of the model.
  • Any categories of a variable that are left out of
    the model become part of the reference group
    (because those observations will be coded 0 for
    each indicator term left in the model).
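As a sketch of what this coding looks like (the variable and its values here are hypothetical, with level 1 as the reference):

```python
# Hypothetical 3-level categorical variable; level 1 is the reference.
race = [1, 2, 3, 2, 1, 3]

# One 0/1 indicator column per non-reference level.
race_2 = [1 if r == 2 else 0 for r in race]
race_3 = [1 if r == 3 else 0 for r in race]

print(race_2)  # [0, 1, 0, 1, 0, 0]
print(race_3)  # [0, 0, 1, 0, 0, 1]
# Observations with race == 1 are 0 in both columns: the reference group.
```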

5
STATA Commands for Multilevel Categorical
Variables in Logistic Regression Models
  • If categorized continuous variables are entered
    in models as if they were continuous, that is, as
    one term rather than a series of indicator
    variables, the program will treat the values as a
    continuous distribution, with each observation in
    a category having the same value. The resulting
    odds ratio will correspond to each one unit
    increase in the category coding. This will not
    produce a meaningful result unless the coding can
    be interpreted as linear increments from one
    category to another.
  • STATA has a convenient command that makes it
    unnecessary to create the indicator terms for
    multilevel categorical variables. The xi
    command creates a series of indicator variables
    for variables marked i.variablename by
    recognizing each value as a category. When used
    with the logistic or logit commands, STATA uses
    the lowest value as the reference category, which
    it drops out of the model. It is necessary to
    make sure that the variable coding reflects the
    desired categorization and reference level.

6
  • . xi: logistic sil i.partner
  • i.partner

    Logit estimates                           Number of obs =    580
                                              LR chi2(12)   =  51.48
                                              Prob > chi2   = 0.0000
    Log likelihood = -374.7638                Pseudo R2     = 0.0643

    ------------------------------------------------------------------------
             sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
    -------------+----------------------------------------------------------
     _Ipartner_2 |   2.008163   .5687573    2.46   0.014   1.152708  3.498473
     _Ipartner_3 |   2.231293   .6428015    2.79   0.005    1.26864  3.924413
     _Ipartner_4 |    2.40158   .7180631    2.93   0.003   1.336568  4.315221
     _Ipartner_5 |   3.322094   1.050245    3.80   0.000   1.787778  6.173199
     _Ipartner_6 |   2.693878   1.338111    2.00   0.046   1.017575   7.13164
     _Ipartner_7 |   7.836735   6.346424    2.54   0.011   1.602531  38.32338
     _Ipartner_9 |   13.71429   14.85759    2.42   0.016    1.64063  114.6399
    _Ipartner_10 |   7.183673   2.981345    4.75   0.000   3.184811  16.20353
    _Ipartner_12 |   5.877551   6.864814    1.52   0.129   .5956855  57.99303
    ------------------------------------------------------------------------

7
Note: the constant from the logit model is ln(odds
of the reference category); for each term,
logit = ln(odds) - constant, where
constant = ln(odds of reference category),
and OR = exp(logit)
8
Interpreting logistic regression model
coefficients for continuous variables
  • When a logistic regression model contains a
    continuous independent variable, interpretation
    of the estimated coefficient depends on how it is
    entered into the model and the particular units
    of the variable
  • To interpret the coefficient, we assume that the
    logit is linear in the variable
  • The slope coefficient gives the change in the log
    odds for an increase of 1 unit in x.

9
Interpreting logistic regression model
coefficients for continuous variables
  • Sometimes a one-unit increase may not be
    meaningful or considered important
  • If we are interested in estimating the increased
    odds for, say, every 5-year increase, we can use
    the formulas
  • OR(c) = exp(c x ß1)
  • 95% CI = exp(c x ß1 +/- 1.96 x c x SE(ß1))
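A minimal sketch of these two formulas (the function names and the example coefficient below are illustrative, not from the slides):

```python
import math

def or_c(beta, c):
    # OR for a c-unit increase in the regressor
    return math.exp(c * beta)

def ci_c(beta, se, c, z=1.96):
    # 95% CI for the c-unit OR: exp(c*beta +/- z*c*SE(beta))
    return (math.exp(c * beta - z * c * se),
            math.exp(c * beta + z * c * se))

# Illustrative values: beta = 0.05 per year, SE = 0.014, c = 5 years
print(round(or_c(0.05, 5), 3))     # 1.284
lo, hi = ci_c(0.05, 0.014, 5)
print(round(lo, 3), round(hi, 3))  # 1.119 1.473
```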

10
P-values for Trend
  • The term trend generally refers to a monotonic,
    though not necessarily linear, association
    between increasing levels of exposure and the
    probability of the outcome.
  • Although examining effect estimates and
    confidence intervals over levels of exposure is
    the most informative manner of evaluating a
    dose-response trend, it has been conventional to
    report p-values for the null hypothesis of no
    monotonic association between exposure and
    disease, that is, a test for trend.
  • The p-value corresponding to the coefficient of a
    variable entered on an appropriate continuous
    scale in a regression model can be interpreted as
    a p-value for trend. For statistical hypothesis
    testing of trend in stratified analysis, see
    Rothman and Greenland.

11
  • . logit sil lifepart

    Iteration 0:  log likelihood = -407.27031
    Iteration 1:  log likelihood = -398.61876
    Iteration 2:  log likelihood = -397.39561
    Iteration 3:  log likelihood = -397.33389
    Iteration 4:  log likelihood = -397.33374

    Logit estimates                           Number of obs =    591
                                              LR chi2(1)    =  19.87
                                              Prob > chi2   = 0.0000
    Log likelihood = -397.33374               Pseudo R2     = 0.0244

    ------------------------------------------------------------------------
         sil |      Coef.   Std. Err.      z    P>|z|   [95% Conf. Interval]
    ---------+--------------------------------------------------------------
    lifepart |   .0493962    .0139827    3.53   0.000    .0219906   .0768017
       _cons |  -.0989464    .1091532   -0.91   0.365   -.3128828     .11499
    ------------------------------------------------------------------------

12
Interpretation
  • For every one-unit increase in the number of
    partners, the odds of SIL increase about 5%
    (exp(.0493962) = 1.05).
  • If we are interested instead in the increased
    odds for every 5 additional partners, we can use
    the formula
  • OR(5) = exp(5 x .0493962) = 1.28
  • For every increase of 5 partners, the odds of
    SIL increase about 1.28 times
  • Be careful: the validity of such statements may
    be questionable, since the additional risk of SIL
    in going from 5 to 10 partners may be quite
    different from going from 0 to 5, but this is an
    unavoidable dilemma when continuous variables are
    modeled linearly in the logit.
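As an arithmetic check of OR(c) = exp(c x ß1) with the lifepart coefficient from the output above (note exp(5 x .0493962) works out to about 1.28):

```python
import math

beta = 0.0493962  # lifepart coefficient from the logit output above

print(round(math.exp(beta), 3))      # 1.051 -> about a 5% increase per partner
print(round(math.exp(5 * beta), 2))  # 1.28  -> OR for 5 additional partners
```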

13
Significance testing
  • Does the model that includes the variable in
    question tell more about the outcome variable
    than a model that does not include that variable?
  • In general you are comparing observed values of
    the response variable to the predicted values
    obtained from models with and without the
    variables in question
  • The log likelihood ratio test statistic is
    obtained by multiplying the difference between
    the two log likelihoods by -2
  • It follows a chi-square distribution, with
    degrees of freedom equal to the difference
    between the number of parameters in the 2 models

14
LR Test
. xi: logistic sil i.hpv2
i.hpv2            _Ihpv2_0-1     (naturally coded; _Ihpv2_0 omitted)

Logit estimates                           Number of obs =    595
                                          LR chi2(1)    = 155.15
                                          Prob > chi2   = 0.0000
Log likelihood = -332.30027               Pseudo R2     = 0.1893

------------------------------------------------------------------------
      sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
----------+-------------------------------------------------------------
 _Ihpv2_1 |   17.30026    5.121828    9.63   0.000   9.683894   30.90687
------------------------------------------------------------------------

. lrtest, saving(0)
15
LR Test
  • . xi: logistic sil i.hpv2 income
  • i.hpv2            _Ihpv2_0-1     (naturally coded; _Ihpv2_0 omitted)

    Logit estimates                           Number of obs =    592
                                              LR chi2(2)    = 156.07
                                              Prob > chi2   = 0.0000
    Log likelihood = -329.84288               Pseudo R2     = 0.1913

    ------------------------------------------------------------------------
          sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
    ----------+-------------------------------------------------------------
     _Ihpv2_1 |    17.3712     5.14748    9.63   0.000   9.718504   31.04989
       income |   .9664517    .0766365   -0.43   0.667   .8273373   1.128958
    ------------------------------------------------------------------------

  • . lrtest, using(0)
  • Warning: observations differ: 595 vs. 592
  • Logistic: likelihood-ratio test    chi2(-1) = -4.91
  •                                    Prob > chi2 = .

16
LR Test adding income to the model
  • -2 x (-332.30027 - (-329.84288)) = 4.91
  • P = 0.03
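This by-hand LR computation can be checked with a stdlib-only sketch; for 1 degree of freedom the chi-square tail probability is available via the complementary error function:

```python
import math

ll_reduced = -332.30027  # log likelihood, sil ~ hpv2
ll_full    = -329.84288  # log likelihood, sil ~ hpv2 + income

# LR statistic: -2 times the difference in log likelihoods
lr = -2 * (ll_reduced - ll_full)

# For 1 df, P(chi2 > x) = erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(lr / 2))

print(round(lr, 2))  # 4.91
print(round(p, 3))   # 0.027
```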

17
Wald test
  • Obtained by comparing the maximum likelihood
    estimate of the slope parameter, ß1, to an
    estimate of the standard error
  • W = ß1 / SE(ß1)
  • A Z test, the distribution is approximately
    standard normal
  • Gives approximately equal answer to the LR test
    in large samples but may be different in smaller
    samples
  • LR test seems to perform better in most situations
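For example, a Wald test for the lifepart coefficient reported earlier (stdlib-only; the two-sided standard-normal p-value uses erfc):

```python
import math

beta, se = 0.0493962, 0.0139827  # lifepart estimate and SE from earlier output

w = beta / se                         # Wald z statistic
p = math.erfc(abs(w) / math.sqrt(2))  # two-sided standard-normal p-value

print(round(w, 2))  # 3.53
print(round(p, 4))  # 0.0004
```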

18
Variable selection
  • Once we have obtained a model that contains
    essential variables, examine the variables in the
    model more closely
  • Then check for interactions. For example, an
    interaction between HPV status and race would
    indicate that the slope coefficient for HPV
    differs by race/ethnicity

19
Interaction assessment
  • Can take many forms
  • When interaction is present, the association
    between the risk factor and the outcome variable
    differs, or depends in some way on the level of
    the covariate
  • The covariate modifies the effect of the risk
    factor
  • Consider a model with a dichotomous risk factor
    and the covariate, age. If the association
    between the covariate (age) and the outcome
    variable is the same within each level of the
    risk factor, then there is no interaction between
    the covariate and the risk factor
  • Graphically, no interaction yields a model with
    parallel lines

20
Interaction assessment
  • Models need to be hierarchically well formulated
    (HWF)
  • Decide whether the effect of each variable varies
    importantly across race/ethnic categories.
  • To make this decision, first note the magnitude
    of the difference between the ORs across strata.
  • The tests of homogeneity have low power for
    detecting moderate effect-measure modification.
  • The p-value on this test indicates whether there
    is adequate statistical power in the data to
    detect a difference, but a high p-value does not
    mean there is no effect-measure modification.
  • In general, the significance level for
    heterogeneity worth exploring further should be
    set around 0.20-0.25

21
Observe stratum-specific estimates using
stratified analysis.
  • . cc sil hpv2, by(race)

                 |        OR     [95% Conf. Interval]    M-H Weight
    -------------+----------------------------------------------------
           white |  28.88889      8.69751   148.2404      .8790698 (exact)
        african- |  7.520408     2.891786   21.71961      1.973154 (exact)
        hispanic |      20.5     6.885033   81.30018      1.073593 (exact)
    -------------+----------------------------------------------------
           Crude |  17.30026     9.556989   33.34827               (exact)
    M-H combined |  15.85477     8.815859   28.51382
    ----------------------------------------------------------------

    Test of homogeneity (M-H)    chi2(2) =   3.80    Pr>chi2 = 0.1494
    Test that combined OR = 1:
        Mantel-Haenszel chi2(1) = 123.18             Pr>chi2 = 0.0000

22
Interaction assessment
  • Use product terms in a multivariable logistic
    regression model in order to identify potential
    effect-measure modification (interaction) while
    adjusting for confounders.
  • The p-value on the product term can be
    interpreted as a test of homogeneity.
  • To model a product term for two continuous
    variables, a term must be created for the product
    of the two variables. The product term is entered
    into the model, along with each of the two
    variables.
  • The xi command in STATA creates all of the
    required product terms for modeling interaction
    if at least one of the two variables is
    categorical. With this command the two variables
    do not have to be entered separately in the model
    because STATA does it for you.
  • When entering a product term between a
    categorical and a continuous variable in a
    logistic regression model, we evaluate whether
    the entire dose-response of the continuous
    variable differs across strata of the categorized
    variable.

23
  • . xi: logistic sil i.hpv2 i.race

    Logit estimates                           Number of obs =    595
                                              LR chi2(3)    = 166.35
                                              Prob > chi2   = 0.0000
    Log likelihood = -326.70089               Pseudo R2     = 0.2029

    ------------------------------------------------------------------------
          sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
    ----------+-------------------------------------------------------------
     _Ihpv2_1 |   15.86857    4.726791    9.28   0.000   8.850939   28.45026
     _Irace_2 |   .5410994    .1348136   -2.47   0.014   .3320491   .8817628
     _Irace_3 |   .4955735    .1097656   -3.17   0.002   .3210507   .7649667
    ------------------------------------------------------------------------

24
  • . xi: logistic sil i.hpv2 i.race i.race*hpv2

    Logit estimates                           Number of obs =    595
                                              LR chi2(5)    = 170.04
                                              Prob > chi2   = 0.0000
    Log likelihood = -324.85603               Pseudo R2     = 0.2074

    ------------------------------------------------------------------------
             sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
    -------------+----------------------------------------------------------
        _Ihpv2_1 |   28.88889   17.72576    5.48   0.000   8.678548  96.16446
        _Irace_2 |   .6467662   .1711091   -1.65   0.100   .3850814  1.086281
        _Irace_3 |    .515873   .1214782   -2.81   0.005   .3251631  .8184354
    _IracXhpv2_2 |   .2603218   .1996824   -1.75   0.079    .057888  1.170666
    _IracXhpv2_3 |   .7096154   .5830384   -0.42   0.676   .1417926  3.551341
    ------------------------------------------------------------------------
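With this coding, the OR for hpv2 in the reference race stratum is the _Ihpv2_1 estimate, and multiplying it by each product-term OR recovers the stratum-specific ORs from the stratified cc analysis shown earlier:

```python
or_ref     = 28.88889   # _Ihpv2_1: OR for hpv2 in the reference race stratum
prod_race2 = 0.2603218  # _IracXhpv2_2 product-term OR
prod_race3 = 0.7096154  # _IracXhpv2_3 product-term OR

print(round(or_ref * prod_race2, 2))  # 7.52 -- matches the stratified estimate
print(round(or_ref * prod_race3, 1))  # 20.5 -- matches the stratified estimate
```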

25
Interaction assessment
  • Use chunk test for entire collection of
    interaction terms
  • Use LR test comparing main effects model with
    fuller model
  • . lrtest, using(0)
  • Logistic: likelihood-ratio test    chi2(2) = 3.69
  •                                    Prob > chi2 = 0.15
  • If the interaction term is not significant then
    drop the interaction term

26
Interaction Assessment
  • If the interaction term is retained in the model,
    the estimated ORs for other variables confounded
    by race or another modifier should not be
    obtained from a model that enters race or
    another modifier in suboptimal form for the
    purpose of obtaining stratum-specific estimates.
  • In an analysis that aims to estimate effects of
    several variables, we may use several different
    models to estimate the effects of interest. In
    this case, our goal is not the elaboration of a
    final model.

27
Confounding Assessment
  • Confounding is used to describe a covariate that
    is associated both with the outcome variable of
    interest and the primary independent variable or
    risk factor of interest, but is not an
    intermediate variable in the causal pathway
  • When both associations are present, the
    exposure-outcome relationship is said to be
    confounded
  • Adjustment for a confounder is only appropriate
    when there is no interaction
  • Decisions regarding whether or not to adjust for
    potential confounding variables will depend on a
    combined assessment of prior knowledge, observed
    associations in the data, and sample size
    considerations

28
Confounding Assessment
  • In practice we look for empirical evidence of
    confounding in data obtained from study
    populations, however, we must keep in mind that
    what we observe in such data may reflect
    selection and information bias affecting observed
    confounder-disease-exposure associations in a
    similar way to how these biases affect observed
    exposure-disease associations.
  • Therefore, it is necessary to rely on prior
    knowledge of relevant associations in source
    populations. For example, if a variable is a
    known confounder but does not appear to be one in
    the data, this should create uncertainty
    regarding the validity of the data. Adjusting for
    this factor may not change the relative risk
    point estimate, but it may influence the standard
    error for this estimate, thus appropriately
    reflecting our uncertainty.
  • If prior knowledge suggests a variable should not
    be a confounder but it appears to be one in the
    data, the confounding may have been introduced by
    the study methods (e.g., as a result of matching
    in a case-control design). In this case it would
    be appropriate to adjust for this factor, if it
    is not an intervening (intermediate) variable.

29
Confounding Assessment
  • When prior knowledge regarding exposure-covariate
    associations is insufficient and the number of
    covariates to consider is small, it may be
    desirable to adjust for all variables that appear
    to be important risk factors for the outcome, as
    long as they are not intervening variables.
  • When a large number of potential confounders must
    be considered, the change-in-estimate variable
    selection strategy has been shown in simulation
    studies to produce more valid results for
    confounder detection than strategies that rely on
    p-values, unless the significance level for the
    p-value is raised to 0.2 or higher. There is more
    than one reasonable approach to variable
    selection in such situations; the important thing
    is that the criterion for selection should be
    explicit and consistently applied.
  • When examining continuous variables as potential
    confounders, careful assessment of the
    dose-response for the purpose of choosing the
    optimal scale or categorization should be carried
    out prior to the assessment of confounding.
    Inappropriate modeling of continuous variables
    may lead to incorrect decisions about whether or
    not to adjust for these variables and/or to
    inadequate control of confounding.

30
Confounding Assessment
  • Begin with a model with the exposure-disease
    relationship For example if we are interested in
    the association between number of lifetime
    partners and SIL
  • . xilogistic sil i.part1_4
  • i.part1_4 _Ipart1_4_0-3 (naturally
    coded _Ipart1_4_0 omitted)
  • Logit estimates
    Number of obs 593

  • LR chi2(3) 47.91

  • Prob gt chi2 0.0000
  • Log likelihood -384.71
    Pseudo R2 0.0586
  • --------------------------------------------------
    ----------------------------
  • sil Odds Ratio Std. Err. z
    Pgtz 95 Conf. Interval
  • -------------------------------------------------
    ----------------------------
  • _Ipart1_4_1 2.113856 .5010233 3.16
    0.002 1.328387 3.363768
  • _Ipart1_4_2 2.793651 .6974642 4.11
    0.000 1.712619 4.557047
  • _Ipart1_4_3 5.120594 1.277563 6.55
    0.000 3.140146 8.350084
  • -------

31
Confounding Assessment
  • . xi: logistic sil i.curr_smk
  • i.curr_smk        _Icurr_smk_0-1 (naturally coded; _Icurr_smk_0 omitted)

    Logit estimates                           Number of obs =    595
                                              LR chi2(1)    =  21.26
                                              Prob > chi2   = 0.0000
    Log likelihood = -399.24786               Pseudo R2     = 0.0259

    ------------------------------------------------------------------------
             sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
    -------------+-----------------------------------------------------------
    _Icurr_smk_1 |   2.349306   .4454758    4.50   0.000   1.620073   3.406784
    ------------------------------------------------------------------------

32
Confounding Assessment
  • . xi: logistic sil i.curr_smk i.hpv2

    Logit estimates                           Number of obs =    595
                                              LR chi2(2)    = 161.58
                                              Prob > chi2   = 0.0000
    Log likelihood = -329.08559               Pseudo R2     = 0.1971

    ------------------------------------------------------------------------
             sil | Odds Ratio   Std. Err.      z    P>|z|  [95% Conf. Interval]
    -------------+-----------------------------------------------------------
    _Icurr_smk_1 |   1.724126   .3709956    2.53   0.011   1.130859   2.628631
        _Ihpv2_1 |   15.96255   4.749559    9.31   0.000   8.909072    28.6004
    ------------------------------------------------------------------------

33
Confounding Assessment
  • Examine the OR for the main risk factor to
    determine whether it is meaningfully different
    than the OR when controlling for confounding
  • For previous examples the ORs are 2.34 vs. 1.72
  • Can examine using the LR test
  • May want to include the confounder if others
    would not trust the results if did not perform
    the adjustment
  • May want to use a decision rule, depending on the
    subject matter (e.g., a 10% change in the OR),
    and report the criterion
  • Several authors have suggested a backward
    deletion strategy whereby you enter all main
    effects and confounders in the model and then
    eliminate, one by one, the confounder whose
    removal makes the smallest difference in the
    exposure-effect estimate
  • Biases resulting from multiple confounders may
    cancel each other out or produce results that may
    not be easy to disentangle. Consider the
    following.
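A sketch of the 10% change-in-estimate rule applied to the smoking example above (here the change is measured relative to the crude OR; conventions vary):

```python
crude_or    = 2.349306  # current smoking, unadjusted
adjusted_or = 1.724126  # current smoking, adjusted for HPV

# Relative change in the estimate after adjustment
pct_change = abs(adjusted_or - crude_or) / crude_or * 100
print(round(pct_change, 1))  # 26.6 -- well above a 10% threshold
```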

34
Model Checking
  • When should we run diagnostics, and why?
  • This depends on what we want the model to do
  • If our goal is to obtain approximately valid
    summary estimates of effect for a few key
    relationships, less rigorous checking is required
  • If our goal is to predict outcomes for a given
    set of factors, more detailed checking is required

35
Model Checking
  • Assuming the model contains those variables that
    should be in the model, we want to know how
    effectively the model we have describes the data
    (goodness of fit)
  • A good-fitting model is not the same as a correct
    model
  • Regression diagnostics can detect discrepancies
    between a model and data only within the range of
    the data and only if there are enough
    observations for adequate diagnostic power
  • A model may appear to fit well in the central
    range of the data, but produce poor predictions
    for covariate values that are not well
    represented in the data

36
Model Checking
  • Computation and evaluation of overall measures
  • Examination of individual components of the
    summary statistics, often graphically
  • Examination of other measures of the difference
    between components of the observed values versus
    the predicted values

37
Model Checking
  • Check model results against results from
    stratified analysis
  • Log-likelihood Ratio Test (also called Deviance
    Test)
  • Tests of Regression
  • Test the hypothesis that all the regression
    coefficients (except the intercept) are zero
  • The R2 is not recommended for use, as it may give
    a distorted impression

38
Model Checking
  • Tests of fit (see RG, pp. 409-10 for test
    details)
  • Tests for nonrandom incompatibilities between a
    model and data
  • Compares the fit of an index model to a more
    elaborate reference model that contains it
  • Small p-value suggests that the index model fits
    poorly relative to the reference model, that is,
    that the additional terms in the reference model
    improve the fit
  • Large p-value does not mean that the index model
    fits well, only that the test did not detect an
    improvement in fit from the additional terms in
    the reference model
  • Pearson residual (examining (observed - predicted)/SE)
  • Deviance residual
  • Hosmer-Lemeshow goodness of fit statistic
  • A grouping-based method based on values of the
    estimated probabilities
  • Classification tables
  • Table is the result of cross-classifying the
    outcome variable with a dichotomous variable
    whose values are derived from the estimated
    logistic probabilities

39
Multicollinearity
  • Occurs when one or more independent variables can
    be approximately determined by some of the other
    variables in the model
  • When there is multicollinearity, the estimated
    regression coefficients of the fitted model can
    be highly unreliable

40
Multiple testing
  • Occurs from the many tests of significance that
    are typically carried out when selecting or
    eliminating variables from the model
  • Makes it more likely to obtain statistical
    significance even if no real association exists

41
Influential observations
  • Refers to data on individuals that may have a
    large influence on the estimated regression
    coefficients
  • Methods assessing the possibility of influential
    observations should be considered