Advanced regression analysis in Stata

1
Advanced regression analysis in Stata
  • ASSR Short Intensive Course
  • Herman van de Werfhorst
  • April 2009
  • Day 1: Introduction to Stata and multivariate
    modelling

2
Programme
  • Day 1
  • Introduction to Stata
  • Basics: the linear model (OLS)
  • Regression diagnostics
  • Logistic regression (logit)
  • Day 2
  • Mixed models: fixed and random effects
  • Applications of mixed models in Stata

3
Introduction to Stata
  • Data files: .dta
  • Do files: command logs
  • Ado files: additional programmes developed by the
    research community
  • Log files: saved output
  • Automatically generating publishable tables

4
Hands-on Stata
  • Creating a .do file
  • Starting an output log file
  • Missing values
  • Scale construction
  • Factor analysis
  • Reliability analysis

5
Scale construction
Reliability coefficient: Cronbach's alpha
N = number of items, r = average correlation
between items
Low reliability reduces correlations.
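
The slide's formula did not survive in the transcript. As a sketch, the standardized form of Cronbach's alpha uses exactly the two quantities named above: alpha = N*r / (1 + (N - 1)*r). That the slide showed this form (rather than the covariance-based one) is an assumption, and the function name is illustrative.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the number of items (N) and
    the average inter-item correlation (r)."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)

# Hypothetical scale: five items, average inter-item correlation 0.4.
alpha_5 = standardized_alpha(5, 0.4)
# Same average correlation, but ten items: reliability goes up.
alpha_10 = standardized_alpha(10, 0.4)
print(round(alpha_5, 3), round(alpha_10, 3))
```

Adding items raises alpha even when the average correlation stays the same.
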
6
Scale construction 1: The Likert scale
  • 1. Recode variables/items in the right direction.
    High scores should indicate a high position on
    the underlying dimension
  • 2. Take the average score on all selected items
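
The two steps above can be sketched as follows. The 1-5 answer range, the helper names, and the choice to average only over non-missing items are assumptions made for illustration; they are not taken from the course material.

```python
def reverse_code(score, low=1, high=5):
    """Flip an item so that a high score means a high position
    on the underlying dimension."""
    return low + high - score

def likert_scale(items):
    """Average of the non-missing item scores (None marks a missing value)."""
    valid = [s for s in items if s is not None]
    return sum(valid) / len(valid) if valid else None

# Hypothetical respondent: three 1-5 items, the second negatively worded.
raw = [4, 2, 5]
recoded = [raw[0], reverse_code(raw[1]), raw[2]]
score = likert_scale(recoded)
print(recoded, round(score, 2))
```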

7
Example: ESS 2002, immigration attitudes
8
Scale construction 2: Factor analysis
[Diagram: Item1 to Item5 linked to Factor 1 and Factor 2]
Some handy info:
  • Kim & Mueller, Factor analysis (Sage Publications, somewhere in
    the 1970s)
  • http://www.siu.edu/epse1/pohlmann/factglos/
9
OLS regression
  • Minimize the sum of squared errors
  • Y = a + bX + e
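
As a minimal sketch of what "minimizing the sum of squared errors" means for the model Y = a + bX + e with a single X, the closed-form solution can be computed by hand. The data and the function name are made up for illustration.

```python
def ols_simple(x, y):
    """Return (a, b) that minimize the sum of squared errors
    for y = a + b*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = ols_simple(x, y)
print(round(a, 2), round(b, 2))
```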

10
Coefficient of determination (R-sq): the proportion of
variance in Y explained by the X variables
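
A small sketch of that definition: R-sq is one minus the ratio of the residual sum of squares to the total sum of squares. The observed and fitted values below are hypothetical (the fitted values come from an assumed line 2.2 + 0.6*X).

```python
def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_error / ss_total

# Hypothetical observed values and fitted values from 2.2 + 0.6*X, X = 1..5.
y = [2, 4, 5, 4, 5]
y_hat = [2.2 + 0.6 * x for x in [1, 2, 3, 4, 5]]
r2 = r_squared(y, y_hat)
print(round(r2, 2))
```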
11
Diagnostics: multicollinearity
  • Bivariate correlations: > 0.7 is suspicious
  • Regression of every X variable on the other X
    variables (X1 = a + b1X2 + b2X3, etc.)
  • R-sq larger than 0.6 is critical.
  • Tolerance = 1 - R-sq; less than 0.4 is critical.
  • 1/Tolerance = variance inflation factor (VIF);
    larger than 2.5 is critical (some say larger than
    6!)
  • VIF: how much the variance of a coefficient (its
    squared standard error) is inflated compared to
    when the X variables were not correlated
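
The three rules of thumb above are the same cut-off expressed in different ways, which a short sketch makes visible. The function names are illustrative; the R-sq value is the one from the auxiliary regression of one X on the other X variables.

```python
import math

def tolerance(r_sq):
    """Tolerance of an X variable, given the R-sq of its regression
    on the other X variables."""
    return 1 - r_sq

def vif(r_sq):
    """Variance inflation factor: 1 / tolerance."""
    return 1 / tolerance(r_sq)

critical_r_sq = 0.6
tol = tolerance(critical_r_sq)
inflation = vif(critical_r_sq)
# The variance is inflated by the VIF, so the standard error
# is inflated by its square root.
se_inflation = math.sqrt(inflation)
print(round(tol, 2), round(inflation, 2), round(se_inflation, 2))
```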

12
Examine residuals
  • Studentized residual: standardized, accounting
    for possibly different variances according to X
  • Rule of thumb: studentized residuals larger than
    3.61 (in absolute value) are outliers
  • Removal leads to lower standard errors of
    estimate, higher R-sq. Non-removal makes tests
    more conservative

13
Studentized residuals
14
Outliers: influential cases
  • Leverage: how far the individual's X value
    differs from the mean of X. The larger the value,
    the stronger the impact on determining the y-hat
    values
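
For a regression with a single X, leverage has a simple closed form: h(i) = 1/n + (x(i) - mean X)^2 / sum of squared deviations of X. The sketch below uses that textbook formula on made-up data; it is an illustration, not Stata's implementation.

```python
def leverage(x):
    """Leverage of each case in a regression of Y on one X variable."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical X values; the last case lies far from the mean of X.
x = [1, 2, 3, 4, 10]
h = leverage(x)
print([round(hi, 2) for hi in h])
```

The leverages sum to the number of estimated parameters (here 2: intercept and slope), so a case with a value close to 1 almost determines its own y-hat.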

15
Outliers (cont'd)
  • DFBETA(i): change in the estimate of β when
    deleting individual i (one for each β for each i)
  • DFFIT(i): effect on the fit of deleting
    individual i (one for each i)
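
The idea behind DFBETA can be sketched by refitting the model once for every deleted case. This computes the raw change in the slope, b - b(i); standardized versions additionally divide by a standard error, which is left out here. Data and function names are hypothetical.

```python
def slope(x, y):
    """OLS slope of y on a single x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def raw_dfbeta(x, y):
    """Change in the slope when each case i is deleted in turn: b - b(i)."""
    b_full = slope(x, y)
    changes = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        changes.append(b_full - slope(x_rest, y_rest))
    return changes

# Hypothetical data: the last case is extreme on both X and Y.
x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 20]
dfb = raw_dfbeta(x, y)
most_influential = max(range(len(dfb)), key=lambda i: abs(dfb[i]))
print([round(d, 3) for d in dfb], most_influential)
```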

16
  • Cook's distance: impact of i on all parameter
    estimates jointly.
  • Influence of i is a function of the residual (on
    Y) and of the place in the distribution of X
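
One common way to write Cook's distance makes the second point explicit. The notation is introduced here for illustration: r_i is the (internally) studentized residual of case i, h_i its leverage, and p the number of estimated parameters.

```latex
D_i = \frac{r_i^{2}}{p} \cdot \frac{h_i}{1 - h_i}
```

The first factor grows with the size of the residual on Y, the second with the distance of the case from the centre of the X distribution.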

17
Dfbeta in Stata: standardized dfbetas; > 1 is
suspicious
19
Logistic regression
20
Non-continuous outcomes
  • Binary
  • Ordinal
  • Categorical
  • OLS makes no assumptions about the distribution of
    the X variables
  • It does, however, assume a continuous Y variable
    with a conditional normal distribution
  • Binary outcomes: logit models (logistic
    regression models)
  • Part of the Generalized Linear Model framework
    (GLM)
  • If OLS regression is applied to binary outcomes,
    the predicted probability could be < 0 or > 1
    (which is impossible)
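
The last point can be illustrated with a small sketch that compares predictions from a linear probability model with predictions from a logistic model. All coefficients below are invented for the illustration.

```python
import math

def linear_probability(x, a=-0.2, b=0.14):
    """Prediction from an OLS line fitted to a 0/1 outcome (hypothetical a, b)."""
    return a + b * x

def logistic_probability(x, a=-3.0, b=0.6):
    """Prediction from a logistic model (hypothetical a, b)."""
    return 1 / (1 + math.exp(-(a + b * x)))

xs = list(range(0, 11))
lpm = [linear_probability(x) for x in xs]
logit = [logistic_probability(x) for x in xs]

# X values for which the linear model predicts an impossible probability.
out_of_range = [x for x, p in zip(xs, lpm) if p < 0 or p > 1]
print(out_of_range)
```

The logistic predictions stay strictly between 0 and 1 for every value of X.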

21
The problem of the linear probability model
22
The dependent variable
  • Outcome: 0 or 1
  • (0 = no, 1 = yes)
  • Probabilities (P) between 0 and 1.
  • If 30% of respondents are voluntary members, the
    mean probability of membership is 0.3.

23
The Generalized Linear Model (GLM)
24
GLMs
  • The function g(µ) is the link function
  • Identity link (OLS)
  • Logit link
  • Log link
  • Probit link
  • Next to a link function, you also need to specify
    the probability distribution (normal, binomial,
    gamma, etc).
  • Thus, GLMs let you choose the probability
    distribution instead of assuming it.
  • Logistic regression: a binomial distribution
    with a logit link
  • Advantage over probit: we interpret the results
    in terms of odds (p/(1-p)), and thus in odds ratios.
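
For reference, the four link functions listed above can be written out as follows. This is a sketch in standard notation that is introduced here, not taken from the slides: μ is the expected value of Y, η the linear predictor, and Φ the standard normal distribution function.

```latex
\begin{align*}
g(\mu) &= \eta = a + b_1 X_1 + \dots + b_k X_k \\
\text{identity:}\quad g(\mu) &= \mu \\
\text{logit:}\quad g(\mu) &= \ln\frac{\mu}{1-\mu} \\
\text{log:}\quad g(\mu) &= \ln\mu \\
\text{probit:}\quad g(\mu) &= \Phi^{-1}(\mu)
\end{align*}
```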

25
And the winner of the Best Statistic Ever Award
is ...
  • The odds ratio
  • (in my opinion)
  • Advantage: margin-independent association measure
    for contingency tables.
  • E.g. relative mobility versus absolute mobility

26
Logistic regression model
  • Ratio p / (1-p): odds of yes versus no; range
    0 to infinity
  • The natural logarithm of these odds: ln(p/(1-p))
    (log-odds or logit)
  • 0 < odds < 1: log-odds < 0
  • odds > 1: log-odds > 0
  • odds = 1: log-odds = 0
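
The three cases above are easy to check numerically. A minimal sketch, with illustrative function names and probabilities:

```python
import math

def odds(p):
    """Odds of yes versus no for a probability p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural logarithm of the odds: the logit."""
    return math.log(odds(p))

for p in (0.3, 0.5, 0.8):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```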

28
The impact of X-variables on the logit
  • The logit is a linear function of the X
    variables.
  • Logistic regression coefficient: b
  • Antilog of b: e^b
  • e^b is the odds ratio
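
Because the logit is linear in X, a one-unit increase in X multiplies the odds by e^b, whatever the starting value of X. A short sketch with invented coefficients a and b:

```python
import math

# Hypothetical logit model: ln(p/(1-p)) = a + b*X
a, b = -1.0, 0.5

def odds_at(x):
    """Odds of Y = 1 at a given value of X."""
    return math.exp(a + b * x)

# Ratio of the odds at X = 3 to the odds at X = 2 ...
ratio_2_to_3 = odds_at(3) / odds_at(2)
# ... and at X = 8 versus X = 7: the same odds ratio.
ratio_7_to_8 = odds_at(8) / odds_at(7)
print(round(ratio_2_to_3, 4), round(math.exp(b), 4))
```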

29
Odds ratio
Odds ratio = (A / B) / (C / D)
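
A sketch of the formula for a 2x2 table. The cell layout is an assumption, since the slide's table is not in the transcript: A and B are the yes and no counts in the first group, C and D the yes and no counts in the second group. The counts are invented.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (A / B) / (C / D) for a 2x2 table with rows
    (A, B) and (C, D)."""
    return (a / b) / (c / d)

# Hypothetical table: 20 yes / 80 no in group 1, 10 yes / 90 no in group 2.
original = odds_ratio(20, 80, 10, 90)
# Make group 1 ten times larger: the margins change, the odds ratio does not.
rescaled = odds_ratio(200, 800, 10, 90)
print(round(original, 2), round(rescaled, 2))
```

This is what "margin-independent" means: changing the size of a row (or column) leaves the odds ratio unchanged.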
30
Back to probabilities
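
The formula on this slide is not part of the transcript. The standard way back from the logit to a probability, written here for a model with one X variable and coefficients a and b, is:

```latex
\text{odds} = e^{a + bX}, \qquad
p = \frac{\text{odds}}{1 + \text{odds}}
  = \frac{e^{a + bX}}{1 + e^{a + bX}}
  = \frac{1}{1 + e^{-(a + bX)}}
```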
31
Likelihood function
  • Between 0 and 1
  • Log likelihood: between -∞ and 0
  • -2 LL: between 0 and ∞
  • Chi-square distributed (χ²)
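
The ranges above follow from how the likelihood of a binary outcome is built up: it is a product of predicted probabilities. A minimal sketch with made-up outcomes and predicted probabilities:

```python
import math

def log_likelihood(y, p):
    """Log likelihood of 0/1 outcomes y given predicted probabilities p
    that Y = 1."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

# Hypothetical outcomes and predicted probabilities for four cases.
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.9]

ll = log_likelihood(y, p)     # negative: log of a number below 1
likelihood = math.exp(ll)     # 0.8 * 0.7 * 0.6 * 0.9
minus_2ll = -2 * ll           # positive
print(round(likelihood, 4), round(ll, 3), round(minus_2ll, 3))
```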

33
An example in Stata: who votes?
34
Logit postestimation
  • predict
  • p: probability that Y = 1
  • xb: linear prediction, ln(p/(1-p))
  • rs: standardized residual

35
Ordered logit model
Similar to estimating separate binary logit
models with equal slopes.
Advantage: total of probabilities = 1.
Problem: the proportional odds assumption.
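
A sketch of why the probabilities add up to 1 in the ordered logit model: every cumulative probability P(Y <= k) uses the same linear predictor (equal slopes) and its own cutpoint, and the category probabilities are the differences between successive cumulative probabilities. The linear predictor and cutpoints below are invented.

```python
import math

def inv_logit(z):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

def ordered_logit_probs(xb, cutpoints):
    """Category probabilities, with P(Y <= k) = inv_logit(cutpoint_k - xb)."""
    cumulative = [inv_logit(c - xb) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Hypothetical case: linear predictor 0.4, three cutpoints, four categories.
probs = ordered_logit_probs(0.4, [-1.0, 0.5, 2.0])
print([round(q, 3) for q in probs], round(sum(probs), 3))
```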
36
Multinomial logit model
Similar to estimating separate binary logit
models with unequal slopes.
Advantage: total of probabilities = 1.
Problem: many parameter estimates.
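
In the multinomial logit model each outcome other than the base outcome has its own linear predictor (unequal slopes), and the probabilities are normalized so that they sum to 1. A minimal sketch, with invented linear predictors and the base outcome fixed at 0:

```python
import math

def multinomial_logit_probs(linear_predictors):
    """Probabilities for the base outcome (linear predictor 0) followed by
    the other outcomes, one linear predictor each."""
    exps = [1.0] + [math.exp(xb) for xb in linear_predictors]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical case with three outcomes: base, outcome 2, outcome 3.
probs = multinomial_logit_probs([0.5, -1.2])
print([round(q, 3) for q in probs], round(sum(probs), 3))
```

With k outcomes there are k - 1 sets of coefficients, which is where the "many parameter estimates" come from.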