Advanced regression analysis in Stata presentation

1
Advanced regression analysis in Stata

ASSR Short Intensive Course
Herman van de Werfhorst
April 2009
Day 1: Introduction to Stata and multivariate
modelling

2
Programme

Day 1
Introduction to Stata
Basics: the linear model (OLS)
Regression diagnostics
Logistic regression (logit)
Day 2
Mixed models: fixed and random effects
Applications of mixed models in Stata

3
Introduction to Stata

Data files: .dta
Do files: command logs
Ado files: additional programmes developed by the
research community
Log files: saved output
Automatically generating publishable tables

4
Hands on Stata

Creating a .do file
Starting an output log file
Missing values
Scale construction
Factor analysis
Reliability analysis

5
Scale construction
Reliability coefficient: Cronbach's alpha
alpha = N r / (1 + (N - 1) r)
N = number of items, r = average correlation
between items
Low reliability attenuates correlations with other variables.
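
To make the roles of N and r concrete, here is a minimal Python sketch of the standardized form of Cronbach's alpha. The function name and the example values (r = 0.3) are illustrative assumptions, not taken from the slides.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the number of items (N)
    and the average inter-item correlation (r)."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)


# With the same average correlation, more items give a higher alpha.
for n_items in (2, 5, 10):
    print(n_items, round(standardized_alpha(n_items, 0.3), 3))
```

With a single item, alpha reduces to r itself. Adding items with the same average correlation raises reliability.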
6
Scale construction 1: The Likert scale

1. Recode variables/items in the right direction.
High scores should indicate a high position on
the underlying dimension
2. Take the average score on all selected items
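
The two steps can be sketched in Python as follows. The item names, the 1 to 5 response range, and the choice to skip missing answers (None) are assumptions made for illustration only.

```python
def reverse_code(score, low=1, high=5):
    """Flip a score so that high values indicate a high position
    on the underlying dimension (assumed range: low..high)."""
    return low + high - score


def likert_scale(responses, reversed_items, low=1, high=5):
    """Step 1: recode reversed items. Step 2: average the items.
    Missing answers (None) are skipped, which is an assumption here."""
    recoded = []
    for item, score in responses.items():
        if score is None:
            continue
        if item in reversed_items:
            score = reverse_code(score, low, high)
        recoded.append(score)
    return sum(recoded) / len(recoded) if recoded else None


# Hypothetical respondent: q2 is worded in the opposite direction.
answers = {"q1": 5, "q2": 1, "q3": 4}
print(likert_scale(answers, reversed_items={"q2"}))
```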

7
Example: ESS 2002 immigration attitudes
8
Scale construction 2: Factor analysis
[Diagram: five items (Item1 to Item5) and two factors (Factor 1, Factor 2)]
Some handy info:
- Kim & Mueller, Factor Analysis (Sage Publications, somewhere in the 1970s)
- http://www.siu.edu/epse1/pohlmann/factglos/
9
OLS regression

Minimize the sum of squared errors
Y = a + bX + e
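
As a rough illustration of what "minimizing squared errors" means for a single X variable, the following Python sketch computes the closed-form estimates of a and b and the resulting sum of squared errors. The data are made up, and the function names are illustrative.

```python
def fit_ols(x, y):
    """Return (a, b) minimizing the sum of squared errors for Y = a + bX + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of squared differences between observed and predicted Y."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_ols(x, y)
print(a, b, sum_squared_errors(x, y, a, b))
```

Any other pair (a, b) gives a sum of squared errors at least as large as the fitted one.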

10
Coefficient of determination R²: proportion of
variance in Y explained by the X variables
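
A minimal sketch of this definition in Python, assuming the fitted values come from an OLS model with an intercept. The numbers are invented for illustration.

```python
def r_squared(y, fitted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_resid / ss_total


# Invented observed and fitted values.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(r_squared(y, fitted))
```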
11
Diagnostics: multicollinearity

Bivariate correlations: > 0.7 is suspicious
Regression of every X variable on the other X
variables (X1 = a + b1X2 + b2X3 etc.)
R-sq larger than 0.6 is critical.
Tolerance = 1 - R-sq; less than 0.4 is critical.
VIF (variance inflation factor) = 1/Tolerance;
larger than 2.5 is critical (some say larger than
6!)
VIF: how much the variance of a coefficient is inflated
(the standard error by its square root) compared to when
the X variables are uncorrelated
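
The three rules of thumb describe the same cut-off in different units, which a small Python sketch makes visible. It assumes the R-sq comes from regressing one X variable on the other X variables; the function name is illustrative.

```python
import math


def collinearity_stats(r_squared):
    """Return (tolerance, VIF, standard-error inflation) for one X variable,
    given the R-sq from regressing it on the other X variables."""
    tolerance = 1 - r_squared
    vif = 1 / tolerance
    return tolerance, vif, math.sqrt(vif)


for r2 in (0.0, 0.6, 0.9):
    print(r2, collinearity_stats(r2))
```

An R-sq of 0.6 corresponds to a tolerance of 0.4 and a VIF of 2.5.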

12
Examine residuals

Studentized residual (standardized, accounting
for possibly different variances according to X)
Rule of thumb: studentized residuals larger than
3.61 in absolute value are outliers
Removal leads to a lower standard error of the
estimate and a higher R-sq. Non-removal makes tests
more conservative
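
The following Python sketch computes externally studentized residuals for a regression with one X variable. Each residual is divided by a standard error that uses the leverage of the case and an error variance estimated without that case. The data, including the planted outlier, are invented. Whether this matches a particular package's definition in every detail is an assumption.

```python
import math


def studentized_residuals(x, y):
    """Externally studentized residuals for Y = a + bX + e (one X variable)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    sse = sum(e * e for e in resid)
    result = []
    for e, h in zip(resid, leverage):
        # Error variance with this case deleted; n - 3 = n - (2 parameters) - 1.
        s2_deleted = (sse - e * e / (1 - h)) / (n - 3)
        result.append(e / math.sqrt(s2_deleted * (1 - h)))
    return result


# Invented data: a near-perfect line with one planted outlier at position 4.
x = list(range(1, 11))
y = [2 * xi + (0.1 if xi % 2 == 0 else -0.1) for xi in x]
y[4] += 10
for xi, t in zip(x, studentized_residuals(x, y)):
    print(xi, round(t, 2), "outlier" if abs(t) > 3.61 else "")
```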

13
Studentized residuals
14
Outliers: influential cases

Leverage: how far the individual's X value
differs from the mean X. The larger the value,
the stronger the impact on determining the y-hat
values
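
For a regression with one X variable, leverage has a simple closed form, sketched below in Python. The x values are invented; the last one lies far from the mean.

```python
def leverages(x):
    """Leverage of each case in a regression with an intercept and one X:
    h_i = 1/n + (x_i - mean_x)^2 / sum of squared deviations of X."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Invented X values; the last one is far from the mean.
x = [1, 2, 3, 4, 5, 20]
for xi, h in zip(x, leverages(x)):
    print(xi, round(h, 3))
```

The leverages sum to the number of estimated parameters (here 2: intercept and slope), so a single case taking up most of that total dominates the fit.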

15
Outliers (cont'd)

DFBETA(i): change in the estimate of β when
deleting individual i (one for each β for each i)
DFFIT(i): effect on the fit of deleting
individual i (one for each i)

Cook's distance: impact of i on all parameter
estimates jointly.
Influence of i is a function of the residual (on
Y) and of the place in the distribution of X
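
A brute-force Python sketch of these ideas for one X variable: the slope is re-estimated with each case deleted (an unstandardized DFBETA), and Cook's distance is computed from the residual and the leverage. The data are invented, and the standardization that statistical packages apply to DFBETA is left out here.

```python
def fit(x, y):
    """OLS intercept and slope for Y = a + bX + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def dfbeta_slope(x, y):
    """Unstandardized DFBETA: slope on all cases minus slope without case i."""
    _, b_full = fit(x, y)
    changes = []
    for i in range(len(x)):
        _, b_deleted = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        changes.append(b_full - b_deleted)
    return changes


def cooks_distance(x, y):
    """Cook's D: combines the squared residual with the leverage of case i."""
    n, p = len(x), 2  # p = number of parameters (intercept and slope)
    a, b = fit(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    lev = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    return [e * e / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]


# Invented data: five cases on the line y = x, plus one far-away case.
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 0]
print(dfbeta_slope(x, y))
print(cooks_distance(x, y))
```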

17
Dfbeta in Stata: standardized dfbetas. > 1 is
suspicious
19
Logistic regression
20
Non-continuous outcomes

Binary
Ordinal
Categorical
OLS makes no assumptions about the distribution of the
X variables
It does, however, assume a continuous Y variable with a
conditional normal distribution
Binary outcomes: logit models (logistic
regression models)
Part of the Generalized Linear Model framework
(GLM)
If OLS regression is applied to binary outcomes, the
predicted probability could be < 0 or > 1 (which is
impossible)

21
The problem of the linear probability model
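
A small Python sketch of the problem, using invented data: an OLS line fitted to a 0/1 outcome produces "probabilities" below 0 and above 1 at the extremes of X.

```python
def fit_ols(x, y):
    """OLS intercept and slope for Y = a + bX + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


# Invented data: binary outcome that switches from 0 to 1 as X increases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = fit_ols(x, y)
predicted = [a + b * xi for xi in x]
for xi, p in zip(x, predicted):
    flag = "impossible" if p < 0 or p > 1 else ""
    print(xi, round(p, 3), flag)
```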
22
The dependent variable

Outcome: 0 or 1
(0 = no, 1 = yes)
Probabilities (P) between 0 and 1.
If 30% of respondents are voluntary members, the
mean probability of membership is 0.3.

23
The Generalized Linear Model (GLM)
24
GLMs

The function g(µ) is the link function
Identity link (OLS)
Logit link
Log link
Probit link
Next to a link function, you also need to specify
the probability distribution (normal, binomial,
gamma, etc.).
Thus GLMs let you choose the probability
distribution instead of assuming it.
Logistic regression: a binomial distribution
with a logit link
Advantage over probit: we interpret the results
in terms of odds (p/(1-p)), and thus in odds ratios.
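
The four link functions listed above can be written down with their inverses in a few lines of Python. This is only a sketch of the mapping g(µ) and back, not a GLM estimator; the dictionary name is illustrative.

```python
import math
from statistics import NormalDist

_normal = NormalDist()  # standard normal, used for the probit link

# Each entry: (link g(mu), inverse link g^-1(eta)).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "probit": (_normal.inv_cdf, _normal.cdf),
}

mu = 0.3
for name, (link, inverse) in LINKS.items():
    eta = link(mu)
    print(name, round(eta, 4), round(inverse(eta), 4))
```

Every link maps the mean to the scale of the linear predictor, and the inverse maps it back. For the logit and probit links the inverse always lands between 0 and 1.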

25
And the winner of the Best Statistic Ever Award
is...

The odds ratio
(in my opinion)
Advantage: margin-independent association measure
for contingency tables.
E.g. relative mobility versus absolute mobility

26
Logistic regression model

Ratio p / (1-p): odds of yes versus no; range
0 to infinity
The natural logarithm of this odds (log-odds or
logit)
0 < odds < 1: log-odds < 0
odds > 1: log-odds > 0
odds = 1: log-odds = 0
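
The relation between probabilities, odds and log-odds can be checked with a short Python sketch. The probabilities used are arbitrary examples.

```python
import math


def odds(p):
    """Odds of yes versus no for a probability p (0 < p < 1)."""
    return p / (1 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))


for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```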

28
The impact of X variables on the logit

The logit is a linear function of the X
variables.
Logistic regression coefficient: b
Antilog of b: e^b
e^b = odds ratio
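
A brief Python sketch of why the antilog of b is an odds ratio: in a logit model with one X variable, raising X by one unit multiplies the odds by e^b, whatever the starting value of X. The coefficients a = -1.0 and b = 0.7 are hypothetical.

```python
import math


def odds_at(a, b, x):
    """Odds implied by the logit model: ln(odds) = a + b * x."""
    return math.exp(a + b * x)


def probability_at(a, b, x):
    """Probability implied by the same model: odds / (1 + odds)."""
    o = odds_at(a, b, x)
    return o / (1 + o)


a, b = -1.0, 0.7  # hypothetical coefficients
for x in (0, 2, 5):
    ratio = odds_at(a, b, x + 1) / odds_at(a, b, x)
    print(x, round(ratio, 4), round(math.exp(b), 4), round(probability_at(a, b, x), 4))
```

The ratio of odds is the same at every x, while the change in probability is not.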

29
Odds ratio
Odds ratio = (A / B) / (C / D)
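
The formula, and the sense in which the odds ratio is margin-independent, can be sketched in Python. The cell counts are invented, and the layout (A and B in the first row, C and D in the second) is an assumption about the table behind the formula.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with rows (A, B) and (C, D)."""
    return (a / b) / (c / d)


# Invented counts.
a, b, c, d = 40, 10, 20, 30
print(odds_ratio(a, b, c, d))

# Multiplying a whole row or a whole column by a constant changes the
# marginal totals but leaves the odds ratio unchanged.
print(odds_ratio(3 * a, 3 * b, c, d))   # first row three times as large
print(odds_ratio(5 * a, b, 5 * c, d))   # first column five times as large
```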
30
Back to probabilities
31
Likelihood function

Likelihood: between 0 and 1
Log likelihood (LL): between -∞ and 0
-2 LL: between 0 and ∞
Differences in -2 LL between nested models are
chi-square distributed (χ²)
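
A minimal Python sketch of these quantities for a binary outcome. It compares a model with one common probability against a model with a separate probability for each of two groups, which is equivalent to a logit model with one binary X, so the comparison has 1 degree of freedom. The data are invented. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log likelihood of outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))


def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2))


# Invented data: two groups of four respondents each.
group = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

# Null model: the same probability for everyone (the overall mean).
p_null = [sum(y) / len(y)] * len(y)
# Model with X: one probability per group (the group means).
means = {g: sum(yi for yi, gi in zip(y, group) if gi == g) / group.count(g)
         for g in set(group)}
p_model = [means[g] for g in group]

ll_null = log_likelihood(y, p_null)
ll_model = log_likelihood(y, p_model)
lr_stat = (-2 * ll_null) - (-2 * ll_model)
print(ll_null, ll_model, lr_stat, chi2_pvalue_1df(lr_stat))
```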

33
An example in Stata: who votes?
34
Logit Postestimation

Predict
p: probability that Y = 1
xb: linear prediction, ln(p/(1-p))
rs: standardized residual

35
Ordered logit model
Similar to estimating separate binary logit
models with equal slopes
Advantage: the probabilities sum to 1
Problem: the proportional odds assumption
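
A Python sketch of how an ordered logit model turns one linear prediction and a set of cutpoints into category probabilities. The cutpoints and the value of xb are hypothetical, and the parametrization (cumulative probability = inverse logit of cutpoint minus xb) is one common convention, assumed here.

```python
import math


def inv_logit(z):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1 / (1 + math.exp(-z))


def ordered_logit_probs(xb, cutpoints):
    """Category probabilities from one linear prediction (equal slopes)
    and increasing cutpoints: P(Y <= j) = inv_logit(cutpoint_j - xb)."""
    cumulative = [inv_logit(c - xb) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs


# Hypothetical values: three cutpoints give four ordered categories.
probs = ordered_logit_probs(xb=0.4, cutpoints=[-1.0, 0.5, 2.0])
print(probs, sum(probs))
```

Only the cutpoints differ between categories; the effect of X (through xb) is the same at every threshold, which is the proportional odds assumption.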
36
Multinomial logit model
Similar to estimating separate binary logit
models with unequal slopes
Advantage: the probabilities sum to 1
Problem: many parameter estimates
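
A Python sketch of the multinomial logit probabilities, assuming a base category whose linear prediction is fixed at 0 and one separate linear prediction (with its own slopes) for every other category. The values are hypothetical.

```python
import math


def multinomial_logit_probs(linear_predictions):
    """Probabilities for the base category (first) and each other category.
    linear_predictions: one value per non-base category, each coming from
    its own set of coefficients (unequal slopes)."""
    exps = [1.0] + [math.exp(v) for v in linear_predictions]  # exp(0) = 1 for base
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear predictions for two non-base categories.
probs = multinomial_logit_probs([0.8, -0.5])
print(probs, sum(probs))
```

Each non-base category needs its own full set of coefficients, which is why the number of parameter estimates grows quickly.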
