Title: Advanced regression analysis in Stata
1. Advanced regression analysis in Stata
- ASSR Short Intensive Course
- Herman van de Werfhorst
- April 2009
- Day 1: introduction to Stata and multivariate modelling
2. Programme
- Day 1
  - Introduction to Stata
  - Basics: the linear model (OLS)
  - Regression diagnostics
  - Logistic regression (logit)
- Day 2
  - Mixed models: fixed and random effects
  - Applications of mixed models in Stata
 
3. Introduction to Stata
- Data files: .dta
- Do files: command logs
- Ado files: additional programmes developed by the research community
- Log files: saved output
- Automatically generating publishable tables
 
4. Hands on Stata
- Creating a .do file
- Starting an output log file
- Missing values
- Scale construction
- Factor analysis
- Reliability analysis
 
5. Scale construction
Reliability coefficient: Cronbach's alpha
N = number of items, r = average correlation between items
Low reliability reduces (attenuates) observed correlations.
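The formula itself did not survive on this slide. Assuming it is the standardized form of Cronbach's alpha, alpha = N·r / (1 + (N - 1)·r), which matches the definitions of N and r above, a minimal Python sketch could look like this (the function name is illustrative, not a Stata command):

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the number of items and
    the average inter-item correlation (assumed formula, see lead-in)."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Same average correlation, more items: reliability goes up.
for n in (3, 5, 10):
    print(n, round(standardized_alpha(n, 0.3), 3))
```

With the average correlation held fixed, adding items raises alpha, which is why longer scales tend to be more reliable.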
6. Scale construction 1: The Likert scale
- 1. Recode variables/items in the right direction. High scores should indicate a high position on the underlying dimension
- 2. Take the average score on all selected items
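The two steps can be sketched in Python as follows. The 1-5 answer range, the example scores, and the choice to average over non-missing items only are assumptions made for illustration:

```python
def reverse_code(score, low=1, high=5):
    """Flip an item so that a high score means a high position on the dimension."""
    return low + high - score


def likert_scale(items):
    """Average the non-missing item scores; None marks a missing value."""
    valid = [s for s in items if s is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical respondent: three items on a 1-5 scale,
# the second one worded in the opposite direction.
raw = [4, 2, 5]
recoded = [raw[0], reverse_code(raw[1]), raw[2]]
score = likert_scale(recoded)
print(recoded, round(score, 2))
```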
 
7. Example: ESS 2002 immigration attitudes
8. Scale construction 2: Factor analysis
[Diagram: Item1 to Item5 and the latent factors Factor 1 and Factor 2]
Some handy info:
- Kim & Mueller, Factor analysis (Sage Publications, somewhere in the 1970s)
- http://www.siu.edu/epse1/pohlmann/factglos/
9. OLS regression
- Minimize the sum of squared errors
- Y = a + bX + e
 
10. Coefficient of determination R²: proportion of variance in Y explained by the X variables
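For a single X variable, the least-squares estimates and R² can be computed by hand. The sketch below uses made-up data and illustrative function names; it is not how Stata's regress is implemented:

```python
def ols_fit(x, y):
    """Intercept a and slope b that minimize the sum of squared errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def r_squared(x, y, a, b):
    """1 minus (residual sum of squares / total sum of squares)."""
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst


# Made-up data, roughly Y = 2X plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(x, y)
r2 = r_squared(x, y, a, b)
print(round(a, 3), round(b, 3), round(r2, 4))
```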
11. Diagnostics: multicollinearity
- Bivariate correlations: > 0.7 is suspicious
- Regression of every X variable on the other X variables (X1 = a + b1X2 + b2X3, etc.)
- R-sq larger than 0.6 is critical.
- Tolerance = 1 - R-sq; less than 0.4 is critical.
- 1/Tolerance = variance inflation factor (VIF); larger than 2.5 is critical (some say larger than 6!)
- VIF: how much the variance of a coefficient is inflated (its standard error by the square root of the VIF) compared to when the X variables were not correlated
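With only two X variables, the R-sq from regressing one on the other equals their squared correlation, so the whole chain (correlation, R-sq, tolerance, VIF) fits in a few lines. The data and function names are made up for illustration:

```python
def pearson_r(u, v):
    """Pearson correlation between two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def collinearity_stats(x1, x2):
    """Two-predictor case: R-sq of x1 on x2 is the squared correlation."""
    r = pearson_r(x1, x2)
    r_sq = r ** 2
    tolerance = 1 - r_sq
    vif = 1 / tolerance
    return r, r_sq, tolerance, vif


# Made-up, strongly correlated predictors.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
r, r_sq, tol, vif = collinearity_stats(x1, x2)
print(round(r, 3), round(r_sq, 3), round(tol, 3), round(vif, 3))
# Square root of the VIF: inflation factor for the standard error.
print(round(vif ** 0.5, 3))
```

In this example every rule of thumb on the slide is triggered at once: correlation above 0.7, R-sq above 0.6, tolerance below 0.4, VIF above 2.5.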
12. Examine residuals
- Studentized residual (standardized, accounting for possibly different variances according to X)
- Rule of thumb: studentized residuals larger than 3.61 (in absolute value) are outliers
- Removal leads to lower standard errors of estimate and higher R-sq. Non-removal makes tests more conservative
13. Studentized residuals
14. Outliers: influential cases
- Leverage: how far the individual's X value differs from the mean X. The larger the value, the stronger the impact on determining the y-hat values
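A small sketch of leverage and studentized residuals for a regression with one X variable. It assumes the externally studentized version (the residual variance is estimated without case i); the data are made up, with one outlier on Y at x = 5 and one case far out on X at x = 15:

```python
def fit(x, y):
    """Simple OLS: returns intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def leverage(x):
    """Hat values for simple regression: 1/n + (x - mean)^2 / Sxx."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]


def studentized_residuals(x, y):
    """Externally studentized residuals (p = 2 parameters: a and b)."""
    n, p = len(x), 2
    a, b = fit(x, y)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    h = leverage(x)
    sse = sum(ei ** 2 for ei in e)
    out = []
    for ei, hi in zip(e, h):
        # Residual variance estimated with case i left out.
        s2_deleted = (sse - ei ** 2 / (1 - hi)) / (n - p - 1)
        out.append(ei / (s2_deleted * (1 - hi)) ** 0.5)
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 15]
y = [2.2, 3.9, 6.1, 8.0, 16.0, 12.1, 13.8, 16.2, 17.9, 30.1]
h = leverage(x)
t = studentized_residuals(x, y)
for xi, hi, ti in zip(x, h, t):
    print(xi, round(hi, 3), round(ti, 2))
```

The case at x = 15 has the highest leverage but a small residual, while the case at x = 5 has low leverage but by far the largest studentized residual: leverage and outlyingness on Y are separate things.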
15. Outliers (cont'd)
- DFBETA(i): change in the estimate of ß when deleting individual i (one for each ß for each i)
- DFFIT(i): effect on the fit of deleting individual i (one for each i)
16.
- Cook's distance: impact of i on all parameter estimates jointly.
- The influence of i is a function of the residual (on Y) and of the place in the distribution of X
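The deletion idea can be sketched by simply refitting the model without each case. The code below computes the raw (unstandardized) DFBETA for the slope and Cook's distance for a one-X regression; the data are made up, and the last case combines an extreme X value with a large residual:

```python
def fit(x, y):
    """Simple OLS: returns intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def dfbeta_slope(x, y):
    """Raw DFBETA: slope with all cases minus slope with case i deleted."""
    _, b = fit(x, y)
    out = []
    for i in range(len(x)):
        _, b_i = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append(b - b_i)
    return out


def cooks_distance(x, y):
    """Cook's D = e^2 * h / (p * s^2 * (1 - h)^2), with p = 2 parameters."""
    n, p = len(x), 2
    a, b = fit(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(ei ** 2 for ei in e) / (n - p)
    d = []
    for xi, ei in zip(x, e):
        h = 1 / n + (xi - mx) ** 2 / sxx
        d.append(ei ** 2 * h / (p * s2 * (1 - h) ** 2))
    return d


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 15]
y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 13.8, 16.2, 17.9, 20.0]
dfb = dfbeta_slope(x, y)
cook = cooks_distance(x, y)
for xi, di, ci in zip(x, dfb, cook):
    print(xi, round(di, 3), round(ci, 3))
```

The last case dominates both measures, because its influence combines a large residual with a far-out position on X.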
17. Dfbeta in Stata: standardized dfbetas; > 1 is suspicious
19. Logistic regression
20. Non-continuous outcomes
- Binary
- Ordinal
- Categorical
- OLS makes no assumptions about the distribution of the X variables
- It does, however, assume a continuous Y variable with a conditional normal distribution
- Binary outcomes: logit models = logistic regression models
- Part of the Generalized Linear Model (GLM) framework
- If OLS regression is applied to binary outcomes, the predicted probability can be < 0 or > 1 (which is impossible)
21. The problem of the linear probability model
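The problem is easy to reproduce. The sketch below fits OLS to a made-up 0/1 outcome and shows that some fitted "probabilities" fall outside the 0-1 range:

```python
def ols_fit(x, y):
    """Simple OLS: returns intercept and slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


# Made-up data: the outcome switches from mostly 0 to mostly 1 as X rises.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
a, b = ols_fit(x, y)
fitted = [a + b * xi for xi in x]
print([round(f, 3) for f in fitted])
```

The lowest X values get a negative prediction and the highest get a prediction above 1.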
22. The dependent variable
- Outcome 0 or 1
- (0 = no, 1 = yes)
- Probabilities (P) between 0 and 1.
- If 30% of respondents are voluntary members, the mean probability of membership is 0.3.
23. The Generalized Linear Model (GLM)
24. GLMs
- The function g(µ) is the link function
- Identity link (OLS)
- Logit link
- Log link
- Probit link
- Next to a link function, you also need to specify the probability distribution (normal, binomial, gamma, etc).
- Thus GLMs let you choose the probability distribution instead of assuming it.
- Logistic regression: a binomial distribution with a logit link
- Advantage over probit: we interpret the results in terms of odds (p/(1-p)), and thus in odds ratios.
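To make the link functions concrete, the sketch below pairs each link g(µ) from the list with its inverse and checks that they undo each other. The dictionary layout and names are illustrative only:

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit link

# name: (link g, inverse link)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}

mu = 0.3
for name, (g, g_inv) in LINKS.items():
    eta = g(mu)  # value on the scale of the linear predictor
    print(name, round(eta, 4), round(g_inv(eta), 4))
```

Logit and probit both map a probability between 0 and 1 onto the whole real line, which is what allows a linear predictor without impossible predictions.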
25. And the winner of the Best Statistic Ever Award is...
- The odds ratio
- (in my opinion)
- Advantage: margin-independent association measure for contingency tables.
- E.g. relative mobility versus absolute mobility
 
26. Logistic regression model
- Ratio p / (1-p): odds of yes versus no; range 0 to infinity
- The natural logarithm of these odds (log-odds or logit)
- 0 < odds < 1: log-odds < 0
- odds > 1: log-odds > 0
- odds = 1: log-odds = 0
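The three cases can be checked directly with a few example probabilities (function names are illustrative):

```python
import math


def odds(p):
    """Odds of yes versus no for a probability p."""
    return p / (1 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))


for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```

Probabilities below 0.5 give odds below 1 and a negative logit; probabilities above 0.5 give the mirror image.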
 
28. The impact of X-variables on the logit
- The logit is a linear function of the X variables.
- Logistic regression coefficient: b
- Antilog of b: e^b = odds ratio
 
29. Odds ratio
Odds ratio = (A / B) / (C / D)
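A worked example with a hypothetical 2x2 table, assuming A and B are the yes/no counts in one group and C and D the yes/no counts in the other. It also illustrates margin independence (multiplying one row by 10 leaves the odds ratio unchanged) and the link to the logit coefficient b = ln(odds ratio):

```python
import math


def odds_ratio(a, b, c, d):
    """(A / B) / (C / D): odds in group 1 divided by odds in group 2."""
    return (a / b) / (c / d)


# Hypothetical counts: group 1 has 30 yes / 70 no, group 2 has 10 yes / 90 no.
base = odds_ratio(30, 70, 10, 90)
# Ten times as many cases in group 1: the margins change, the odds ratio does not.
scaled = odds_ratio(300, 700, 10, 90)
# The logit coefficient for the group dummy would be the log of the odds ratio.
b = math.log(base)
print(round(base, 3), round(scaled, 3), round(b, 3))
```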
30. Back to probabilities
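Going back from the logit to a probability uses the inverse of the logit, p = 1 / (1 + e^-(a + bX)). The sketch below uses hypothetical coefficients a and b, chosen only for illustration:

```python
import math


def inv_logit(z):
    """Convert a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical intercept and slope on the logit scale.
a, b = -2.0, 0.5
probs = [inv_logit(a + b * x) for x in range(0, 9)]
# Change in probability for each one-unit step in X.
steps = [q - p for p, q in zip(probs, probs[1:])]
print([round(p, 3) for p in probs])
print([round(s, 3) for s in steps])
```

The effect of X is constant on the logit scale but not on the probability scale: the steps are largest around p = 0.5 and shrink towards 0 and 1.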
31. Likelihood function
- Between 0 and 1
- Log likelihood: between -∞ and 0
- -2 LL: between 0 and +∞
- Differences in -2 LL between nested models are chi-square distributed (χ²)
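These ranges can be verified on a toy example. The outcome values are made up, and the second set of probabilities is a hypothetical stand-in for the fitted probabilities of a model with predictors, not the result of an actual estimation:

```python
import math


def log_likelihood(y, p):
    """Bernoulli log likelihood: sum of ln(p) for y = 1 and ln(1 - p) for y = 0."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))


y = [1, 0, 0, 1, 1, 0, 1, 1]
# Intercept-only model: everyone gets the sample mean as predicted probability.
p_null = [sum(y) / len(y)] * len(y)
# Hypothetical predicted probabilities from a model with predictors.
p_model = [0.8, 0.3, 0.2, 0.7, 0.9, 0.4, 0.6, 0.7]

ll_null = log_likelihood(y, p_null)
ll_model = log_likelihood(y, p_model)
# Difference in -2 LL: the statistic compared against a chi-square distribution.
lr_chi2 = (-2 * ll_null) - (-2 * ll_model)
print(round(ll_null, 3), round(ll_model, 3), round(lr_chi2, 3))
```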
 
33. An example in Stata: who votes?
34. Logit postestimation
- predict
  - p: probability that Y = 1
  - xb: linear prediction, ln(p/(1-p))
  - rs: standardized residual
 
35. Ordered logit model
Similar to estimating separate binary logit models with equal slopes
Advantage: total of probabilities = 1
Problem: the proportional odds assumption
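A sketch of how the category probabilities follow from one slope and several cutpoints, assuming a common parameterization in which P(Y <= j) is the inverse logit of (cutpoint j minus the linear predictor). The cutpoints and linear predictor values are hypothetical:

```python
import math


def inv_logit(z):
    return 1 / (1 + math.exp(-z))


def ordered_logit_probs(xb, cutpoints):
    """Category probabilities as differences of cumulative probabilities."""
    cumulative = [inv_logit(k - xb) for k in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


def cumulative_logits(xb, cutpoints):
    """Log-odds of Y <= j for every cutpoint j."""
    return [k - xb for k in cutpoints]


cuts = [-1.0, 1.0]  # two cutpoints, so three ordered categories
for xb in (0.0, 0.5, 2.0):
    print(xb, [round(p, 3) for p in ordered_logit_probs(xb, cuts)])
```

Because every cumulative logit shifts by the same amount when the linear predictor changes, the slopes are equal across the implied binary splits; that is the proportional odds assumption.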
36. Multinomial logit model
Similar to estimating separate binary logit models with unequal slopes
Advantage: total of probabilities = 1
Problem: many parameter estimates
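A sketch of the multinomial logit probabilities with one X variable, assuming the usual setup in which one category is the base (its coefficients are fixed at 0) and every other category has its own intercept and slope. All coefficient values are hypothetical:

```python
import math


def multinomial_logit_probs(x, coefs):
    """coefs holds one (intercept, slope) pair per non-base category;
    the base category has intercept 0 and slope 0."""
    scores = [0.0] + [a + b * x for a, b in coefs]
    denom = sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores]


# Three categories: base plus two others, each with its own slope.
coefs = [(0.2, 0.5), (-0.3, 1.0)]
probs = multinomial_logit_probs(1.0, coefs)
print([round(p, 3) for p in probs], round(sum(probs), 3))
```

Each extra category adds a full set of coefficients, which is where the many parameter estimates come from.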