Title: Multiple regression
1. Multiple regression
- V506 Class 12
- November 12, 2009
2. Overview
- Theory and common sense
- Stepwise regression
- Stepwise regression in SPSS
- Qualitative (dummy) variables
- Curvilinear relationships
- Transforming variables in SPSS
- Multicollinearity
3. Theory and common sense
- How do you select independent variables for a multiple regression model?
- Some guidance is provided by theory suggesting causal relationships
- Other variables may be suggested by common sense, or by general knowledge of what might affect the dependent variable
- But there should be reasons for the inclusion of any independent variable
4. Stepwise regression
- Given a list of candidate independent variables, which do you use for the regression model?
- The SPSS Linear Regression procedure includes stepwise regression
- Stepwise regression selects variables, one at a time, that are likely to produce the regression model that most effectively predicts the dependent variable
5. Stepwise regression
- Stepwise regression has (deservedly) gained a bad reputation among statisticians
- There is a tendency by some to throw in any possible variable as a potential independent variable and let the stepwise procedure sort through them to find a regression model
6. Stepwise regression
- Used in this manner, there is a high likelihood of finding relationships that depend on random variation in the sample rather than true relationships in the population
- But with a more carefully selected, limited set of possible independent variables, stepwise regression can be a valuable tool in finding a regression model
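The selection logic can be sketched in plain Python (with hypothetical toy data; the `min_gain` stopping rule stands in for the F-to-enter test SPSS actually uses, and the backward "removal" step of full stepwise selection is omitted):

```python
def ols_r2(x_cols, y):
    """R-squared of an OLS fit with intercept, via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in x_cols] for i in range(n)]
    p = len(rows[0])
    # Augmented system [X'X | X'y], solved by Gauss-Jordan elimination.
    aug = [[sum(rows[i][r] * rows[i][c] for i in range(n)) for c in range(p)]
           + [sum(rows[i][r] * y[i] for i in range(n))] for r in range(p)]
    for r in range(p):
        piv = aug[r][r]
        aug[r] = [v / piv for v in aug[r]]
        for r2 in range(p):
            if r2 != r:
                f = aug[r2][r]
                aug[r2] = [v - f * w for v, w in zip(aug[r2], aug[r])]
    beta = [aug[r][p] for r in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_stepwise(candidates, y, min_gain=0.01):
    """Add, one at a time, the candidate that most improves R-squared."""
    chosen, best_r2 = [], 0.0
    while len(chosen) < len(candidates):
        scores = {name: ols_r2([candidates[c] for c in chosen] +
                               [candidates[name]], y)
                  for name in candidates if name not in chosen}
        name, r2 = max(scores.items(), key=lambda kv: kv[1])
        if r2 - best_r2 < min_gain:
            break
        chosen.append(name)
        best_r2 = r2
    return chosen, best_r2
```

With one strongly predictive candidate and one irrelevant one, the procedure picks the former and then stops, illustrating how a careless candidate list could instead let noise variables slip in.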
7. Stepwise regression to predict housing value
- Predict housing value in AffHsgEx.sav
- As independent variables, choose percent renter-occupied units with rent less than $200, metropolitan area population, percent population change, and median family income
8. Stepwise regression output: variables entered/removed
9. Stepwise regression output: model summary
10. Stepwise regression output: ANOVA
11. Stepwise regression output: coefficients
12. Stepwise regression output: excluded variables
13. Stepwise regression in SPSS
- Use the Statistics, Regression, Linear command
- Enter all of the candidate independent variables
- Select Method: Stepwise
14. Using qualitative variables in regression
- So far we have used one or more quantitative independent variables as predictor(s) of a continuous dependent variable
- But sometimes it can be useful to include qualitative, categorical variables as predictors in multiple regressions
15. Predicting housing values using median family income
- Start by trying to predict the median value of owner-occupied houses in metropolitan areas using median family income
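The computation behind this single-predictor regression can be sketched in Python with hypothetical numbers (the real values come from the SPSS output on the following slides):

```python
# Hypothetical metro-area data, in dollars.
incomes = [18000, 20000, 22000, 25000, 28000]   # median family income
values = [38000, 43000, 47000, 55000, 61000]    # median housing value

n = len(incomes)
xbar, ybar = sum(incomes) / n, sum(values) / n
# Least-squares slope and intercept for values = intercept + slope*income.
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(incomes, values))
         / sum((x - xbar) ** 2 for x in incomes))
intercept = ybar - slope * xbar
# slope = predicted change in housing value per extra dollar of income
```

A useful check: the fitted line always passes through the point of means, so `intercept + slope * xbar` equals `ybar`.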
16. Regression results
17. Regression results (continued)
18. Including a categorical variable
- Some other analyses suggest that the region of the country has a strong effect on housing values
- Median housing values are highest in the West
19. Boxplots of housing value by region
20. ANOVA of housing value by region
21. Creating a West dummy variable
- Housing values are higher in the West
- We want the regression to include information on whether a metropolitan area is in the West
- Create a dummy variable
- Value 1 if the metro area is in the West
- Value 0 if the metro area is not in the West
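A sketch of the coding in Python (the state list is the Census Bureau's West region, written here from memory; in SPSS this would be done with a Transform, Recode or Compute command):

```python
# Census West region states (illustrative list, not taken from the dataset).
WEST_STATES = {"WA", "OR", "CA", "NV", "AZ", "NM", "CO", "UT",
               "ID", "MT", "WY", "AK", "HI"}

def west_dummy(state):
    """1 if the metro area's state is in the West, 0 otherwise."""
    return 1 if state in WEST_STATES else 0

states = ["CA", "IN", "WA", "NY"]
west = [west_dummy(s) for s in states]  # -> [1, 0, 1, 0]
```

The resulting 0/1 variable enters the regression like any other predictor; its coefficient shifts the predicted value up or down for Western metros.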
22. Regression results
23. Regression results
24. Interpreting the dummy variable regression coefficient
- The regression coefficient for the West dummy variable is 22895.976
- The significance is 0.000, so the regression coefficient is significantly different from zero (reject the null hypothesis that it equals zero)
- The coefficient says housing values are about $22,896 higher in the West than in other regions, after controlling for the effect of median family income
25. Nonlinear relationships
- Regression assumes linear relationships between the dependent variable and the independent variables
- But variables can be related to one another with a nonlinear, curvilinear relationship
26. Relationship of rent of new units to population
- Looking at data for 101 of the largest metropolitan areas in 1980
- Dependent variable: rent of renter-occupied units built 1975-1980
- Independent variable: population
27. Scatterplot
28. Regression results
29. Regression results
30. Scatterplot illustrating nonlinear relationship
31. Form of relationship
- Rent of new units increases with population
- But the amount of that increase seems to decline as population increases
- This suggests the possibility of a nonlinear relationship
32. Variable transformation
- We can often handle nonlinear relationships by applying a mathematical transformation to the independent or dependent variable that makes the relationship linear
- The curve on the scatterplot shows what the relationship would be if rent were related to the natural logarithm of population
33. Doing the variable transformation
- Create a new variable that has the value of the natural logarithm of population
- Use that new variable as the independent variable in the regression
- Regression equation
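The transformation step can be sketched in Python. The data here are hypothetical and constructed so that rent rises by a constant amount each time population doubles, i.e. rent is exactly linear in log(population):

```python
import math

population = [250, 500, 1000, 2000, 4000, 8000]   # thousands (hypothetical)
rent = [310, 345, 380, 415, 450, 485]             # new-unit rent, dollars

log_pop = [math.log(p) for p in population]       # the transformed variable

def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

a, b = simple_ols(log_pop, rent)
# In these data each doubling of population adds b * ln(2) = 35 dollars.
```

Interpreting the slope of a logged predictor in "per doubling" terms (multiplying by ln 2) is often easier than "per unit of log population."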
34. Scatterplot of rent versus log of population
35. Regression results using log of population
36. Regression results using log of population
37. Fitted linear and logarithmic regressions
38. Analysis of residuals
- Sometimes it is easier to understand what is going on in a regression by looking at the residuals: the errors in prediction of the dependent variable
- Can plot the residuals versus the predicted values to look for patterns in the residuals
- Normally plot standardized residuals
- A perfect fit would then be a horizontal line at 0
- Scatter above and below indicates errors in prediction
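A minimal sketch of computing the plotted quantities, with hypothetical data (in SPSS, standardized residuals can be saved from the Linear Regression dialog's Save options):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8]   # hypothetical observations

# Fit the least-squares line y = a + b*x.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
# Standardize by the residual standard error (n - 2 degrees of freedom).
s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
standardized = [r / s for r in residuals]
# Plot `standardized` against `predicted`: a curved pattern signals
# nonlinearity; a fan shape signals heteroskedasticity.
```

Note that OLS residuals with an intercept always sum to zero, which is why the reference line in the plot sits at 0.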
39. Residuals: predicting rent with population
40. Interpreting the plot of residuals
- Note the curve in the pattern of the residuals
- This indicates the presence of a nonlinear relationship
- It suggests using some transform of the independent variable to create a more linear relationship
- There is also a lot more variation (larger residuals) for areas with smaller populations: the problem of heteroskedasticity
41. Residuals: predicting rent with log of population
42. Interpreting the plot for the revised regression
- More of a random scattering of the residuals than before
- The lack of a distinct pattern suggests the relationship is closer to being linear
- Also, somewhat more even amounts of variation at different population levels, except for the highest: less of a problem of heteroskedasticity
43. Library use as a function of travel time
- Percent using the library in different zip codes as a function of distance to the library
- Linear form
- Negative exponential form
44. Library use as a function of travel time: linear
45. Library use as a function of travel time: negative exponential
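The negative-exponential form can be fitted with ordinary regression by taking the logarithm of the dependent variable. A sketch with hypothetical data generated exactly from use = 54 * exp(-0.3 * time):

```python
import math

travel_time = [1, 2, 3, 4, 5]
# Hypothetical percentages generated from the negative-exponential model.
pct_using = [54 * math.exp(-0.3 * t) for t in travel_time]

# Logging the dependent variable makes the relationship linear:
# ln(use) = ln(54) + (-0.3) * time
log_use = [math.log(p) for p in pct_using]

n = len(travel_time)
xbar, ybar = sum(travel_time) / n, sum(log_use) / n
b = (sum((x - xbar) * (y - ybar) for x, y in zip(travel_time, log_use))
     / sum((x - xbar) ** 2 for x in travel_time))
a = ybar - b * xbar
# Back-transform: use = exp(a) * exp(b * time), recovering 54 and -0.3.
```

This is the mirror image of the rent example: there the independent variable was transformed, here the dependent variable is.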
46. Possible forms for variable transformation
- Could use any mathematical function
- Commonly used functions include
- Natural logarithm of the variable
- Square of the variable
- Inverse of the variable
- The variable and its square (quadratic function)
47. Transforming variables in SPSS
- Use the Transform, Compute command
- For Target Variable, enter a name for the new variable
- Use the Type & Label button to enter a variable label for the new variable
- Create the Numeric Expression as a function of other variable(s), using Functions if necessary
48. Multicollinearity
- High levels of intercorrelation among independent variables can produce unstable estimates of regression coefficients and insignificant results
- This produces regression models that are not very informative or useful
49. Percent of births to teens in Marion County
- Attempt to predict the percentage of births to teenage mothers by census tract in Marion County
- Hypothesis: this would be affected by socioeconomic status
- Conduct the first regression using percent college graduates and median family income
50. First regression output
51. First regression output
52. Expanding the regression
- Both independent variables are significant
- Given that the logic regarding socioeconomic status seems sound, add percent of persons below the poverty level and percent high school graduates
53. Second regression output
54. Second regression output
55. Multicollinearity in the results
- Note that the regression coefficients for percent college grads and median family income are very different and no longer significant
- Nor is percent below the poverty level significant
- This is the problem of multicollinearity
- It results from high levels of correlation among the independent variables
- The regression is no longer very useful
56. Correlations among variables
57. Signs of multicollinearity
- Significant correlations between pairs of independent variables
- Nonsignificant tests for some or all of the regression coefficients when the overall model is significant
- Coefficient signs opposite from what is expected
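The first sign can be checked with a correlation matrix (in SPSS, Analyze, Correlate, Bivariate). A sketch with hypothetical tract-level data showing the kind of near-redundant predictors that cause trouble:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - xbar) ** 2 for a in x)
                           * sum((b - ybar) ** 2 for b in y))

# Hypothetical census-tract values for three socioeconomic predictors.
pct_college = [10, 20, 30, 40, 50]
med_income = [31, 39, 52, 58, 70]    # rises with education
pct_poverty = [30, 24, 17, 12, 7]    # falls with both

r_college_income = pearson(pct_college, med_income)    # near +1
r_college_poverty = pearson(pct_college, pct_poverty)  # near -1
```

Correlations this close to plus or minus 1 among predictors mean the regression cannot cleanly separate their individual effects, which is exactly what the second regression output showed.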
58. SPSS scatterplot creation
- Use the Graphs, Chart Builder command
- Click on the Gallery tab and select Scatter/Dot on the list of graph types in the lower left
- Drag the Simple Scatter icon (the first one) onto the canvas (the blank area at the top)
- Drag the independent variable into the x-axis drop zone
- Drag the dependent variable into the y-axis drop zone
59. SPSS correlation
- Use the Analyze, Correlate, Bivariate command
- Select the variables for the correlation matrix
- Make sure the check mark next to Pearson is selected
- Can specify one-tailed tests of significance if desired (or simply divide the reported Sig. by 2)
60. SPSS linear regression
- Use the Statistics, Regression, Linear command
- Select the Dependent Variable
- Select the Independent Variable
- Can use Statistics to specify descriptive statistics if desired
61. Stepwise regression in SPSS
- Use the Statistics, Regression, Linear command
- Enter all of the candidate independent variables
- Select Method: Stepwise
62. Transforming variables in SPSS
- Use the Transform, Compute command
- For Target Variable, enter a name for the new variable
- Use the Type & Label button to enter a variable label for the new variable
- Create the Numeric Expression as a function of other variable(s), using Functions if necessary