Title: Multiple Regression
1. Multiple Regression
- Simple linear regression
- One independent or predictor variable
- E.g. volume regressed on diameter alone
- Multiple linear regression
- More than one possible independent variable
- E.g. volume on diameter and/or height
2. Multiple regression
- Regression equations are easily fitted using standard packages.
- An additional question arises: which, if any, variables should be included in the equation?
3. Trees data
- Dependent variable: volume
- Explanatory variables: diameter, height
- Possible models: volume regressed on
- Diameter only
- Height only
- Diameter and height
- Neither variable
4. Selection of variables
- Aims (may have both or just one):
- To obtain as simple a model as possible which will give accurate forecasts.
- To explain which variables affect the dependent variable, and in what way.
5. Volume on diameter
- The regression equation is Volume = -36.9 + 5.07 Diameter

  Predictor    Coef      StDev     T        P
  Constant    -36.943    3.365    -10.98    0.000
  Diameter      5.0659   0.2474    20.48    0.000

  S = 4.252   R-Sq = 93.5%   R-Sq(adj) = 93.3%

- High R-sq.
- P-value for diameter 0.000 < 0.05, indicates diameter is significant and should be retained in the model.
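As a quick sanity check, each T value in the output is simply the coefficient divided by its standard error. A short Python sketch using the numbers above:

```python
# T-ratio = estimated coefficient / its standard error.
# Values are copied from the Minitab output for Volume on Diameter.
coef_se = {
    "Constant": (-36.943, 3.365),
    "Diameter": (5.0659, 0.2474),
}

t_ratios = {name: round(coef / se, 2) for name, (coef, se) in coef_se.items()}
print(t_ratios)  # matches the printed T column: -10.98 and 20.48
```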
6. Volume on diameter and height
- The regression equation is Volume = -58.0 + 4.71 Diameter + 0.339 Height

  Predictor    Coef      StDev     T       P
  Constant    -57.988    8.638    -6.71    0.000
  Diameter      4.7082   0.2643   17.82    0.000
  Height        0.3393   0.1302    2.61    0.014

  S = 3.882   R-Sq = 94.8%   R-Sq(adj) = 94.4%

- R-sq slightly higher.
- P-value for height, 0.014 < 0.05, suggests height should be retained in the model.
7. Note
- P-values test the hypothesis that the corresponding coefficient is 0.
- Equivalent to testing that the variable adds nothing to the equation.
- R-sq will always increase when variables are added to the model.
8. Multicollinearity
- When there are significant correlations between the explanatory variables, this is termed multicollinearity.
- As well as posing problems of interpretation, multicollinearity can cause inaccuracy in the estimates of the parameters.
- There is usually some degree of multicollinearity present.
9. Variance Inflation Factors
- Variance inflation factors (VIF) are used to assess whether or not multicollinearity may be a problem.
- Obtained in Minitab by selecting the Options button in the regression dialog box.
- Variance inflation factors of 10 or more are considered a problem.
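The VIF for predictor j is 1/(1 - R_j²), where R_j² is the R-sq from regressing predictor j on all of the other predictors. A minimal sketch (the R_j² values below are invented purely for illustration):

```python
# Variance inflation factor: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the
# R-sq from regressing predictor j on all of the other predictors.
def vif(r_squared_j):
    return 1.0 / (1.0 - r_squared_j)

# Hypothetical R_j^2 values, purely for illustration.
for name, r2 in [("diameter", 0.20), ("height", 0.95)]:
    flag = "problem" if vif(r2) >= 10 else "ok"
    print(name, round(vif(r2), 2), flag)
```

An R_j² of 0.95 gives a VIF of 20, well over the rule-of-thumb threshold of 10.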
10. Methods of selection
- Fit all variables, look at R-sq and p-values, omit variables which are not significant.
- Automatic methods:
- Backward Elimination
- Forward Selection
- Stepwise
- Best Subsets
11. Spider data
- The file spider.mtw contains information on the numbers of a rare species of spider, Eresus niger, on different patches of heathland. The variables are:
- eresus: number of rare spiders
- gorse: density of gorse bushes
- cover: percentage coverage of vegetation
- height: average height of vegetation
- visitors: average number of visitors to the heath per day
- area: area of the heathland patches
12. Stat > Regression > Regression
13. Output
- The regression equation is
  eresus = 222 + 1.14 gorse + 0.181 cover - 0.552 height - 2.74 visitors + 2.59 area

  Predictor    Coef       SE Coef    T       P
  Constant     221.88     36.97      6.00    0.004
  gorse          1.1375    0.3465    3.28    0.030
  cover          0.1806    0.1607    1.12    0.324
  height        -0.55162   0.09012  -6.12    0.004
  visitors      -2.7436    0.3136   -8.75    0.001
  area           2.5864    0.3684    7.02    0.002

- P < 0.05 indicates a significant predictor of spider numbers.
- P > 0.05 indicates the variable may be omitted from the equation.
- The next step would be to fit the equation omitting cover.
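The screening rule above can be sketched in Python, using the p-values from the table. Note that in practice variables are dropped one at a time, refitting after each removal, since the p-values change once a variable is omitted:

```python
# P-values taken from the Minitab output for the full spider model.
p_values = {"gorse": 0.030, "cover": 0.324, "height": 0.004,
            "visitors": 0.001, "area": 0.002}

# Predictors with p >= 0.05 are candidates for omission.
candidates = [name for name, p in p_values.items() if p >= 0.05]
print(candidates)  # only cover fails the 5% test
```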
14. Output continued
- S = 7.380   R-Sq = 97.3%   R-Sq(adj) = 94.0%
- R-sq and R-sq(adj) are high, indicating a useful model.
- The P < 0.05 in the analysis of variance indicates at least some of the predictors in the model are significant.

  Analysis of Variance

  Source           DF    SS       MS       F       P
  Regression        5    7966.2   1593.2   29.25   0.003
  Residual Error    4     217.9     54.5
  Total             9    8184.1
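The F statistic in the analysis of variance is the regression mean square divided by the residual mean square; the figures in the table can be checked directly:

```python
# Mean squares from the ANOVA table: MS = SS / DF.
ss_reg, df_reg = 7966.2, 5
ss_err, df_err = 217.9, 4

ms_reg = ss_reg / df_reg   # 1593.2
ms_err = ss_err / df_err   # 54.5 (approx)
f_stat = ms_reg / ms_err
print(round(f_stat, 2))    # agrees with the printed F = 29.25
```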
15. Adjusted R-sq
- The adjusted R-sq value takes into account the number of variables included in the equation, and therefore does not necessarily increase as the number of variables increases, unlike R-sq.
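In symbols, R-sq(adj) = 1 - (1 - R-sq)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. Checking this against the spider output (n = 10, k = 5, R-Sq = 97.34%):

```python
# Adjusted R-sq penalises for the number of predictors k.
def adj_r_sq(r_sq, n, k):
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)

# Spider data: n = 10 observations, 5 predictors, R-Sq = 97.34%.
val = round(100 * adj_r_sq(0.9734, 10, 5), 1)
print(val)  # reproduces the printed R-Sq(adj) of 94.0
```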
16. Residuals should be checked
17. Graphs
- Are the residuals normally distributed? Yes: the normal plot is linear.
- Can equal variances be assumed? Yes: the residuals show random scatter.
18. Stat > Regression > Best Subsets
19. Summarises different models
- Best Subsets Regression: eresus versus gorse, cover, ...
- Response is eresus

  Vars   R-Sq   R-Sq(adj)   C-p     S         Predictors
  1      32.4   23.9         95.6   26.306    cover
  1      22.4   12.7        110.6   28.174
  2      54.2   41.1         64.8   23.141
  2      49.7   35.4         71.5   24.244    cover, visitors
  3      83.9   75.8         22.2   14.827
  3      69.7   54.5         43.6   20.340    cover, visitors, area
  4      96.5   93.7          5.3    7.5714   gorse, height, visitors, area
  4      90.2   82.3         14.8   12.688    cover, height, visitors, area
20. Mallows Cp
- A good model has Cp close to p, where p is the number of variables plus 1.
- Note that if the maximum number of variables is n, then the full model has Cp = n + 1 exactly.
- Large values of Cp indicate a poor model.
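Cp is computed as SSE_p/MSE_full - (n - 2p), where SSE_p is the residual sum of squares of the candidate model and MSE_full is the residual mean square of the full model. The printed C-p of 5.3 for the best 4-variable model can be reproduced from its S = 7.5714 and the full-model MSE of 54.5 in the ANOVA table:

```python
# Mallows Cp = SSE_p / MSE_full - (n - 2p), with p = number of variables + 1.
n = 10
mse_full = 54.5            # residual mean square of the full 5-variable fit

# Best 4-variable model: p = 5 parameters, and SSE = S^2 * (n - p).
p = 5
sse = 7.5714 ** 2 * (n - p)
cp = sse / mse_full - (n - 2 * p)
print(round(cp, 1))        # close to the printed C-p of 5.3
```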
21. Best Subsets
- Highest R-sq using only one variable is only 32.4.
- Large increase in R-sq by using 2 variables rather than 1, 3 rather than 2, and 4 rather than 3.
- Increase in R-sq and R-sq(adj) going from 4 variables to 5 is small.
- C-p values for the 1, 2 and 3 variable models are high, indicating poor models.
- C-p for the best 4 variable model is close to the expected value of 5.
- This output is just a first step in choosing a model, since it does not give the regression equation or any other diagnostic output.
22. Stepwise, automatic procedures
- Backward Elimination includes all variables in the equation and omits non-significant variables one by one.
- Forward Selection starts by regressing on just one variable, and then variables are added one at a time. The first variable to be entered is the one with the highest correlation with the dependent variable.
- Stepwise regression starts in the same way as the forward selection procedure. When a regressor is added, the backward elimination procedure is used on the variables already in the equation.
23. Stepwise regression
24. Alpha-to-Enter: 0.2   Alpha-to-Remove: 0.2

  Response is eresus on 5 predictors, with N = 10

  Step          1        2        3        4        5        6
  Constant    124.60   198.11    67.81   183.94   221.88   230.65

  cover         0.90     0.89     0.65     0.43     0.18
  T-Value       1.96     2.10     1.75     1.79     1.12
  P-Value      0.086    0.074    0.131    0.134    0.324

  visitors              -0.81    -1.06    -2.45    -2.74    -2.88
  T-Value               -1.56    -2.32    -4.74    -8.75    -9.65
  P-Value               0.164    0.059    0.005    0.001    0.000

  area                            1.68     2.77     2.59     2.70
  T-Value                         1.99     4.42     7.02     7.44
  P-Value                        0.094    0.007    0.002    0.001

  height                                 -0.489   -0.552   -0.586
  T-Value                                -3.23    -6.12    -6.75
  P-Value                                0.023    0.004    0.001

  gorse                                            1.14     1.32
  T-Value                                          3.28     4.24
  P-Value                                         0.030    0.008

  S            26.3     24.2     20.3     12.7     7.38     7.57
  R-Sq        32.35    49.73    69.67    90.17    97.34    96.50
  R-Sq(adj)   23.90    35.37    54.50    82.30    94.01    93.70
  C-p          95.6     71.5     43.6     14.8      6.0      5.3
25. Steps
- Step 1: cover put into the model.
- Step 2: visitors added into the model.
- Steps 3-5: area, height, then gorse added.
- Step 6: cover removed.
- Final model: eresus regressed on visitors, area, height and gorse.
26. Equation
- Can be read from the final step of the stepwise output:
  eresus = 230.65 - 2.88 visitors + 2.70 area - 0.586 height + 1.32 gorse
- Results suggest numbers decline as visitors increase and as the height of vegetation increases; the larger the area, the higher the numbers; and numbers also increase as the amount of gorse increases.
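Reading the coefficients from step 6 (the final column) of the stepwise output, a prediction for a new patch can be sketched as follows; the patch values here are entirely hypothetical:

```python
# Final model coefficients, read from step 6 of the stepwise output.
coef = {"visitors": -2.88, "area": 2.70, "height": -0.586, "gorse": 1.32}
constant = 230.65

# A hypothetical heathland patch, purely to illustrate a prediction.
patch = {"visitors": 20, "area": 30, "height": 120, "gorse": 15}

eresus = constant + sum(coef[v] * patch[v] for v in coef)
print(round(eresus, 1))  # about 203.5 spiders predicted
```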
27. Note
- Different approaches to selection may result in different models if multicollinearity is present.
- Never rely on just one automatic selection method.