Multiple Regression - PowerPoint PPT Presentation

Slides: 28
Provided by: mp2
Transcript and Presenter's Notes

1
Multiple Regression
  • Simple linear regression
  • One independent or predictor variable
  • E.g. volume regressed on diameter alone
  • Multiple linear regression
  • More than one possible independent variable
  • E.g. volume on diameter and/or height

2
Multiple regression
  • Regression equations easily fitted using standard
    packages.
  • Additional question.
  • Which, if any, variables should be included in
    the equation?

3
Trees data
  • Dependent variable: volume
  • Explanatory variables: diameter, height
  • Possible models: volume regressed on
  • Diameter only
  • Height only
  • Diameter and height
  • Neither variable

4
Selection of variables
  • Aim (may have both or just one)
  • to obtain as simple a model as possible which
    will give accurate forecasts.
  • to explain which variables affect the dependent
    variable and in what way.

5
Volume on diameter
  • The regression equation is Volume = -36.9 + 5.07 Diameter

    Predictor  Coef     StDev   T       P
    Constant   -36.943  3.365   -10.98  0.000
    Diameter   5.0659   0.2474  20.48   0.000

  • S = 4.252, R-Sq = 93.5%, R-Sq(adj) = 93.3%
  • High R-sq
  • P-value for diameter, 0.000 < 0.05, indicates
    diameter is significant and should be retained in
    the model.

6
Volume on diameter and height
  • The regression equation is Volume = -58.0 + 4.71 Diameter + 0.339 Height

    Predictor  Coef     StDev   T      P
    Constant   -57.988  8.638   -6.71  0.000
    Diameter   4.7082   0.2643  17.82  0.000
    Height     0.3393   0.1302  2.61   0.014

  • S = 3.882, R-Sq = 94.8%, R-Sq(adj) = 94.4%
  • R-sq slightly higher
  • P-value for height, 0.014 < 0.05, suggests height
    should be retained in the model.

7
Note
  • P-values test the hypothesis that the
    corresponding coefficient is 0.
  • Equivalent to testing that the variable adds
    nothing to the equation.
  • R-sq never decreases (and almost always increases)
    when variables are added into the model.
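
The T column in the outputs above is each coefficient divided by its standard deviation, and the P-value is the two-sided tail probability of that ratio under the hypothesis that the coefficient is 0. A minimal sketch of the first step, using the Coef and StDev values from the volume on diameter and height output. The name `t_ratio` is made up for illustration, and small differences from the printed T values come from rounding in the printed inputs.

```python
def t_ratio(coef, st_dev):
    """Test statistic for the hypothesis that the coefficient is 0."""
    return coef / st_dev


# (Coef, StDev) pairs copied from the "Volume on diameter and height" output.
coefficients = {
    "Constant": (-57.988, 8.638),
    "Diameter": (4.7082, 0.2643),
    "Height": (0.3393, 0.1302),
}

t_values = {name: t_ratio(c, s) for name, (c, s) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:<9} T = {t:7.2f}")
```

A ratio far from 0, such as the one for Diameter, gives a small P-value; the ratio for Height is much closer to 0, which is why its P-value is larger.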

8
Multicollinearity
  • When there are significant correlations between
    the explanatory variables, this is termed
    multicollinearity.
  • As well as posing problems in interpretation,
    multicollinearity can cause inaccuracy in the
    estimates of the parameters.
  • There is usually some degree of multicollinearity
    present.

9
Variance Inflation Factors
  • Used to assess whether or not multicollinearity
    may be a problem
  • Variance inflation factors, VIF.
  • Obtained on Minitab by selecting the Options
    button in the regression dialog box.
  • Variance inflation factors of 10 or more are
    commonly taken to indicate a problem.
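
For reference, the variance inflation factor of a predictor is usually defined as 1 / (1 - R-sq), where R-sq comes from regressing that predictor on all the other predictors. The slides do not state the formula, so the sketch below assumes that standard definition; `vif` is an illustrative name.

```python
def vif(r_squared):
    """Variance inflation factor for a predictor whose regression on the
    other predictors has the given R-squared (a fraction, 0 <= R2 < 1)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


# A predictor unrelated to the others has VIF 1. A VIF of 10 corresponds
# to 90% of the predictor's variation being explained by the others.
for r2 in (0.0, 0.5, 0.9):
    print(f"R-sq = {r2:.1f} -> VIF = {vif(r2):.1f}")
```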

10
Methods of selection
  • Fit all variables, look at R-sq and p-values, and
    omit variables which are not significant.
  • Automatic methods
  • Backward Elimination
  • Forward Selection
  • Stepwise
  • Best subsets

11
Spider data
  • The file spider.mtw contains information on the
    numbers of a rare species of spider, Eresus niger,
    on different patches of heathland. The variables
    are:
  • eresus: number of the rare spider
  • gorse: density of gorse bushes
  • cover: percentage coverage of vegetation
  • height: average height of vegetation
  • visitors: average number of visitors to heath per
    day
  • area: area of heathland patches

12
Stat > Regression > Regression
13
Output
  • The regression equation is
    eresus = 222 + 1.14 gorse + 0.181 cover - 0.552 height - 2.74 visitors + 2.59 area

    Predictor  Coef      SE Coef  T      P
    Constant   221.88    36.97    6.00   0.004
    gorse      1.1375    0.3465   3.28   0.030
    cover      0.1806    0.1607   1.12   0.324
    height     -0.55162  0.09012  -6.12  0.004
    visitors   -2.7436   0.3136   -8.75  0.001
    area       2.5864    0.3684   7.02   0.002

  • P < 0.05 indicates a significant predictor of
    spider numbers.
  • P > 0.05 indicates the variable may be omitted from
    the equation.
  • The next step would be to fit the equation omitting
    cover.

14
Output continued
  • S = 7.380, R-Sq = 97.3%, R-Sq(adj) = 94.0%
  • R-sq and R-sq(adj) are high, indicating a useful
    model.
  • The P < 0.05 in the analysis of variance
    indicates at least some of the predictors in the
    model are significant.
  • Analysis of Variance

    Source          DF  SS      MS      F      P
    Regression      5   7966.2  1593.2  29.25  0.003
    Residual Error  4   217.9   54.5
    Total           9   8184.1
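
The MS, F, S and R-Sq figures all follow from the DF and SS columns of the analysis of variance. A short sketch of that arithmetic, using only the values printed in the table; the variable names are illustrative, and results agree with the output only up to rounding of the printed sums of squares.

```python
import math

# Values copied from the Analysis of Variance table.
ss_regression, df_regression = 7966.2, 5
ss_residual, df_residual = 217.9, 4
ss_total = ss_regression + ss_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
s = math.sqrt(ms_residual)             # residual standard deviation
r_sq = 100 * ss_regression / ss_total  # as a percentage

print(f"F = {f_statistic:.2f}, S = {s:.3f}, R-Sq = {r_sq:.1f}%")
```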

15
Adjusted R-sq
  • Adjusted R-sq value takes into account the number
    of variables being included in the equation and
    therefore does not necessarily increase as the
    number of variables increases, unlike R-sq
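
A small sketch of the adjustment, assuming the usual formula 1 - (1 - R-sq)(n - 1)/(n - k - 1) for n observations and k variables (the slides do not state it). The first two R-sq values are taken from the spider outputs; the third is a made-up value used only to show that adding a variable can lower adjusted R-sq.

```python
def adjusted_r_sq(r_sq, n_obs, k_vars):
    """Adjusted R-sq (percent) from R-sq (percent), the number of
    observations and the number of variables (excluding the constant)."""
    return 100 * (1 - (1 - r_sq / 100) * (n_obs - 1) / (n_obs - k_vars - 1))


n_obs = 10  # heathland patches in the spider data

five_var = adjusted_r_sq(97.34, n_obs, 5)  # all five variables
four_var = adjusted_r_sq(96.50, n_obs, 4)  # best four-variable model

# Hypothetical: a fifth variable that lifts R-sq only from 96.50 to 96.60.
weak_fifth = adjusted_r_sq(96.60, n_obs, 5)

print(f"{five_var:.1f} {four_var:.1f} {weak_fifth:.1f}")
```

In the hypothetical case R-sq rises but adjusted R-sq falls below the four-variable value, because the small gain in fit does not offset the lost degree of freedom.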

16
Residuals should be checked
17
Graphs


  • Are residuals normally distributed? Yes: the normal
    plot is linear.
  • Can equal variances be assumed? Yes: the residuals
    show random scatter.
18
Stat > Regression > Best Subsets
19
Summarises different models
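  • Best Subsets Regression: eresus versus gorse,
    cover, ...
  • Response is eresus

    Vars  R-Sq  R-Sq(adj)  C-p    S       Variables in model
    1     32.4  23.9       95.6   26.306  cover
    1     22.4  12.7       110.6  28.174  (one variable, not recoverable)
    2     54.2  41.1       64.8   23.141  (two variables, not recoverable)
    2     49.7  35.4       71.5   24.244  cover, visitors
    3     83.9  75.8       22.2   14.827  (three variables, not recoverable)
    3     69.7  54.5       43.6   20.340  cover, visitors, area
    4     96.5  93.7       5.3    7.5714  gorse, height, visitors, area
    4     90.2  82.3       14.8   12.688  cover, height, visitors, area

  • Variables are named where the row's R-Sq, C-p and S
    match a step of the stepwise output; the X marks in
    the remaining rows have lost their column positions.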

20
Mallows Cp
  • A good model has Cp close to p, where p is the
    number of variables plus 1.
  • Note that if the maximum number of variables is n,
    then C(n+1) = n + 1 for the model containing all of
    them.
  • Large values of Cp indicate a poor model.
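
The C-p values in the best subsets output can be reproduced from quantities already shown. The sketch below assumes the standard definition Cp = SSE_p / MSE_full - (N - 2p), which the slides do not state, with N the number of observations and p the number of variables plus 1. The residual sum of squares of a subset model is recovered from its S value; all names are illustrative.

```python
def mallows_cp(sse_subset, mse_full, n_obs, p):
    """Mallows' Cp for a subset model with p parameters
    (variables plus the constant) fitted to n_obs observations."""
    return sse_subset / mse_full - (n_obs - 2 * p)


n_obs = 10
mse_full = 217.9 / 4  # residual mean square of the five-variable model

# SSE of a subset model from its S value: SSE = S^2 * (n_obs - p).
sse_best_4 = 7.5714 ** 2 * (n_obs - 5)
cp_best_4 = mallows_cp(sse_best_4, mse_full, n_obs, p=5)

sse_best_1 = 26.306 ** 2 * (n_obs - 2)
cp_best_1 = mallows_cp(sse_best_1, mse_full, n_obs, p=2)

# The full model always gives Cp equal to its own number of parameters.
cp_full = mallows_cp(217.9, mse_full, n_obs, p=6)

print(f"{cp_best_1:.1f} {cp_best_4:.1f} {cp_full:.1f}")
```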

21
Best Subsets
  • Highest R-sq using only one variable is only
    32.4.
  • Large increase in R-sq by using 2 variables
    rather than 1 variable, 3 rather than 2, 4 rather
    than 3.
  • Increase in R-sq and R-sq(adj) going from 4
    variables to 5 is small.
  • C-p values for the 1-, 2- and 3-variable models are
    high, indicating poor models.
  • C-p for best 4 variable model is close to the
    expected value of 5.
  • This output is just a first step in choosing a
    model since it does not give the regression
    equation or any other diagnostic output.

22
Stepwise, automatic procedures
  • Backward Elimination includes all variables in
    the equation and omits non-significant variables
    one by one.  
  • Forward Selection starts by regressing on just
    one explanatory variable, and then variables are
    added one at a time. The first variable to be
    entered is the one with the highest correlation
    with the dependent variable.
  • Stepwise regression starts in the same way as
    the forward selection procedure. When a regressor
    is added the backward elimination procedure is
    used on the variables already in the equation.
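
As a rough illustration of the first procedure, backward elimination can be sketched as a loop that refits the model and drops the least significant variable until none exceeds the removal threshold. This is a simplified sketch, not Minitab's implementation: `fit_p_values` is a hypothetical callback standing in for a real regression fit, the 0.05 threshold is an assumption, and the lookup table below only holds the p-values printed in the spider outputs for the two models this run visits.

```python
def backward_elimination(variables, fit_p_values, alpha_to_remove=0.05):
    """Drop the least significant variable, one at a time, until every
    remaining p-value is at or below alpha_to_remove."""
    current = list(variables)
    while current:
        p_values = fit_p_values(current)
        worst = max(current, key=lambda v: p_values[v])
        if p_values[worst] <= alpha_to_remove:
            break
        current.remove(worst)
    return current


# Stand-in for refitting the regression: p-values copied from the outputs.
KNOWN_FITS = {
    frozenset(["gorse", "cover", "height", "visitors", "area"]): {
        "gorse": 0.030, "cover": 0.324, "height": 0.004,
        "visitors": 0.001, "area": 0.002,
    },
    frozenset(["gorse", "height", "visitors", "area"]): {
        "gorse": 0.008, "height": 0.001, "visitors": 0.000, "area": 0.001,
    },
}


def lookup_p_values(variables):
    return KNOWN_FITS[frozenset(variables)]


final_model = backward_elimination(
    ["gorse", "cover", "height", "visitors", "area"], lookup_p_values
)
print(final_model)
```

With these p-values only cover is dropped, leaving gorse, height, visitors and area.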

23
Stepwise regression
24
Alpha-to-Enter: 0.2   Alpha-to-Remove: 0.2
Response is eresus on 5 predictors, with N = 10

Step       1       2       3       4       5       6
Constant   124.60  198.11  67.81   183.94  221.88  230.65

cover      0.90    0.89    0.65    0.43    0.18
T-Value    1.96    2.10    1.75    1.79    1.12
P-Value    0.086   0.074   0.131   0.134   0.324

visitors           -0.81   -1.06   -2.45   -2.74   -2.88
T-Value            -1.56   -2.32   -4.74   -8.75   -9.65
P-Value            0.164   0.059   0.005   0.001   0.000

area                       1.68    2.77    2.59    2.70
T-Value                    1.99    4.42    7.02    7.44
P-Value                    0.094   0.007   0.002   0.001

height                             -0.489  -0.552  -0.586
T-Value                            -3.23   -6.12   -6.75
P-Value                            0.023   0.004   0.001

gorse                                      1.14    1.32
T-Value                                    3.28    4.24
P-Value                                    0.030   0.008

S          26.3    24.2    20.3    12.7    7.38    7.57
R-Sq       32.35   49.73   69.67   90.17   97.34   96.50
R-Sq(adj)  23.90   35.37   54.50   82.30   94.01   93.70
C-p        95.6    71.5    43.6    14.8    6.0     5.3
25
Steps
  • Step 1: cover put into model
  • Step 2: visitors added into model
  • Steps 3-5: area, height, then gorse added
  • Step 6: cover removed
  • Final model:
  • eresus regressed on visitors, area, height and
    gorse.

26
Equation
  • Can be read from stepwise output
  • eresus = 230.65 - 2.88 visitors + 2.70 area
    - 0.586 height + 1.32 gorse (coefficients from step 6)
  • Results suggest numbers decline as visitors
    increase and as the height of vegetation increases;
    the larger the area, the higher the numbers; and
    numbers also increase as the amount of gorse
    increases.
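
The interpretation above follows from the signs of the coefficients. A minimal sketch, using the step 6 column of the stepwise output: the patch values are invented purely for illustration (the slides give no raw data or units), and `predict_eresus` is a made-up name.

```python
# Constant and coefficients from the Step 6 column of the stepwise output.
CONSTANT = 230.65
COEFFICIENTS = {"visitors": -2.88, "area": 2.70, "height": -0.586, "gorse": 1.32}


def predict_eresus(patch):
    """Fitted spider count for a patch described by a dict of predictors."""
    return CONSTANT + sum(COEFFICIENTS[name] * patch[name] for name in COEFFICIENTS)


# Hypothetical patch, invented for illustration only.
patch = {"visitors": 20, "area": 10, "height": 100, "gorse": 5}
busier_patch = dict(patch, visitors=patch["visitors"] + 1)

# One extra visitor per day, other variables held fixed, changes the
# fitted count by the visitors coefficient.
change = predict_eresus(busier_patch) - predict_eresus(patch)
print(f"change per extra visitor: {change:.2f}")
```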

27
Note
  • Different approaches to selection may result in
    different models if multicollinearity is present.
  • Never rely on just one automatic selection
    method.