Title: Chapter 11 Multiple Linear Regression
1 Chapter 11 Multiple Linear Regression
2 Our Group Members
3 Content
- Multiple Regression Model --- Yifan Wang
- Statistical Inference --- Shaonan Zhang, Yicheng Li
- Variable Selection Methods (SAS) --- Guangtao Li, Ruixue Wang
- Strategy for Building a Model and Data Transformation --- Xiaoyu Zhang, Siyuan Luo
- Topics in Regression Modeling --- Yikang Chai, Tao Li
- Summary --- Xing Chen
4 Ch 11.1-11.3 Introduction to Multiple Linear Regression
- Yifan Wang
- Dec. 6th, 2007
5 - In Chapter 10 we studied how to fit a linear relationship between a response variable y and a single predictor variable x.
- Sometimes, however, a problem cannot be handled by simple linear regression, because there are two or more predictor variables.
- For example, the salary of a company employee may depend on
  - job category
  - years of experience
  - education
  - performance evaluations
6 What do we need to do?
- Extend the simple linear regression model to the case of two or more predictor variables.
- Multiple Linear Regression (or simply Multiple Regression) is the statistical methodology used to fit such models.
7 Multiple Linear Regression
- In multiple regression we fit a model of the form (excluding the error term)
  y = β0 + β1x1 + β2x2 + … + βk xk,
  where x1, x2, …, xk are predictor variables and β0, β1, …, βk are k+1 unknown parameters. The model is linear in the parameters.
- For example, this model includes the kth-degree polynomial model in a single variable x, namely
  y = β0 + β1x + β2x² + … + βk x^k,
  since we can put x1 = x, x2 = x², …, xk = x^k.
8 11.1 A Probabilistic Model for Multiple Linear Regression
- Regard the response variable as random.
- Regard the predictor variables as nonrandom.
- The data for multiple regression consist of n vectors of observations (x_i1, x_i2, …, x_ik, y_i) for i = 1, 2, …, n.
- Example 1
  - The response variable y_i: the salary of the ith person in the sample.
  - The predictor variables: x_i1 = his/her years of experience, x_i2 = his/her years of education.
9 Example 2
y_i is the observed value of the r.v. Y_i, which depends on the fixed predictor values x_i1, x_i2, …, x_ik according to the following model:
  Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i  (i = 1, 2, …, n),
where ε_i is a random error with E(ε_i) = 0 and Var(ε_i) = σ², and β0, β1, …, βk are unknown parameters. Assume the ε_i are independent N(0, σ²) random variables. Then the Y_i are independent N(μ_i, σ²) random variables with
  μ_i = β0 + β1 x_i1 + … + βk x_ik.
10 11.2 Fitting the Multiple Regression Model
11.2.1 Least Squares (LS) Fit
- The LS estimates of the unknown parameters β0, β1, …, βk minimize
  Q = Σ_{i=1}^n [y_i − (β0 + β1 x_i1 + … + βk x_ik)]².
- The LS estimates can be found by setting the first partial derivatives of Q with respect to β0, β1, …, βk equal to zero.
- The result is a set of (k+1) simultaneous linear equations in (k+1) unknowns. The resulting solutions, β̂0, β̂1, …, β̂k, are the least squares (LS) estimates of β0, β1, …, βk, respectively.
11 11.2.2 Goodness of Fit of the Model
- To assess the goodness of fit of the LS model, we use the residuals, defined by
  e_i = y_i − ŷ_i  (i = 1, 2, …, n),
  where the ŷ_i are the fitted values.
- An overall measure of the goodness of fit is the error sum of squares, SSE = Σ e_i².
- Compare it to the total sum of squares, SST = Σ (y_i − ȳ)².
- As in Chapter 10, define the regression sum of squares, given by SSR = SST − SSE.
12 - The coefficient of multiple determination:
  r² = SSR/SST = 1 − SSE/SST.
- 0 ≤ r² ≤ 1, and values closer to 1 represent better fits.
- Adding predictor variables generally increases r²; thus r² can be made to approach 1 simply by increasing the number of predictors.
- The multiple correlation coefficient r is the positive square root of r²:
  - Only the positive square root is used.
  - r is a measure of the strength of the association between the predictor variables and the one response variable.
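As a numeric sketch (Python rather than the deck's SAS), r² can be computed directly from the sums of squares. The numbers come from the Step-1 SAS stepwise output shown later in this deck:

```python
# Sketch: r^2 from the sums of squares, using the Step-1 numbers printed
# by the SAS stepwise example later in this deck.
ssr = 20667.0    # regression sum of squares (Model SS)
sse = 35797.0    # error sum of squares
sst = ssr + sse  # total sum of squares (SAS prints Corrected Total 56465)

r_squared = ssr / sst     # coefficient of multiple determination
r = r_squared ** 0.5      # multiple correlation coefficient (positive root)

print(round(r_squared, 4))  # SAS reports R-Square = 0.3660
```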
13 11.3 Multiple Regression Model in Matrix Notation
- The multiple regression model can be presented in a compact form by using matrix notation. Let
  y = (y1, …, yn)', Y = (Y1, …, Yn)', and ε = (ε1, …, εn)'
  be the n × 1 vectors of the observed values, the r.v.'s, and the random errors, respectively. Next let X be the n × (k+1) matrix of the values of the predictor variables, whose ith row is (1, x_i1, …, x_ik).
14 Finally, let
  β = (β0, β1, …, βk)' and β̂ = (β̂0, β̂1, …, β̂k)'
- be the (k+1) × 1 vectors of the unknown parameters and their LS estimates, respectively.
- The model can be rewritten as Y = Xβ + ε.
- The simultaneous linear equations whose solutions yield the LS estimates can be written in matrix notation as the normal equations X'Xβ̂ = X'y.
- If the inverse of the matrix X'X exists, then the solution is given by β̂ = (X'X)⁻¹X'y.
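The matrix solution can be sketched in Python using only the standard library (a small Gaussian-elimination solver stands in for a linear-algebra package). The toy data are made up for illustration: y is exactly 1 + 2·x1 + 3·x2, so the normal equations should recover those coefficients:

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with
    partial pivoting (a is a list of rows, b a list)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Made-up data with an exact linear relationship y = 1 + 2*x1 + 3*x2.
x1 = [0.0, 1.0, 0.0, 1.0, 2.0]
x2 = [0.0, 0.0, 1.0, 1.0, 1.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]  # n x (k+1) design matrix
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(y))) for c in range(3)]
       for r in range(3)]                                          # X'X
Xty = [sum(X[i][r] * y[i] for i in range(len(y))) for r in range(3)]  # X'y
beta_hat = solve(XtX, Xty)  # normal equations: X'X beta_hat = X'y
print([round(b, 6) for b in beta_hat])
```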
15 11.4 Statistical Inference
16 Statistical Inference on β's ---- General Hypothesis Test
- Determining the statistical significance of predictor variables:
- For each predictor variable x_j, we test the hypotheses
  H0: βj = 0 vs. H1: βj ≠ 0.
- If we cannot reject H0, x_j can be dropped from the model.
17 Statistical Inference on β's ---- General Hypothesis Test
- Pivotal Quantity:
  T = (β̂j − βj) / SE(β̂j) ~ t_{n−(k+1)}.
- Recall that s² = SSE/[n − (k+1)] = MSE is an unbiased estimate of σ².
- n − (k+1) is the error degrees of freedom.
18 Statistical Inference on β's ---- General Hypothesis Test
- Confidence Interval for βj
- Note that P(−t_{n−(k+1),α/2} ≤ T ≤ t_{n−(k+1),α/2}) = 1 − α.
- So, the 100(1 − α)% CI is
  β̂j ± t_{n−(k+1),α/2} SE(β̂j),
  where SE(β̂j) = s√v_jj and v_jj is the jth diagonal entry of (X'X)⁻¹.
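A numeric sketch of the CI (pure arithmetic, in Python rather than SAS): the estimate and standard error for x1 come from the Step-2 SAS output shown later in this deck, and the critical value t_{23,.025} ≈ 2.069 is assumed from a t table:

```python
# 95% CI for beta_1 from the Step-2 SAS output shown later in this deck:
# estimate 1.76486, standard error 0.37904, error df = 23.
b1, se1 = 1.76486, 0.37904
t_crit = 2.069  # t_{23, .025} from a t table (assumed here)

lo, hi = b1 - t_crit * se1, b1 + t_crit * se1
print((round(lo, 3), round(hi, 3)))  # CI excludes 0, so x1 is significant

# Consistency check: the squared t statistic equals the Type II F
# that SAS prints for x1 (21.68).
t_stat = b1 / se1
print(round(t_stat ** 2, 2))
```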
19 Statistical Inference on β's ---- General Hypothesis Test
- Hypothesis Test: H0: βj = βj0 vs. H1: βj ≠ βj0.
- In particular, when βj0 = 0, we reject H0 if
  |t| = |β̂j| / SE(β̂j) > t_{n−(k+1),α/2}.
- This controls P(Reject H0 | H0 is true) = α.
20 Statistical Inference on β's ---- Another Hypothesis Test
- Hypothesis: H0: β1 = β2 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j.
- Pivotal Quantity:
  F = MSR/MSE ~ F_{k, n−(k+1)} under H0.
- P-value: P(F_{k, n−(k+1)} > f), where f is the observed value of the statistic.
- If the P-value is less than α, we reject H0; we then use the previous test to examine the individual βj.
21 Statistical Inference on β's ---- Another Hypothesis Test
- ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)        F
Regression            SSR                   k                           MSR = SSR/k             F = MSR/MSE
Error                 SSE                   n − (k+1)                   MSE = SSE/[n − (k+1)]
Total                 SST                   n − 1
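The table can be filled in by hand; a Python sketch using the Step-2 sums of squares from the SAS stepwise example later in this deck (k = 2 predictors, n = 26 weeks):

```python
# Sketch: computing the ANOVA F by hand from the Step-2 SAS numbers
# (Model SS = 27663, Error SS = 28802, n = 26, k = 2).
n, k = 26, 2
ssr, sse = 27663.0, 28802.0

msr = ssr / k              # mean square for regression
mse = sse / (n - (k + 1))  # mean square error, df = 23
f = msr / mse

print(round(mse, 2), round(f, 2))  # SAS reports F Value 11.05
```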
22 Statistical Inference on β's ---- Test Subsets of Parameters
- Full Model: Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i (all k predictors).
- Partial Model: Y_i = β0 + β1 x_i1 + … + βm x_im + ε_i (the first m < k predictors).
- Hypothesis: H0: βm+1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j in {m+1, …, k}.
- Test statistic:
  F = [(SSE_m − SSE_k)/(k − m)] / {SSE_k/[n − (k+1)]} ~ F_{k−m, n−(k+1)} under H0.
- Reject H0 when F > f_{k−m, n−(k+1), α}.
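As a numeric sketch (Python arithmetic, numbers from the SAS stepwise example later in this deck): does adding x2 to a model that already contains x1 help?

```python
# Subset F test: partial model {x1} has SSE = 35797; the fuller model
# {x1, x2} has SSE = 28802 with error df = 23 (n = 26).
sse_partial, df_drop = 35797.0, 1  # one extra parameter is being tested
sse_full, df_full = 28802.0, 23    # error df of the fuller model

f = ((sse_partial - sse_full) / df_drop) / (sse_full / df_full)
print(round(f, 2))  # matches the Type II F for x2 (5.59) in the SAS output
```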
23 Prediction of Future Observations
- Let x* = (1, x*1, …, x*k)' and Y* = x*'β + ε, with mean μ* = E(Y*) = x*'β.
- Whether we want a CI (Confidence Interval) for μ* or a PI (Prediction Interval) for Y* itself,
- we have Ŷ* = x*'β̂ and Var(Ŷ*) = σ² x*'(X'X)⁻¹x*.
- Pivotal Quantity:
  T = (Ŷ* − μ*) / [s √(x*'(X'X)⁻¹x*)] ~ t_{n−(k+1)}.
- A (1 − α)-level CI to estimate μ*: Ŷ* ± t_{n−(k+1),α/2} s √(x*'(X'X)⁻¹x*).
- A (1 − α)-level PI to predict Y*: Ŷ* ± t_{n−(k+1),α/2} s √(1 + x*'(X'X)⁻¹x*).
24 11.7 Variable Selection Methods
Guangtao Li, RuiXue Wang
25 1. Why do we need variable selection methods?
- 2. Two methods are introduced:
  - Stepwise Regression
  - Best Subsets Regression
26 11.7.1 Stepwise Regression
27 Recall the test for subsets of parameters in 11.4:
- Full model: Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i  (i = 1, 2, …, n)
- Partial model: Y_i = β0 + β1 x_i1 + … + βm x_im + ε_i  (i = 1, 2, …, n)
- H0: βm+1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j in {m+1, …, k}.
28 - (p−1)-variable model: Y_i = β0 + β1 x_i1 + … + β_{p−1} x_{i,p−1} + ε_i
- p-variable model: Y_i = β0 + β1 x_i1 + … + β_p x_ip + ε_i
29 Partial F-test
  F_p = (SSE_{p−1} − SSE_p) / [SSE_p/(n − p − 1)] ~ F_{1, n−p−1} under H0: βp = 0.
30 Partial Correlation Coefficients
  r²_{yx_p | x_1, …, x_{p−1}} = (SSE_{p−1} − SSE_p) / SSE_{p−1}.
- We should add x_p to the regression equation only if F_p is large enough, i.e., only if βp is statistically significant.
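As a numeric sketch in Python (not SAS), using the SSE values from the stepwise example below (n = 26, p = 2, x2 entering after x1):

```python
# Partial F and partial correlation when x2 enters after x1.
n, p = 26, 2
sse_prev, sse_curr = 35797.0, 28802.0  # SSE with x1 only, then x1 + x2

partial_r2 = (sse_prev - sse_curr) / sse_prev
f_p = (sse_prev - sse_curr) / (sse_curr / (n - p - 1))

# The two are linked: F_p = partial_r2 / (1 - partial_r2) * (n - p - 1).
print(round(partial_r2, 4), round(f_p, 2))
```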
31 Stepwise Regression Algorithm
32 SAS Program for the Algorithm
- Example: The Director of Broadcasting Operations for a television station wants to study the issue of standby hours, i.e., hours in which unionized graphic artists at the station are paid but are not actually involved in any activity. We are trying to predict the total number of Standby Hours per Week (Y). Possible explanatory variables are Total Staff Present (X1), Remote Hours (X2), Dubner Hours (X3) and Total Labor Hours (X4). The results for 26 weeks are given below.
34 data test;
input y x1 x2 x3 x4;
datalines;
245 338 414 323 2001
177 333 598 340 2030
271 358 656 340 2226
211 372 631 352 2154
196 339 528 380 2078
135 289 409 339 2080
195 334 382 331 2073
118 293 399 311 1758
116 325 343 328 1624
147 311 338 353 1889
154 304 353 518 1988
146 312 289 440 2049
115 283 388 276 1796
35 - 161 307 402 207 1720
- 274 322 151 287 2056
- 245 335 228 290 1890
- 201 350 271 355 2187
- 183 339 440 300 2032
- 237 327 475 284 1856
- 175 328 347 337 2068
- 152 319 449 279 1813
- 188 325 336 244 1808
- 188 322 267 253 1834
- 197 317 235 272 1973
- 261 315 164 223 1839
- 232 331 270 272 1935
- run;
- proc reg data=test;
- model y = x1 x2 x3 x4 / selection=stepwise;
- run;
36 - Selected SAS Output

Stepwise Selection: Step 1
Variable x1 Entered: R-Square = 0.3660 and C(p) = 13.3215

Analysis of Variance
                           Sum of        Mean
Source            DF      Squares      Square    F Value    Pr > F
Model              1        20667       20667      13.86    0.0011
Error             24        35797  1491.55073
Corrected Total   25        56465

                Parameter     Standard
37 Stepwise Selection: Step 2
Variable x2 Entered: R-Square = 0.4899 and C(p) = 8.4193

Analysis of Variance
                           Sum of        Mean
Source            DF      Squares      Square    F Value    Pr > F
Model              2        27663       13831      11.05    0.0004
Error             23        28802  1252.26402
Corrected Total   25        56465

                Parameter     Standard
Variable         Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -330.67483    116.48022         10092       8.06    0.0093
x1               1.76486      0.37904         27149      21.68    0.0001
x2              -0.13897      0.05880    6995.14489       5.59    0.0269
38 SAS Output (cont.)
- All variables left in the model are significant at the 0.1500 level.
- No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection
      Variable   Variable   Number    Partial   Model
Step  Entered    Removed    Vars In   R-Square  R-Square   C(p)      F Value   Pr > F
1     x1                    1         0.3660    0.3660     13.3215   13.86     0.0011
2     x2                    2         0.1239    0.4899     8.4193    5.59      0.0269
40 11.7.2 Best Subsets Regression
41 11.7.2 Best Subsets Regression
- In practice there are often several almost equally good models, and the choice of the final model may depend on side considerations such as the number of variables, the ease of observing and/or controlling variables, etc. The best subsets regression algorithm permits determination of a specified number of best subsets of size p = 1, 2, …, k, from which the choice of the final model can be made by the investigator.
42 11.7.2 Best Subsets Regression
- Optimality Criteria
- rp²-Criterion: based on the coefficient of determination r²_p = 1 − SSE_p/SST of a p-variable model; since r²_p always increases with p, the adjusted r² is usually used for comparisons across model sizes.
43 Cp-Criterion (recommended for its ease of computation and its ability to judge the predictive power of a model)
- The sample estimator, Mallows' Cp-statistic, is given by
  Cp = SSE_p/σ̂² − [n − 2(p+1)],
  where σ̂² is the MSE of the full model with all k predictors.
- Cp is an almost unbiased estimator of the model's standardized total expected prediction error; models with small Cp (close to p + 1) are preferred.
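A Python sketch of the Cp computation for the one-variable subset {x1} in the example below. The full-model MSE σ̂² is not printed in the SAS excerpt shown, so the value used here (about 1013.5) is backed out from the C(p) values SAS reports and is an assumption of this sketch:

```python
# Mallows' Cp for the subset {x1}: Cp = SSE_p / sigma2_full - (n - 2(p+1)).
n = 26
sse_p, p = 35797.0, 1  # SSE and size of the candidate subset {x1}
sigma2_full = 1013.5   # assumed full-model MSE (backed out of SAS's C(p))

cp = sse_p / sigma2_full - (n - 2 * (p + 1))
print(round(cp, 2))  # SAS reports C(p) = 13.3215 for x1
```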
44 PRESSp Criterion
- The total prediction error sum of squares is
  PRESSp = Σ_{i=1}^n (y_i − ŷ_{(i)})²,
  where ŷ_{(i)} is the predicted value for the ith observation from the model fitted without it.
- This criterion evaluates the predictive ability of a postulated model by omitting one observation at a time, fitting the model based on the remaining observations, and computing the predicted value for the omitted observation. The PRESSp criterion is intuitively easier to grasp than the Cp-criterion, but it is computationally much more intensive and is not available in many packages.
45 SAS Program
- data test;
- input y x1 x2 x3 x4;
- datalines;
- 245 338 414 323 2001
- 177 333 598 340 2030
- 271 358 656 340 2226
- 211 372 631 352 2154
- 196 339 528 380 2078
- 135 289 409 339 2080
- 195 334 382 331 2073
- 118 293 399 311 1758
- 116 325 343 328 1624
- 147 311 338 353 1889
- 154 304 353 518 1988
- 146 312 289 440 2049
- 115 283 388 276 1796
46 SAS Program (cont.)
- 161 307 402 207 1720
- 274 322 151 287 2056
- 245 335 228 290 1890
- 201 350 271 355 2187
- 183 339 440 300 2032
- 237 327 475 284 1856
- 175 328 347 337 2068
- 152 319 449 279 1813
- 188 325 336 244 1808
- 188 322 267 253 1834
- 197 317 235 272 1973
- 261 315 164 223 1839
- 232 331 270 272 1935
- run;
- proc reg data=test;
- model y = x1 x2 x3 x4 / selection=rsquare adjrsq cp mse;
- run;
47 Results

Number in              Adjusted
  Model    R-Square    R-Square      C(p)           MSE    Variables in Model
      1      0.3660      0.3396   13.3215    1491.55073    x1
      1      0.1710      0.1365   24.1846    1950.27491    x4
      1      0.0597      0.0205   30.3884    2212.24598    x3
      1      0.0091      -.0322   33.2078    2331.30545    x2
----------------------------------------------------------------------
      2      0.4899      0.4456    8.4193    1252.26402    x1 x2
      2      0.4499      0.4021   10.6486    1350.49234    x1 x3
      2      0.4288      0.3791   11.8231    1402.24672    x3 x4
      2      0.3754      0.3211   14.7982    1533.34044    x1 x4
      2      0.2238      0.1563   23.2481    1905.67595    x2 x4
      2      0.0612      -.0205   32.3067    2304.83375    x2 x3
----------------------------------------------------------------------
      3      0.5378      0.4748    7.7517    1186.29444    x1 x3 x4
      3      0.5362      0.4729    7.8418    1190.44739    x1 x2 x3
      3      0.5092      0.4423    9.3449    1259.69053    x1 x2 x4
48 11.7.2 Best Subsets Regression SAS
- The source of the example is http://www.math.udel.edu/teaching/course_materials/m202_climent/Multiple%20Regression%20-%20Model%20Building.pdf
49 11.5, 11.8 Building A Multiple Regression Model
by SiYuan Luo, Xiaoyu Zhang
50 Introduction
- Building a multiple regression model consists of 7 steps.
- Though it is not necessary to follow each and every step in the exact sequence shown on the next slide, the general approach and major steps should be followed.
- Model building is an iterative process; it may take several cycles of the steps before arriving at the final model.
51 The 7 steps
1. Decide the type
2. Collect the data
3. Explore the data
4. Divide the data
5. Fit candidate models
6. Select and evaluate
7. Select the final model
52 Step 1: Decide the type
- Decide the type of model needed; the different types of models include:
  - Predictive: a model used to predict the response variable from a chosen set of predictor variables.
  - Theoretical: a model based on a theoretical relationship between a response variable and predictor variables.
  - Control: a model used to control a response variable by manipulating predictor variables.
  - Inferential: a model used to explore the strength of relationships between a response variable and individual predictor variables.
  - Data summary: a model used primarily as a device to summarize a large set of data by a single equation.
- Often a model can be used for multiple purposes.
- The type of model dictates the type of data needed.
53 Step 2: Collect the data
- Decide the variables (predictor and response) on which to collect data. The variables should be measured in a way appropriate to the type of subject under study.
- See Chapter 3 for the precautions necessary to obtain relevant, bias-free data.
54 Step 3: Explore the data
- The data should be examined for outliers, gross errors, missing values, etc. on a univariate basis using the techniques discussed in Chapter 4. Outliers cannot simply be omitted, because much useful information can be lost; see Chapter 10 for how to deal with outliers.
- Scatter plots should be made to study the bivariate relationships between the response variable and each of the predictors. They are useful in suggesting possible transformations to linearize the relationships.
55 Step 4: Divide the data
- Divide the data into training and test sets: only a subset of the data, the training set, should be used to fit the model (steps 5 and 6); the remainder, called the test set, should be used for cross-validation of the fitted model (step 7).
- The reason for using an independent data set to test the model is that if the same data are used for both fitting and testing, an overoptimistic estimate of the predictive ability of the fitted model is obtained.
- The split into the two sets should be done randomly.
56 Step 5: Fit candidate models
- Generally several equally good models can be identified using the training data set.
- By conducting several stepwise runs with different FIN and FOUT values, we can identify several models that fit the training set well.
57 Step 6: Select and evaluate
- From the list of candidate models we are now ready to select two or three good models based on criteria such as the Cp-statistic, the number of predictors (p), and the nature of the predictors.
- These selected models should be checked for violation of the model assumptions using standard diagnostic techniques, in particular residual plots. Transformations of the response variable or some of the predictor variables may be necessary to improve model fit.
58 Step 7: Select the final model
- This is the step where we compare the competing models by cross-validating them against the test data.
- The model with the smaller cross-validation SSE is the better predictive model.
- The final selection of the model is based on a number of considerations, both statistical and nonstatistical. These include residual plots, outliers, parsimony, relevance, and ease of measurement of predictors. A final test of any model is that it makes practical sense and the client is willing to buy it.
59 Regression Diagnostics (Step 6)
- Graphical analysis of residuals:
  - Plot the residuals vs. the x_i values.
  - A residual is the difference between the actual y_i and the predicted ŷ_i.
  - Plot a histogram or stem-and-leaf display of the residuals.
- Purposes:
  - Examine the functional form (linearity).
  - Evaluate violations of assumptions.
60 Linear Regression Assumptions
- Mean of Probability Distribution of Error Is 0
- Probability Distribution of Error Has Constant
Variance - Probability Distribution of Error is Normal
- Errors Are Independent
61 Residual Plot for Functional Form (Linearity)
[Two panels: a curved residual pattern suggesting an x² term should be added, and a patternless scatter indicating correct specification.]
62 Residual Plot for Equal Variance
[Two panels: a fan-shaped pattern indicating unequal variance, and a constant-spread scatter indicating correct specification. Standardized residuals (residual divided by the standard error of prediction) are typically used.]
63 Residual Plot for Independence
[Two panels: a systematic pattern indicating residuals that are not independent, and a random scatter indicating correct specification.]
64 Data transformations
- Why do we need data transformations?
  - To make seemingly nonlinear models linear. For example, the model y = α e^{βx} becomes linear after taking logs: ln y = ln α + βx.
  - Sometimes a transformation gives a better explanation of the variation in the data.
65 - How do we do the data transformations?
- Power family of transformations on the response: the Box-Cox method.
- Requirements:
  - All the data must be positive (Y > 0).
  - The ratio of the largest observed Y to the smallest is at least 10.
66 - Transformation form:
  V(λ) = (Y^λ − 1)/(λ Ẏ^{λ−1}) for λ ≠ 0, and V(0) = Ẏ ln Y,
  where Ẏ = (y1 y2 … yn)^{1/n} is the geometric mean of the observations.
67 - How to estimate λ:
- 1. Choose values of λ from a selected range. Usually we look in the range (−1, 1), covering it with about 11-21 values of λ.
- 2. For each value, evaluate V(λ) by applying the formula above to each Y. This gives a vector V(λ); use it as the response to fit a linear model by the least squares method, and record the residual sum of squares S(λ) for the regression.
- 3. Plot S(λ) versus λ. Draw a smooth curve through the plotted points, and find at what value of λ the lowest point of the curve lies. That value, λ̂, is the maximum likelihood estimate of λ.
68 - Example
- The data in the table are part of a more extensive set given by Derringer (1974); the paper has been adapted with permission of John Wiley & Sons, Inc. We wish to find a transformation of the form V = Y^λ, or V = ln Y, which will provide a good first-order fit to the data. Our model form is E(V) = β0 + β1 f + β2 p, where f is the filler level and p is the plasticizer level.
69 Response Y by naphthenic oil level p and filler level f:

                            Filler, phr, f
Naphthenic Oil, phr, p      0     12    24    36    48    60
 0                          26    38    50    76   108   157
10                          17    26    37    53    83   124
20                          13    20    27    37    57    87
30                         ----   15    22    27    41    63
70 - Note that the response data range from 157 down to 13, a ratio of 157/13 = 12.1 > 10; hence a transformation on Y is likely to be effective. The geometric mean is Ẏ = 41.5461 for this set of data.
- The next table shows selected values of S(λ).
- We pick 20 different values of λ in [−1, 1] in this case.
71-1.0 -0.8 -0.6 -0.4 -0.2 -0.15 -0.10 -0.08 -0.06 -0.05
2456 1453 779.1 354.7 131.7 104.5 88.3 84.9 83.3 83.2
-0.04 -0.02 0.00 0.05 0.10 0.2 0.4 0.6 0.8 1.0
83.5 85.5 89.3 106.7 135.9 231.1 588.0 1222 2243 3821
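Reading λ̂ off this grid programmatically (a minimal sketch over the tabled values):

```python
# The tabled (lambda, S(lambda)) pairs from the Derringer example above.
s = {-1.0: 2456, -0.8: 1453, -0.6: 779.1, -0.4: 354.7, -0.2: 131.7,
     -0.15: 104.5, -0.10: 88.3, -0.08: 84.9, -0.06: 83.3, -0.05: 83.2,
     -0.04: 83.5, -0.02: 85.5, 0.00: 89.3, 0.05: 106.7, 0.10: 135.9,
     0.2: 231.1, 0.4: 588.0, 0.6: 1222, 0.8: 2243, 1.0: 3821}

lam_hat = min(s, key=s.get)  # lowest point of the tabled curve
print(lam_hat)               # -0.05
```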
72 - A smooth curve through these points is plotted in the next figure. We see that the minimum S(λ) occurs at about λ̂ = −0.05. This is close to zero, suggesting the log transformation V = Ẏ ln Y, or more simply V = ln Y.
73 - Applying the transformation to the original data, we get a set of data which are much closer to linearly related. The best plane, fitted to these transformed data by least squares, is
  ln Ŷ = 3.212 + 0.03088 f − 0.03152 p.
- The ANOVA table for this model is:

Source      Df    SS           MS         F
b0          1     319.44855    -----
b1, b2      2     10.51667     5.27583    2045
Residual    20    0.05171      0.00258
Total       23    330.05193
74 - If we had fitted a first-order model to the untransformed data, we would obtain
  Ŷ = 28.184 + 1.55 f − 1.717 p.
- The ANOVA table for this model:

Source              Df    SS          MS          F
b1, b2              2     27842.62    13921.31    72.9
Residual            20    3820.60     191.03
Total, corrected    22    31663.22
75 - We find that the transformed model has a much larger F value, i.e., a much stronger fit.
76 11.6.1-11.6.3 Topics in Regression Modeling
Yikang Chai, Tao Li
77 11.6.1 Multicollinearity
- Definition: the columns of the X matrix are exactly or approximately linearly dependent.
- It means some of the predictor variables are (nearly) linearly related to each other.
- Why are we concerned about it?
- Multicollinearity can cause serious numerical and statistical difficulties in fitting the regression model unless the redundant predictor variables are deleted.
78 How does multicollinearity cause difficulties?
- Multicollinearity leads to the following problems:
- X'X is nearly singular, which makes the β̂j numerically unstable. This is reflected in large changes in their magnitudes with small changes in the data.
- The matrix (X'X)⁻¹ has very large elements. Therefore the variances Var(β̂j) are large, which makes the β̂j statistically nonsignificant.
79 Measures of Multicollinearity
- Three ways:
- 1. The correlation matrix R. Easy, but it cannot reflect linear relationships among more than two variables.
- 2. The determinant of R can be used as a measure of the singularity of X'X.
- 3. Variance Inflation Factors (VIF): the diagonal elements of R⁻¹. Generally, VIF > 10 is regarded as unacceptable.
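For two predictors the VIF reduces to 1/(1 − r²), where r is their sample correlation; a pure-Python sketch on made-up data:

```python
# VIF for two predictors: VIF = 1 / (1 - r^2), where r = corr(x1, x2).
# The data are made up: x2 is x1 plus a tiny alternating perturbation,
# so the two columns are nearly collinear and the VIF should be large.

def corr(u, v):
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    suu = sum((a - ub) ** 2 for a in u)
    svv = sum((b - vb) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

x1 = [float(i) for i in range(1, 11)]
x2 = [a + (0.1 if i % 2 == 0 else -0.1) for i, a in enumerate(x1)]

r = corr(x1, x2)
vif = 1.0 / (1.0 - r ** 2)
print(round(r, 4), round(vif, 1))  # near-collinear columns: VIF far above 10
```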
80 11.6.2 Polynomial Regression
Consider the special case
  y = β0 + β1x + β2x² + … + βk x^k.
Problems:
- The powers of x, i.e., x, x², …, x^k, tend to be highly correlated.
- If k is large, the magnitudes of these powers tend to vary over a rather wide range.
These problems lead to numerical errors.
81 How to solve these problems?
Two ways:
- 1. Center the x-variable (replace x by x − x̄): removes the non-essential multicollinearity in the data.
- 2. Standardize the x-variable (replace x by (x − x̄)/s_x): alleviates the problem of x varying over a wide range.
82 11.6.3 Dummy Predictor Variables
- A method to deal with categorical variables:
1. For ordinal categorical variables, such as the prognosis of a patient (poor, average, good), just assign numerical scores to the categories (poor = 1, average = 2, good = 3).
2. If we have a nominal variable with c > 2 categories, use c − 1 indicator variables, x1, …, x_{c−1}, called dummy variables, to code it.
83 How to code?
Set x_i = 1 for the ith category (i = 1, …, c − 1) and x_1 = … = x_{c−1} = 0 for the cth category.
Why don't we just use c indicator variables x1, …, xc? Because there would be a linear dependency among them (they always sum to 1), and this would cause multicollinearity.
84 Example
- The season of a year can be coded with three indicators: x1 (winter), x2 (spring), x3 (summer). With this coding, (1,0,0) is Winter, (0,1,0) Spring, (0,0,1) Summer, and (0,0,0) Fall.
- Consider modeling the temperature of an area as a function of the season (coded by x1, x2, x3) and its latitude (A); we get the model
  Y = β0 + β1x1 + β2x2 + β3x3 + β4A + ε.
- For winter: E(Y) = (β0 + β1) + β4A
- For spring: E(Y) = (β0 + β2) + β4A
- For summer: E(Y) = (β0 + β3) + β4A
- For fall: E(Y) = β0 + β4A
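The season coding above can be sketched as a minimal pure-Python function:

```python
# Dummy (indicator) coding for a nominal variable with c = 4 categories:
# c - 1 = 3 indicators, with the last category ("fall") as the baseline
# coded all zeros.
SEASONS = ["winter", "spring", "summer"]  # fall is the reference category

def dummy_code(season):
    """Return the (x1, x2, x3) indicator coding for a season."""
    return [1 if season == s else 0 for s in SEASONS]

print(dummy_code("winter"))  # [1, 0, 0]
print(dummy_code("fall"))    # [0, 0, 0]
```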
85 Logistic Regression Model
- 1938, by R. A. Fisher and Frank Yates
- The logistic transform for analyzing binary data.
86 Logistic Regression Model
- The importance of the logistic regression model:
- The logistic regression model is the most popular model for binary data.
- It is generally used for binary response variables:
  Y = 1 (true, success, YES, etc.), while Y = 0 (false, failure, NO, etc.)
87 Logistic Regression Model
- Details of the regression model. Main steps:
- Consider a response variable Y = 0 or 1 and a single predictor variable x.
- Model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of x:
  log[ P(Y=1|x) / (1 − P(Y=1|x)) ] = β0 + β1x.
88 Logistic Regression Model
- Example: http://faculty.vassar.edu/lowry/logreg1.html

(i)   (ii)     (iii)    (iv)          (v)           (vi)      (vii)
      Y = 1    Y = 0    Total         Observed      Odds      Log Odds
X     count    count    (ii + iii)    Probability   Ratio     Ratio
28    2        4        6             .3333         .5000     -.6931
29    2        3        5             .4000         .6667     -.4055
30    7        2        9             .7778         3.5000    1.2528
31    7        2        9             .7778         3.5000    1.2528
32    16       4        20            .8000         4.0000    1.3863
33    14       1        15            .9333         14.0000   2.6391
89 Logistic Regression Model
- [Figure: A. an ordinary linear regression fit vs. B. a logistic regression fit to the observed probabilities.]
90 Logistic Regression Model
- Weighted linear regression of observed log odds ratios on X:

X     Observed p    Log Odds    Weight
28    0.3333        -.6931      6
29    0.4000        -.4055      5
30    0.7778        1.2528      9
31    0.7778        1.2528      9
32    0.8000        1.3863      20
33    0.9333        2.6391      15
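This weighted fit can be sketched in pure Python (weights are the group sizes; closed-form weighted least squares for a single predictor):

```python
import math

# Observed proportions and group sizes from the table above.
x = [28, 29, 30, 31, 32, 33]
p = [0.3333, 0.4000, 0.7778, 0.7778, 0.8000, 0.9333]
w = [6, 5, 9, 9, 20, 15]

logit = [math.log(pi / (1 - pi)) for pi in p]  # observed log odds

# Weighted least squares for the line logit = b0 + b1 * x.
sw = sum(w)
xw = sum(wi * xi for wi, xi in zip(w, x)) / sw          # weighted mean of x
lw = sum(wi * li for wi, li in zip(w, logit)) / sw      # weighted mean logit
b1 = (sum(wi * (xi - xw) * (li - lw) for wi, xi, li in zip(w, x, logit))
      / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
b0 = lw - b1 * xw
print(round(b0, 3), round(b1, 3))  # fitted slope is positive, as in the plot
```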
91 Logistic Regression Model
- Properties of the regression model:
- E(Y|x) = P(Y=1|x)·1 + P(Y=0|x)·0 = P(Y=1|x), and under the logistic model
  P(Y=1|x) = e^{β0+β1x} / (1 + e^{β0+β1x})
  is bounded between 0 and 1 for all values of x. This is not true if we use the linear model P(Y=1|x) = β0 + β1x.
- Unlike in ordinary regression, the coefficient β1 has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x.
- Extension to multiple predictor variables:
  log[ p / (1 − p) ] = β0 + β1x1 + … + βk xk.
92 Standardized Regression Coefficients
- Why do we need standardized regression coefficients?
- Recall the regression equation for the linear regression model:
  ŷ = β̂0 + β̂1x1 + … + β̂k xk.
- The magnitudes of the β̂j cannot be directly used to judge the relative effects of the x_j on y, because they depend on the units of measurement.
- By using standardized regression coefficients, we may be able to judge the importance of different predictors.
93 Standardized Regression Coefficients
- Standardized transform: replace each variable by its z-score,
  y*_i = (y_i − ȳ)/s_y and x*_ij = (x_ij − x̄_j)/s_{x_j}.
- Standardized regression coefficients:
  β̂*_j = β̂_j (s_{x_j} / s_y)  (j = 1, …, k).
94 Standardized Regression Coefficients
- Example (industrial sales data from the textbook): fit the linear model and compute the regression equation as in the text.
- Notice that one coefficient can be much smaller than another in raw magnitude and yet have the larger standardized coefficient, and hence the larger effect on y, once the spreads of the predictors are taken into account.
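A hypothetical numeric illustration (the numbers are made up, not the textbook's):

```python
# Standardized coefficients: beta*_j = beta_j * s_xj / s_y.
# Hypothetical fitted coefficients and sample standard deviations.
beta = {"x1": 1.3, "x2": 90.0}  # raw LS coefficients
s_x = {"x1": 10.0, "x2": 0.1}   # sample SDs of the predictors
s_y = 25.0                      # sample SD of the response

beta_star = {j: beta[j] * s_x[j] / s_y for j in beta}
print(beta_star)  # x1's standardized effect exceeds x2's despite beta1 << beta2
```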
95 Chapter Summary
- Multiple linear regression model
- Fitting the multiple regression model
  - Least squares fit
  - Goodness of fit of the model: SSE, SST, SSR, r²
- Statistical inference for multiple regression
  - 1. t-test for an individual βj
  - 2. F-test for H0: β1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j
  - 3. F-test for a subset: H0: βm+1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j in the subset
- How do we select variables (SAS)?
  - Stepwise regression: a fancy algorithm
  - Best subsets regression: more realistic, flexible
- What if the data are not linear?
  - Data transformation
- Building a multiple regression model: 7 steps
96 We greatly appreciate your attention :)
- Please feel free to ask questions.
97 The End