Title: Chapter 11 Multiple Linear Regression
1 Chapter 11 Multiple Linear Regression
2 Our Group Members
3 Content
- Multiple Regression Model --- Yifan Wang
- Statistical Inference --- Shaonan Zhang, Yicheng Li
- Variable Selection Methods (SAS) --- Guangtao Li, Ruixue Wang
- Strategy for Building a Model and Data Transformation --- Xiaoyu Zhang, Siyuan Luo
- Topics in Regression Modeling --- Yikang Chai, Tao Li
- Summary --- Xing Chen
4 Ch 11.1-11.3 Introduction to Multiple Linear Regression
- Yifan Wang
- Dec. 6th, 2007
5 - In Chapter 10 we studied how to fit a linear relationship between a response variable y and a single predictor variable x.
- Sometimes, however, a problem cannot be handled by simple linear regression, because there are two or more predictor variables.
- For example, the salary of a company employee may depend on
  - job category
  - years of experience
  - education
  - performance evaluations
6 What do we need to do?
- Extend the simple linear regression model to the case of two or more predictor variables.
- Multiple Linear Regression (or simply Multiple Regression) is the statistical methodology used to fit such models.
7 Multiple Linear Regression
- In multiple regression we fit a model of the form (excluding the error term)
  y = β0 + β1x1 + β2x2 + … + βk xk,
  where x1, x2, …, xk are predictor variables and β0, β1, …, βk are k+1 unknown parameters. The model is linear in the parameters.
- For example, this model includes the kth-degree polynomial model in a single variable x, namely
  y = β0 + β1x + β2x² + … + βk x^k,
  since we can put x1 = x, x2 = x², …, xk = x^k.
8 11.1 A Probabilistic Model for Multiple Linear Regression
- Regard the response variable as random.
- Regard the predictor variables as nonrandom.
- The data for multiple regression consist of n vectors of observations (x_i1, x_i2, …, x_ik, y_i) for i = 1, 2, …, n.
- Example 1
  - The response variable y_i: the salary of the ith person in the sample.
  - The predictor variables: x_i1 = his/her years of experience, x_i2 = his/her years of education.
9 Example 2
y_i is the observed value of the r.v. Y_i, which depends on the fixed predictor values x_i1, x_i2, …, x_ik according to the following model:
  Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i  (i = 1, 2, …, n),
where ε_i is a random error with E(ε_i) = 0 and Var(ε_i) = σ², and β0, β1, …, βk are unknown parameters. Assume the ε_i are independent N(0, σ²) random variables. Then the Y_i are independent N(μ_i, σ²) random variables with
  μ_i = β0 + β1 x_i1 + … + βk x_ik.
10 11.2 Fitting the Multiple Regression Model
11.2.1 Least Squares (LS) Fit
- The LS estimates of the unknown parameters β0, β1, …, βk minimize
  Q = Σ_{i=1}^n [y_i − (β0 + β1 x_i1 + … + βk x_ik)]².
- The LS estimates can be found by setting the first partial derivatives of Q with respect to β0, β1, …, βk equal to zero.
- The result is a set of (k+1) simultaneous linear equations in (k+1) unknowns. The resulting solutions, β̂0, β̂1, …, β̂k, are the least squares (LS) estimates of β0, β1, …, βk, respectively.
11 11.2.2 Goodness of Fit of the Model
- To assess the goodness of fit of the LS model, we use the residuals, defined by
  e_i = y_i − ŷ_i  (i = 1, 2, …, n),
  where the ŷ_i are the fitted values.
- An overall measure of the goodness of fit is the error sum of squares, SSE = Σ e_i².
- Compare it to the total sum of squares, SST = Σ (y_i − ȳ)².
- As in Chapter 10, define the regression sum of squares, given by SSR = SST − SSE.
12 - The coefficient of multiple determination:
  r² = SSR/SST = 1 − SSE/SST.
- 0 ≤ r² ≤ 1, and values closer to 1 represent better fits.
- Adding predictor variables generally increases r²; thus r² can be made to approach 1 simply by increasing the number of predictors.
- The multiple correlation coefficient r is the positive square root of r²:
  - Only the positive square root is used.
  - r is a measure of the strength of the association between the predictor variables and the one response variable.
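As a numeric sketch (Python rather than the deck's SAS), r² can be computed directly from the sums of squares. The numbers come from the Step-1 SAS stepwise output shown later in this deck:

```python
# Sketch: r^2 from the sums of squares, using the Step-1 numbers printed
# by the SAS stepwise example later in this deck.
ssr = 20667.0    # regression sum of squares (Model SS)
sse = 35797.0    # error sum of squares
sst = ssr + sse  # total sum of squares (SAS prints Corrected Total 56465)

r_squared = ssr / sst     # coefficient of multiple determination
r = r_squared ** 0.5      # multiple correlation coefficient (positive root)

print(round(r_squared, 4))  # SAS reports R-Square = 0.3660
```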
13 11.3 Multiple Regression Model in Matrix Notation
- The multiple regression model can be presented in a compact form by using matrix notation. Let
  y = (y1, …, yn)', Y = (Y1, …, Yn)', and ε = (ε1, …, εn)'
  be the n × 1 vectors of the observed values, the r.v.'s, and the random errors, respectively. Next let X be the n × (k+1) matrix of the values of the predictor variables, whose ith row is (1, x_i1, …, x_ik).
14 Finally, let
  β = (β0, β1, …, βk)' and β̂ = (β̂0, β̂1, …, β̂k)'
- be the (k+1) × 1 vectors of the unknown parameters and their LS estimates, respectively.
- The model can be rewritten as Y = Xβ + ε.
- The simultaneous linear equations whose solutions yield the LS estimates can be written in matrix notation as the normal equations X'Xβ̂ = X'y.
- If the inverse of the matrix X'X exists, then the solution is given by β̂ = (X'X)⁻¹X'y.
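The matrix solution can be sketched in Python using only the standard library (a small Gaussian-elimination solver stands in for a linear-algebra package). The toy data are made up for illustration: y is exactly 1 + 2·x1 + 3·x2, so the normal equations should recover those coefficients:

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with
    partial pivoting (a is a list of rows, b a list)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Made-up data with an exact linear relationship y = 1 + 2*x1 + 3*x2.
x1 = [0.0, 1.0, 0.0, 1.0, 2.0]
x2 = [0.0, 0.0, 1.0, 1.0, 1.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]  # n x (k+1) design matrix
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(y))) for c in range(3)]
       for r in range(3)]                                          # X'X
Xty = [sum(X[i][r] * y[i] for i in range(len(y))) for r in range(3)]  # X'y
beta_hat = solve(XtX, Xty)  # normal equations: X'X beta_hat = X'y
print([round(b, 6) for b in beta_hat])
```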
15 11.4 Statistical Inference
16 Statistical Inference on β's ---- General Hypothesis Test
- Determining the statistical significance of predictor variables:
- For each predictor variable x_j, we test the hypotheses
  H0: βj = 0 vs. H1: βj ≠ 0.
- If we cannot reject H0, x_j can be dropped from the model.
17 Statistical Inference on β's ---- General Hypothesis Test
- Pivotal Quantity:
  T = (β̂j − βj) / SE(β̂j) ~ t_{n−(k+1)}.
- Recall that s² = SSE/[n − (k+1)] = MSE is an unbiased estimate of σ².
- n − (k+1) is the error degrees of freedom.
18 Statistical Inference on β's ---- General Hypothesis Test
- Confidence Interval for βj
- Note that P(−t_{n−(k+1),α/2} ≤ T ≤ t_{n−(k+1),α/2}) = 1 − α.
- So, the 100(1 − α)% CI is
  β̂j ± t_{n−(k+1),α/2} SE(β̂j),
  where SE(β̂j) = s√v_jj and v_jj is the jth diagonal entry of (X'X)⁻¹.
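A numeric sketch of the CI (pure arithmetic, in Python rather than SAS): the estimate and standard error for x1 come from the Step-2 SAS output shown later in this deck, and the critical value t_{23,.025} ≈ 2.069 is assumed from a t table:

```python
# 95% CI for beta_1 from the Step-2 SAS output shown later in this deck:
# estimate 1.76486, standard error 0.37904, error df = 23.
b1, se1 = 1.76486, 0.37904
t_crit = 2.069  # t_{23, .025} from a t table (assumed here)

lo, hi = b1 - t_crit * se1, b1 + t_crit * se1
print((round(lo, 3), round(hi, 3)))  # CI excludes 0, so x1 is significant

# Consistency check: the squared t statistic equals the Type II F
# that SAS prints for x1 (21.68).
t_stat = b1 / se1
print(round(t_stat ** 2, 2))
```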
19 Statistical Inference on β's ---- General Hypothesis Test
- Hypothesis Test: H0: βj = βj0 vs. H1: βj ≠ βj0.
- In particular, when βj0 = 0, we reject H0 if
  |t| = |β̂j| / SE(β̂j) > t_{n−(k+1),α/2}.
- This controls P(Reject H0 | H0 is true) = α.
20 Statistical Inference on β's ---- Another Hypothesis Test
- Hypothesis: H0: β1 = β2 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j.
- Pivotal Quantity:
  F = MSR/MSE ~ F_{k, n−(k+1)} under H0.
- P-value: P(F_{k, n−(k+1)} > f), where f is the observed value of the statistic.
- If the P-value is less than α, we reject H0; we then use the previous test to examine the individual βj.
21 Statistical Inference on β's ---- Another Hypothesis Test
- ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)        F
Regression            SSR                   k                           MSR = SSR/k             F = MSR/MSE
Error                 SSE                   n − (k+1)                   MSE = SSE/[n − (k+1)]
Total                 SST                   n − 1
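The table can be filled in by hand; a Python sketch using the Step-2 sums of squares from the SAS stepwise example later in this deck (k = 2 predictors, n = 26 weeks):

```python
# Sketch: computing the ANOVA F by hand from the Step-2 SAS numbers
# (Model SS = 27663, Error SS = 28802, n = 26, k = 2).
n, k = 26, 2
ssr, sse = 27663.0, 28802.0

msr = ssr / k              # mean square for regression
mse = sse / (n - (k + 1))  # mean square error, df = 23
f = msr / mse

print(round(mse, 2), round(f, 2))  # SAS reports F Value 11.05
```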
22 Statistical Inference on β's ---- Test Subsets of Parameters
- Full Model: Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i (all k predictors).
- Partial Model: Y_i = β0 + β1 x_i1 + … + βm x_im + ε_i (the first m < k predictors).
- Hypothesis: H0: βm+1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j in {m+1, …, k}.
- Test statistic:
  F = [(SSE_m − SSE_k)/(k − m)] / {SSE_k/[n − (k+1)]} ~ F_{k−m, n−(k+1)} under H0.
- Reject H0 when F > f_{k−m, n−(k+1), α}.
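As a numeric sketch (Python arithmetic, numbers from the SAS stepwise example later in this deck): does adding x2 to a model that already contains x1 help?

```python
# Subset F test: partial model {x1} has SSE = 35797; the fuller model
# {x1, x2} has SSE = 28802 with error df = 23 (n = 26).
sse_partial, df_drop = 35797.0, 1  # one extra parameter is being tested
sse_full, df_full = 28802.0, 23    # error df of the fuller model

f = ((sse_partial - sse_full) / df_drop) / (sse_full / df_full)
print(round(f, 2))  # matches the Type II F for x2 (5.59) in the SAS output
```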
23 Prediction of Future Observations
- Let x* = (1, x*1, …, x*k)' and Y* = x*'β + ε, with mean μ* = E(Y*) = x*'β.
- Whether we want a CI (Confidence Interval) for μ* or a PI (Prediction Interval) for Y* itself,
- we have Ŷ* = x*'β̂ and Var(Ŷ*) = σ² x*'(X'X)⁻¹x*.
- Pivotal Quantity:
  T = (Ŷ* − μ*) / [s √(x*'(X'X)⁻¹x*)] ~ t_{n−(k+1)}.
- A (1 − α)-level CI to estimate μ*: Ŷ* ± t_{n−(k+1),α/2} s √(x*'(X'X)⁻¹x*).
- A (1 − α)-level PI to predict Y*: Ŷ* ± t_{n−(k+1),α/2} s √(1 + x*'(X'X)⁻¹x*).
24 11.7 Variable Selection Methods
Guangtao Li, RuiXue Wang
25 1. Why do we need variable selection methods?
- 2. Two methods are introduced:
  - Stepwise Regression
  - Best Subsets Regression
26 11.7.1 Stepwise Regression
27 Recall the test for subsets of parameters in 11.4:
- Full model: Y_i = β0 + β1 x_i1 + … + βk x_ik + ε_i  (i = 1, 2, …, n)
- Partial model: Y_i = β0 + β1 x_i1 + … + βm x_im + ε_i  (i = 1, 2, …, n)
- H0: βm+1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j in {m+1, …, k}.
28 - (p−1)-variable model: Y_i = β0 + β1 x_i1 + … + β_{p−1} x_{i,p−1} + ε_i
- p-variable model: Y_i = β0 + β1 x_i1 + … + β_p x_ip + ε_i
29 Partial F-test
  F_p = (SSE_{p−1} − SSE_p) / [SSE_p/(n − p − 1)] ~ F_{1, n−p−1} under H0: βp = 0.
30 Partial Correlation Coefficients
  r²_{yx_p | x_1, …, x_{p−1}} = (SSE_{p−1} − SSE_p) / SSE_{p−1}.
- We should add x_p to the regression equation only if F_p is large enough, i.e., only if βp is statistically significant.
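As a numeric sketch in Python (not SAS), using the SSE values from the stepwise example below (n = 26, p = 2, x2 entering after x1):

```python
# Partial F and partial correlation when x2 enters after x1.
n, p = 26, 2
sse_prev, sse_curr = 35797.0, 28802.0  # SSE with x1 only, then x1 + x2

partial_r2 = (sse_prev - sse_curr) / sse_prev
f_p = (sse_prev - sse_curr) / (sse_curr / (n - p - 1))

# The two are linked: F_p = partial_r2 / (1 - partial_r2) * (n - p - 1).
print(round(partial_r2, 4), round(f_p, 2))
```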
31 Stepwise Regression Algorithm
32 SAS Program for the Algorithm
- Example: The Director of Broadcasting Operations for a television station wants to study the issue of standby hours, i.e., hours in which unionized graphic artists at the station are paid but are not actually involved in any activity. We are trying to predict the total number of Standby Hours per Week (Y). Possible explanatory variables are Total Staff Present (X1), Remote Hours (X2), Dubner Hours (X3) and Total Labor Hours (X4). The results for 26 weeks are given below.
34 data test;
input y x1 x2 x3 x4;
datalines;
245 338 414 323 2001
177 333 598 340 2030
271 358 656 340 2226
211 372 631 352 2154
196 339 528 380 2078
135 289 409 339 2080
195 334 382 331 2073
118 293 399 311 1758
116 325 343 328 1624
147 311 338 353 1889
154 304 353 518 1988
146 312 289 440 2049
115 283 388 276 1796
35 - 161 307 402 207 1720
- 274 322 151 287 2056
- 245 335 228 290 1890
- 201 350 271 355 2187
- 183 339 440 300 2032
- 237 327 475 284 1856
- 175 328 347 337 2068
- 152 319 449 279 1813
- 188 325 336 244 1808
- 188 322 267 253 1834
- 197 317 235 272 1973
- 261 315 164 223 1839
- 232 331 270 272 1935
- run;
- proc reg data=test;
- model y = x1 x2 x3 x4 / selection=stepwise;
- run;
36 - Selected SAS Output

Stepwise Selection: Step 1
Variable x1 Entered: R-Square = 0.3660 and C(p) = 13.3215

Analysis of Variance
                           Sum of        Mean
Source            DF      Squares      Square    F Value    Pr > F
Model              1        20667       20667      13.86    0.0011
Error             24        35797  1491.55073
Corrected Total   25        56465

                Parameter     Standard
37 Stepwise Selection: Step 2
Variable x2 Entered: R-Square = 0.4899 and C(p) = 8.4193

Analysis of Variance
                           Sum of        Mean
Source            DF      Squares      Square    F Value    Pr > F
Model              2        27663       13831      11.05    0.0004
Error             23        28802  1252.26402
Corrected Total   25        56465

                Parameter     Standard
Variable         Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -330.67483    116.48022         10092       8.06    0.0093
x1               1.76486      0.37904         27149      21.68    0.0001
x2              -0.13897      0.05880    6995.14489       5.59    0.0269
38 SAS Output (cont.)
- All variables left in the model are significant at the 0.1500 level.
- No other variable met the 0.1500 significance level for entry into the model.

Summary of Stepwise Selection
      Variable   Variable   Number    Partial   Model
Step  Entered    Removed    Vars In   R-Square  R-Square   C(p)      F Value   Pr > F
1     x1                    1         0.3660    0.3660     13.3215   13.86     0.0011
2     x2                    2         0.1239    0.4899     8.4193    5.59      0.0269
40 11.7.2 Best Subsets Regression
41 11.7.2 Best Subsets Regression
- In practice there are often several almost equally good models, and the choice of the final model may depend on side considerations such as the number of variables, the ease of observing and/or controlling variables, etc. The best subsets regression algorithm permits determination of a specified number of best subsets of size p = 1, 2, …, k, from which the choice of the final model can be made by the investigator.
42 11.7.2 Best Subsets Regression
- Optimality Criteria
- rp²-Criterion: based on the coefficient of determination r²_p = 1 − SSE_p/SST of a p-variable model; since r²_p always increases with p, the adjusted r² is usually used for comparisons across model sizes.
43 Cp-Criterion (recommended for its ease of computation and its ability to judge the predictive power of a model)
- The sample estimator, Mallows' Cp-statistic, is given by
  Cp = SSE_p/σ̂² − [n − 2(p+1)],
  where σ̂² is the MSE of the full model with all k predictors.
- Cp is an almost unbiased estimator of the model's standardized total expected prediction error; models with small Cp (close to p + 1) are preferred.
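A Python sketch of the Cp computation for the one-variable subset {x1} in the example below. The full-model MSE σ̂² is not printed in the SAS excerpt shown, so the value used here (about 1013.5) is backed out from the C(p) values SAS reports and is an assumption of this sketch:

```python
# Mallows' Cp for the subset {x1}: Cp = SSE_p / sigma2_full - (n - 2(p+1)).
n = 26
sse_p, p = 35797.0, 1  # SSE and size of the candidate subset {x1}
sigma2_full = 1013.5   # assumed full-model MSE (backed out of SAS's C(p))

cp = sse_p / sigma2_full - (n - 2 * (p + 1))
print(round(cp, 2))  # SAS reports C(p) = 13.3215 for x1
```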
44 PRESSp Criterion
- The total prediction error sum of squares is
  PRESSp = Σ_{i=1}^n (y_i − ŷ_{(i)})²,
  where ŷ_{(i)} is the predicted value for the ith observation from the model fitted without it.
- This criterion evaluates the predictive ability of a postulated model by omitting one observation at a time, fitting the model based on the remaining observations, and computing the predicted value for the omitted observation. The PRESSp criterion is intuitively easier to grasp than the Cp-criterion, but it is computationally much more intensive and is not available in many packages.
45 SAS Program
- data test;
- input y x1 x2 x3 x4;
- datalines;
- 245 338 414 323 2001
- 177 333 598 340 2030
- 271 358 656 340 2226
- 211 372 631 352 2154
- 196 339 528 380 2078
- 135 289 409 339 2080
- 195 334 382 331 2073
- 118 293 399 311 1758
- 116 325 343 328 1624
- 147 311 338 353 1889
- 154 304 353 518 1988
- 146 312 289 440 2049
- 115 283 388 276 1796
46 SAS Program (cont.)
- 161 307 402 207 1720
- 274 322 151 287 2056
- 245 335 228 290 1890
- 201 350 271 355 2187
- 183 339 440 300 2032
- 237 327 475 284 1856
- 175 328 347 337 2068
- 152 319 449 279 1813
- 188 325 336 244 1808
- 188 322 267 253 1834
- 197 317 235 272 1973
- 261 315 164 223 1839
- 232 331 270 272 1935
- run;
- proc reg data=test;
- model y = x1 x2 x3 x4 / selection=rsquare adjrsq cp mse;
- run;
47 Results

Number in              Adjusted
  Model    R-Square    R-Square      C(p)           MSE    Variables in Model
      1      0.3660      0.3396   13.3215    1491.55073    x1
      1      0.1710      0.1365   24.1846    1950.27491    x4
      1      0.0597      0.0205   30.3884    2212.24598    x3
      1      0.0091      -.0322   33.2078    2331.30545    x2
----------------------------------------------------------------------
      2      0.4899      0.4456    8.4193    1252.26402    x1 x2
      2      0.4499      0.4021   10.6486    1350.49234    x1 x3
      2      0.4288      0.3791   11.8231    1402.24672    x3 x4
      2      0.3754      0.3211   14.7982    1533.34044    x1 x4
      2      0.2238      0.1563   23.2481    1905.67595    x2 x4
      2      0.0612      -.0205   32.3067    2304.83375    x2 x3
----------------------------------------------------------------------
      3      0.5378      0.4748    7.7517    1186.29444    x1 x3 x4
      3      0.5362      0.4729    7.8418    1190.44739    x1 x2 x3
      3      0.5092      0.4423    9.3449    1259.69053    x1 x2 x4
48 11.7.2 Best Subsets Regression SAS
- The source of the example is http://www.math.udel.edu/teaching/course_materials/m202_climent/Multiple%20Regression%20-%20Model%20Building.pdf
49 11.5, 11.8 Building A Multiple Regression Model
by SiYuan Luo, Xiaoyu Zhang
50 Introduction
- Building a multiple regression model consists of 7 steps.
- Though it is not necessary to follow each and every step in the exact sequence shown on the next slide, the general approach and major steps should be followed.
- Model building is an iterative process; it may take several cycles of the steps before arriving at the final model.
51 The 7 steps
1. Decide the type
2. Collect the data
3. Explore the data
4. Divide the data
5. Fit candidate models
6. Select and evaluate
7. Select the final model
52 Step 1: Decide the type
- Decide the type of model needed; the different types of models include:
  - Predictive: a model used to predict the response variable from a chosen set of predictor variables.
  - Theoretical: a model based on a theoretical relationship between a response variable and predictor variables.
  - Control: a model used to control a response variable by manipulating predictor variables.
  - Inferential: a model used to explore the strength of relationships between a response variable and individual predictor variables.
  - Data summary: a model used primarily as a device to summarize a large set of data by a single equation.
- Often a model can be used for multiple purposes.
- The type of model dictates the type of data needed.
53 Step 2: Collect the data
- Decide the variables (predictor and response) on which to collect data. The variables should be measured in a way appropriate to the type of subject under study.
- See Chapter 3 for the precautions necessary to obtain relevant, bias-free data.
54 Step 3: Explore the data
- The data should be examined for outliers, gross errors, missing values, etc. on a univariate basis using the techniques discussed in Chapter 4. Outliers cannot simply be omitted, because much useful information can be lost; see Chapter 10 for how to deal with outliers.
- Scatter plots should be made to study the bivariate relationships between the response variable and each of the predictors. They are useful in suggesting possible transformations to linearize the relationships.
55 Step 4: Divide the data
- Divide the data into training and test sets: only a subset of the data, the training set, should be used to fit the model (steps 5 and 6); the remainder, called the test set, should be used for cross-validation of the fitted model (step 7).
- The reason for using an independent data set to test the model is that if the same data are used for both fitting and testing, an overoptimistic estimate of the predictive ability of the fitted model is obtained.
- The split into the two sets should be done randomly.
56 Step 5: Fit candidate models
- Generally several equally good models can be identified using the training data set.
- By conducting several stepwise runs with different FIN and FOUT values, we can identify several models that fit the training set well.
57 Step 6: Select and evaluate
- From the list of candidate models we are now ready to select two or three good models based on criteria such as the Cp-statistic, the number of predictors (p), and the nature of the predictors.
- These selected models should be checked for violation of the model assumptions using standard diagnostic techniques, in particular residual plots. Transformations of the response variable or some of the predictor variables may be necessary to improve model fit.
58 Step 7: Select the final model
- This is the step where we compare the competing models by cross-validating them against the test data.
- The model with the smaller cross-validation SSE is the better predictive model.
- The final selection of the model is based on a number of considerations, both statistical and nonstatistical. These include residual plots, outliers, parsimony, relevance, and ease of measurement of predictors. A final test of any model is that it makes practical sense and the client is willing to buy it.
59 Regression Diagnostics (Step 6)
- Graphical analysis of residuals:
  - Plot the residuals vs. the x_i values.
  - A residual is the difference between the actual y_i and the predicted ŷ_i.
  - Plot a histogram or stem-and-leaf display of the residuals.
- Purposes:
  - Examine the functional form (linearity).
  - Evaluate violations of assumptions.
60 Linear Regression Assumptions
- Mean of Probability Distribution of Error Is 0
- Probability Distribution of Error Has Constant
Variance - Probability Distribution of Error is Normal
- Errors Are Independent
61 Residual Plot for Functional Form (Linearity)
[Two panels: a curved residual pattern suggesting an x² term should be added, and a patternless scatter indicating correct specification.]
62 Residual Plot for Equal Variance
[Two panels: a fan-shaped pattern indicating unequal variance, and a constant-spread scatter indicating correct specification. Standardized residuals (residual divided by the standard error of prediction) are typically used.]
63 Residual Plot for Independence
[Two panels: a systematic pattern indicating residuals that are not independent, and a random scatter indicating correct specification.]
64 Data transformations
- Why do we need data transformations?
  - To make seemingly nonlinear models linear. For example, the model y = α e^{βx} becomes linear after taking logs: ln y = ln α + βx.
  - Sometimes a transformation gives a better explanation of the variation in the data.
65 - How do we do the data transformations?
- Power family of transformations on the response: the Box-Cox method.
- Requirements:
  - All the data must be positive (Y > 0).
  - The ratio of the largest observed Y to the smallest is at least 10.
66 - Transformation form:
  V(λ) = (Y^λ − 1)/(λ Ẏ^{λ−1}) for λ ≠ 0, and V(0) = Ẏ ln Y,
  where Ẏ = (y1 y2 … yn)^{1/n} is the geometric mean of the observations.
67 - How to estimate λ:
- 1. Choose values of λ from a selected range. Usually we look in the range (−1, 1), covering it with about 11-21 values of λ.
- 2. For each value, evaluate V(λ) by applying the formula above to each Y. This gives a vector V(λ); use it as the response to fit a linear model by the least squares method, and record the residual sum of squares S(λ) for the regression.
- 3. Plot S(λ) versus λ. Draw a smooth curve through the plotted points, and find at what value of λ the lowest point of the curve lies. That value, λ̂, is the maximum likelihood estimate of λ.
68 - Example
- The data in the table are part of a more extensive set given by Derringer (1974); the paper has been adapted with permission of John Wiley & Sons, Inc. We wish to find a transformation of the form V = Y^λ, or V = ln Y, which will provide a good first-order fit to the data. Our model form is E(V) = β0 + β1 f + β2 p, where f is the filler level and p is the plasticizer level.
69 Response Y by naphthenic oil level p and filler level f:

                            Filler, phr, f
Naphthenic Oil, phr, p      0     12    24    36    48    60
 0                          26    38    50    76   108   157
10                          17    26    37    53    83   124
20                          13    20    27    37    57    87
30                         ----   15    22    27    41    63
70 - Note that the response data range from 157 down to 13, a ratio of 157/13 = 12.1 > 10; hence a transformation on Y is likely to be effective. The geometric mean is Ẏ = 41.5461 for this set of data.
- The next table shows selected values of S(λ).
- We pick 20 different values of λ in [−1, 1] in this case.
71-1.0 -0.8 -0.6 -0.4 -0.2 -0.15 -0.10 -0.08 -0.06 -0.05
2456 1453 779.1 354.7 131.7 104.5 88.3 84.9 83.3 83.2
-0.04 -0.02 0.00 0.05 0.10 0.2 0.4 0.6 0.8 1.0
83.5 85.5 89.3 106.7 135.9 231.1 588.0 1222 2243 3821
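Reading λ̂ off this grid programmatically (a minimal sketch over the tabled values):

```python
# The tabled (lambda, S(lambda)) pairs from the Derringer example above.
s = {-1.0: 2456, -0.8: 1453, -0.6: 779.1, -0.4: 354.7, -0.2: 131.7,
     -0.15: 104.5, -0.10: 88.3, -0.08: 84.9, -0.06: 83.3, -0.05: 83.2,
     -0.04: 83.5, -0.02: 85.5, 0.00: 89.3, 0.05: 106.7, 0.10: 135.9,
     0.2: 231.1, 0.4: 588.0, 0.6: 1222, 0.8: 2243, 1.0: 3821}

lam_hat = min(s, key=s.get)  # lowest point of the tabled curve
print(lam_hat)               # -0.05
```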
72 - A smooth curve through these points is plotted in the next figure. We see that the minimum S(λ) occurs at about λ̂ = −0.05. This is close to zero, suggesting the log transformation V = Ẏ ln Y, or more simply V = ln Y.
73 - Applying the transformation to the original data, we get a set of data which are much closer to linearly related. The best plane, fitted to these transformed data by least squares, is
  ln Ŷ = 3.212 + 0.03088 f − 0.03152 p.
- The ANOVA table for this model is:

Source      Df    SS           MS         F
b0          1     319.44855    -----
b1, b2      2     10.51667     5.27583    2045
Residual    20    0.05171      0.00258
Total       23    330.05193
74 - If we had fitted a first-order model to the untransformed data, we would obtain
  Ŷ = 28.184 + 1.55 f − 1.717 p.
- The ANOVA table for this model:

Source              Df    SS          MS          F
b1, b2              2     27842.62    13921.31    72.9
Residual            20    3820.60     191.03
Total, corrected    22    31663.22
75 - We find that the transformed model has a much larger F value, i.e., a much stronger fit.
76 11.6.1-11.6.3 Topics in Regression Modeling
Yikang Chai, Tao Li
77 11.6.1 Multicollinearity
- Definition: the columns of the X matrix are exactly or approximately linearly dependent.
- It means some of the predictor variables are (nearly) linearly related to each other.
- Why are we concerned about it?
- Multicollinearity can cause serious numerical and statistical difficulties in fitting the regression model unless the redundant predictor variables are deleted.
78 How does multicollinearity cause difficulties?
- Multicollinearity leads to the following problems:
- X'X is nearly singular, which makes the β̂j numerically unstable. This is reflected in large changes in their magnitudes with small changes in the data.
- The matrix (X'X)⁻¹ has very large elements. Therefore the variances Var(β̂j) are large, which makes the β̂j statistically nonsignificant.
79 Measures of Multicollinearity
- Three ways:
- 1. The correlation matrix R. Easy, but it cannot reflect linear relationships among more than two variables.
- 2. The determinant of R can be used as a measure of the singularity of X'X.
- 3. Variance Inflation Factors (VIF): the diagonal elements of R⁻¹. Generally, VIF > 10 is regarded as unacceptable.
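For two predictors the VIF reduces to 1/(1 − r²), where r is their sample correlation; a pure-Python sketch on made-up data:

```python
# VIF for two predictors: VIF = 1 / (1 - r^2), where r = corr(x1, x2).
# The data are made up: x2 is x1 plus a tiny alternating perturbation,
# so the two columns are nearly collinear and the VIF should be large.

def corr(u, v):
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    suu = sum((a - ub) ** 2 for a in u)
    svv = sum((b - vb) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

x1 = [float(i) for i in range(1, 11)]
x2 = [a + (0.1 if i % 2 == 0 else -0.1) for i, a in enumerate(x1)]

r = corr(x1, x2)
vif = 1.0 / (1.0 - r ** 2)
print(round(r, 4), round(vif, 1))  # near-collinear columns: VIF far above 10
```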
80 11.6.2 Polynomial Regression
Consider the special case
  y = β0 + β1x + β2x² + … + βk x^k.
Problems:
- The powers of x, i.e., x, x², …, x^k, tend to be highly correlated.
- If k is large, the magnitudes of these powers tend to vary over a rather wide range.
These problems lead to numerical errors.
81 How to solve these problems?
Two ways:
- 1. Center the x-variable (replace x by x − x̄): removes the non-essential multicollinearity in the data.
- 2. Standardize the x-variable (replace x by (x − x̄)/s_x): alleviates the problem of x varying over a wide range.
82 11.6.3 Dummy Predictor Variables
- A method to deal with categorical variables:
1. For ordinal categorical variables, such as the prognosis of a patient (poor, average, good), just assign numerical scores to the categories (poor = 1, average = 2, good = 3).
2. If we have a nominal variable with c > 2 categories, use c − 1 indicator variables, x1, …, x_{c−1}, called dummy variables, to code it.
83 How to code?
Set x_i = 1 for the ith category (i = 1, …, c − 1) and x_1 = … = x_{c−1} = 0 for the cth category.
Why don't we just use c indicator variables x1, …, xc? Because there would be a linear dependency among them (they always sum to 1), and this would cause multicollinearity.
84 Example
- The season of a year can be coded with three indicators: x1 (winter), x2 (spring), x3 (summer). With this coding, (1,0,0) is Winter, (0,1,0) Spring, (0,0,1) Summer, and (0,0,0) Fall.
- Consider modeling the temperature of an area as a function of the season (coded by x1, x2, x3) and its latitude (A); we get the model
  Y = β0 + β1x1 + β2x2 + β3x3 + β4A + ε.
- For winter: E(Y) = (β0 + β1) + β4A
- For spring: E(Y) = (β0 + β2) + β4A
- For summer: E(Y) = (β0 + β3) + β4A
- For fall: E(Y) = β0 + β4A
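The season coding above can be sketched as a minimal pure-Python function:

```python
# Dummy (indicator) coding for a nominal variable with c = 4 categories:
# c - 1 = 3 indicators, with the last category ("fall") as the baseline
# coded all zeros.
SEASONS = ["winter", "spring", "summer"]  # fall is the reference category

def dummy_code(season):
    """Return the (x1, x2, x3) indicator coding for a season."""
    return [1 if season == s else 0 for s in SEASONS]

print(dummy_code("winter"))  # [1, 0, 0]
print(dummy_code("fall"))    # [0, 0, 0]
```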
85 Logistic Regression Model
- 1938, by R. A. Fisher and Frank Yates
- The logistic transform for analyzing binary data.
86 Logistic Regression Model
- The importance of the logistic regression model:
- The logistic regression model is the most popular model for binary data.
- It is generally used for binary response variables:
  Y = 1 (true, success, YES, etc.), while Y = 0 (false, failure, NO, etc.)
87 Logistic Regression Model
- Details of the regression model. Main steps:
- Consider a response variable Y = 0 or 1 and a single predictor variable x.
- Model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of x:
  log[ P(Y=1|x) / (1 − P(Y=1|x)) ] = β0 + β1x.
88 Logistic Regression Model
- Example: http://faculty.vassar.edu/lowry/logreg1.html

(i)   (ii)     (iii)    (iv)          (v)           (vi)      (vii)
      Y = 1    Y = 0    Total         Observed      Odds      Log Odds
X     count    count    (ii + iii)    Probability   Ratio     Ratio
28    2        4        6             .3333         .5000     -.6931
29    2        3        5             .4000         .6667     -.4055
30    7        2        9             .7778         3.5000    1.2528
31    7        2        9             .7778         3.5000    1.2528
32    16       4        20            .8000         4.0000    1.3863
33    14       1        15            .9333         14.0000   2.6391
89 Logistic Regression Model
- [Figure: A. an ordinary linear regression fit vs. B. a logistic regression fit to the observed probabilities.]
90 Logistic Regression Model
- Weighted linear regression of observed log odds ratios on X:

X     Observed p    Log Odds    Weight
28    0.3333        -.6931      6
29    0.4000        -.4055      5
30    0.7778        1.2528      9
31    0.7778        1.2528      9
32    0.8000        1.3863      20
33    0.9333        2.6391      15
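This weighted fit can be sketched in pure Python (weights are the group sizes; closed-form weighted least squares for a single predictor):

```python
import math

# Observed proportions and group sizes from the table above.
x = [28, 29, 30, 31, 32, 33]
p = [0.3333, 0.4000, 0.7778, 0.7778, 0.8000, 0.9333]
w = [6, 5, 9, 9, 20, 15]

logit = [math.log(pi / (1 - pi)) for pi in p]  # observed log odds

# Weighted least squares for the line logit = b0 + b1 * x.
sw = sum(w)
xw = sum(wi * xi for wi, xi in zip(w, x)) / sw          # weighted mean of x
lw = sum(wi * li for wi, li in zip(w, logit)) / sw      # weighted mean logit
b1 = (sum(wi * (xi - xw) * (li - lw) for wi, xi, li in zip(w, x, logit))
      / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
b0 = lw - b1 * xw
print(round(b0, 3), round(b1, 3))  # fitted slope is positive, as in the plot
```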
91 Logistic Regression Model
- Properties of the regression model:
- E(Y|x) = P(Y=1|x)·1 + P(Y=0|x)·0 = P(Y=1|x), and under the logistic model
  P(Y=1|x) = e^{β0+β1x} / (1 + e^{β0+β1x})
  is bounded between 0 and 1 for all values of x. This is not true if we use the linear model P(Y=1|x) = β0 + β1x.
- Unlike in ordinary regression, the coefficient β1 has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x.
- Extension to multiple predictor variables:
  log[ p / (1 − p) ] = β0 + β1x1 + … + βk xk.
92 Standardized Regression Coefficients
- Why do we need standardized regression coefficients?
- Recall the regression equation for the linear regression model:
  ŷ = β̂0 + β̂1x1 + … + β̂k xk.
- The magnitudes of the β̂j cannot be directly used to judge the relative effects of the x_j on y, because they depend on the units of measurement.
- By using standardized regression coefficients, we may be able to judge the importance of different predictors.
93 Standardized Regression Coefficients
- Standardized transform: replace each variable by its z-score,
  y*_i = (y_i − ȳ)/s_y and x*_ij = (x_ij − x̄_j)/s_{x_j}.
- Standardized regression coefficients:
  β̂*_j = β̂_j (s_{x_j} / s_y)  (j = 1, …, k).
94 Standardized Regression Coefficients
- Example (industrial sales data from the textbook): fit the linear model and compute the regression equation as in the text.
- Notice that one coefficient can be much smaller than another in raw magnitude and yet have the larger standardized coefficient, and hence the larger effect on y, once the spreads of the predictors are taken into account.
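A hypothetical numeric illustration (the numbers are made up, not the textbook's):

```python
# Standardized coefficients: beta*_j = beta_j * s_xj / s_y.
# Hypothetical fitted coefficients and sample standard deviations.
beta = {"x1": 1.3, "x2": 90.0}  # raw LS coefficients
s_x = {"x1": 10.0, "x2": 0.1}   # sample SDs of the predictors
s_y = 25.0                      # sample SD of the response

beta_star = {j: beta[j] * s_x[j] / s_y for j in beta}
print(beta_star)  # x1's standardized effect exceeds x2's despite beta1 << beta2
```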
95 Chapter Summary
- Multiple linear regression model
- Fitting the multiple regression model
  - Least squares fit
  - Goodness of fit of the model: SSE, SST, SSR, r²
- Statistical inference for multiple regression
  - 1. t-test for an individual βj
  - 2. F-test for H0: β1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j
  - 3. F-test for a subset: H0: βm+1 = … = βk = 0 vs. H1: βj ≠ 0 for at least one j in the subset
- How do we select variables (SAS)?
  - Stepwise regression: a fancy algorithm
  - Best subsets regression: more realistic, flexible
- What if the data are not linear?
  - Data transformation
- Building a multiple regression model: 7 steps
96 We greatly appreciate your attention :)
- Please feel free to ask questions.
97 The End