Multiple Regression

Transcript and Presenter's Notes
1
Chapter 8
  • Multiple Regression

2
Introduction
  • The methods of simple linear regression, discussed in Chapter 7, apply when we wish to fit a linear model relating the value of a dependent variable y to the value of a single independent variable x.
  • There are many situations in which a single independent variable is not enough.
  • In situations like this, there are several independent variables, x1, x2, …, xp, that are related to a dependent variable y.

3
Section 8.1 The Multiple Regression Model
  • Assume that we have a sample of n items and that on each item we have measured a dependent variable y and p independent variables, x1, x2, …, xp.
  • The ith sampled item gives rise to the ordered set (yi, x1i, …, xpi).
  • We can then fit the multiple regression model yi = β0 + β1x1i + … + βpxpi + εi.

4
Various Multiple Linear Regression Models
  • Polynomial regression model (the independent
    variables are all powers of a single variable)
  • Quadratic model (a polynomial regression model of degree 2, possibly involving powers and products of several variables)
  • A variable that is the product of two other variables is called an interaction.
  • These models are considered linear models, even though they contain nonlinear terms in the independent variables. The reason is that they are linear in the coefficients βi. (A sketch of constructing such terms appears below.)
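  • Below is a minimal sketch, using simulated values for two hypothetical predictors x1 and x2, of how quadratic and interaction terms are formed as columns of a design matrix; the model stays linear in the coefficients even though the columns are nonlinear in x1 and x2.

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 10, size=50)   # hypothetical predictor values
    x2 = rng.uniform(0, 10, size=50)

    # Columns for y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + error:
    # the squared terms give a quadratic model, and x1*x2 is an interaction term.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
    print(X.shape)  # (50, 6): one column per coefficient, including the intercept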

5
Estimating the Coefficients
  • In any multiple regression model, the estimates β̂0, β̂1, …, β̂p are computed by least squares, just as in simple linear regression. The equation ŷ = β̂0 + β̂1x1 + … + β̂pxp is called the least-squares equation or fitted regression equation.
  • Now define ŷi to be the y coordinate of the least-squares equation corresponding to the x values (x1i, …, xpi).
  • The residuals are the quantities ei = yi - ŷi, which are the differences between the observed y values and the y values given by the equation.
  • We want to compute β̂0, β̂1, …, β̂p so as to minimize the sum of the squared residuals. This is complicated, and we rely on computers to calculate them (see the sketch below).
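  • A minimal sketch of the least-squares computation on simulated data (not the example from the text); the coefficients are obtained with numpy's least-squares solver rather than by hand.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 40
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])  # intercept, x1, x2
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)               # simulated response

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates
    y_hat = X @ beta_hat                              # fitted values
    residuals = y - y_hat                             # e_i = y_i - y_hat_i
    print("estimates:", beta_hat)
    print("sum of squared residuals:", np.sum(residuals**2))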

6
Sums of Squares
  • Much of the analysis in multiple regression is
    based on three fundamental quantities.
  • They are regression sum of squares(SSR), the
    error sum of squares(SSE), and the total sum of
    squares(SST).
  • We defined these quantities in Chapter 7 and they
    hold here as well.
  • The analysis of variance identity is SST = SSR + SSE; a short computation of these quantities is sketched below.
  • The assumptions on the errors in Chapter 7 are
    also used here.
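  • A minimal sketch, again on simulated data, of the three sums of squares and the identity SST = SSR + SSE.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 40
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    SSE = np.sum((y - y_hat) ** 2)         # error sum of squares
    SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
    print(SST, SSR + SSE)                  # equal up to rounding error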

7
Assumptions of the Error Terms
  • Assumptions for Errors in Linear Models
  • In the simplest situation, the following
    assumptions are satisfied
  • The errors ε1, …, εn are random and independent. In particular, the magnitude of any error εi does not influence the value of the next error εi+1.
  • The errors ε1, …, εn all have mean 0.
  • The errors ε1, …, εn all have the same variance, which we denote by σ2.
  • The errors ε1, …, εn are normally distributed.

8
Mean and Variance of yi
  • In the multiple linear regression model yi = β0 + β1x1i + … + βpxpi + εi, under assumptions 1 through 4, the observations y1, …, yn are independent random variables that follow the normal distribution. The mean and variance of yi are given by μyi = β0 + β1x1i + … + βpxpi and σ2yi = σ2.
  • Each coefficient βi represents the change in the mean of y associated with an increase of one unit in the value of xi, when the other x variables are held constant.

9
Statistics
  • The three statistics most often used in multiple
    regression are the estimated error variance s2,
    the coefficient of determination R2, and the F
    statistic.
  • Since we are estimating p + 1 coefficients, the estimate of the error variance must be adjusted accordingly: s2 = SSE/(n - p - 1).
  • The estimated variance of each least-squares coefficient is a complicated calculation, and we rely on a computer to find them.
  • In simple linear regression, the coefficient of determination, r2, measures the goodness of fit of the linear model. The goodness-of-fit statistic in multiple regression, denoted by R2, is also called the coefficient of determination. The value of R2 is calculated in the same way as r2 in simple linear regression, that is, R2 = SSR/SST. (Both statistics are computed in the sketch below.)
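  • A minimal sketch, on simulated data, of the estimated error variance s2 and the coefficient of determination R2.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    SSE = np.sum((y - X @ beta_hat) ** 2)
    SST = np.sum((y - y.mean()) ** 2)

    s2 = SSE / (n - p - 1)     # p + 1 coefficients are estimated
    R2 = (SST - SSE) / SST     # equivalently SSR / SST
    print("s^2 =", s2, "R^2 =", R2)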

10
Tests of Hypothesis
  • In simple linear regression, a test of the null hypothesis β1 = 0 is almost always made. If this hypothesis is not rejected, then the linear model may not be useful.
  • The analogous test in multiple linear regression is of H0: β1 = β2 = … = βp = 0. This is a very strong hypothesis. It says that none of the independent variables has any linear relationship with the dependent variable.
  • The test statistic for this hypothesis is F = (SSR/p) / (SSE/(n - p - 1)).
  • This is an F statistic and its null distribution is Fp,n-p-1. Note that the denominator of the F statistic is s2. The subscripts p and n - p - 1 are the degrees of freedom for the F statistic.
  • Slightly different versions of the F statistic can be used to test milder null hypotheses. (A sketch of computing F and its P-value appears below.)
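  • A minimal sketch, on simulated data, of the overall F test; the P-value comes from the F distribution with p and n - p - 1 degrees of freedom (scipy is assumed to be available).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    SSE = np.sum((y - y_hat) ** 2)
    SSR = np.sum((y_hat - y.mean()) ** 2)

    F = (SSR / p) / (SSE / (n - p - 1))    # null distribution: F(p, n - p - 1)
    p_value = stats.f.sf(F, p, n - p - 1)  # upper-tail probability
    print("F =", F, "P-value =", p_value)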

11
Output
  • [Software output from pp. 562-563 of the text appears here, with some values highlighted.]

12
Interpreting Output
  • Much of the output is analogous to that of simple
    linear regression.
  • 1. The fitted regression equation is presented
    near the top of the output.
  • 2. Below that are the coefficient estimates and their estimated standard deviations.
  • 3. Next to each standard deviation is the Student's t statistic for testing the null hypothesis that the true value of the coefficient is equal to 0.
  • 4. The P-values for the tests are given in the
    next column.

13
Analysis of Variance Table
  • 5. The DF column gives the degrees of freedom. The degrees of freedom for regression is equal to the number of independent variables in the model. The degrees of freedom for residual error is the number of observations minus the number of parameters estimated. The total degrees of freedom is the sum of the degrees of freedom for regression and for error.
  • 6. The next column is SS. This column gives the sums of squares: the first is the regression sum of squares, SSR, the second is the error sum of squares, SSE, and the third is the total sum of squares, SST = SSR + SSE.
  • 7. The column MS gives the mean squares, which are the sums of squares divided by their respective degrees of freedom. Note that the mean square for error is equal to the variance estimate, s2.

14
More on the ANOVA Table
  • 8. The column labeled F presents the mean square for regression divided by the mean square for error.
  • 9. This is the F statistic that we discussed
    earlier that is used for testing the null
    hypothesis that none of the independent variables
    are related to the dependent variable.

15
Using the Output
  • From the output, we can use the fitted regression equation to predict y for future observations.
  • It is also possible to calculate the residual for an observed value of y.
  • Constructing confidence intervals for the coefficients of the independent variables is also possible from the output (see the sketch below).
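  • A minimal sketch, on simulated data, of prediction from the fitted equation and a 95% confidence interval for one coefficient, built from its estimate and estimated standard deviation.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # standard errors of the estimates

    x_new = np.array([1.0, 5.0, 3.0])                    # intercept, x1 = 5, x2 = 3
    print("predicted y:", x_new @ beta_hat)

    t_crit = stats.t.ppf(0.975, n - p - 1)               # 95% confidence level
    print("CI for the coefficient of x1:",
          beta_hat[1] - t_crit * se[1], beta_hat[1] + t_crit * se[1])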

16
Checking Assumptions
  • It is important in multiple linear regression to
    test the validity of the assumptions for errors
    in the linear model.
  • Check plots of residuals versus fitted values,
    normal probability plots of residuals, and plots
    of residuals versus the order in which the
    observations were made.
  • It is also a good idea to make plots of residuals versus each of the independent variables. If the residual plots indicate a violation of assumptions, transformations can be tried. (A plotting sketch appears below.)
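  • A minimal plotting sketch, on simulated data, of two of these diagnostics: residuals versus fitted values and a normal probability plot of the residuals (matplotlib and scipy assumed available).

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 40
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    resid = y - y_hat

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(y_hat, resid)                     # look for curvature or funnel shapes
    axes[0].axhline(0, color="gray")
    axes[0].set_xlabel("fitted value")
    axes[0].set_ylabel("residual")
    stats.probplot(resid, dist="norm", plot=axes[1])  # points should follow a straight line
    plt.tight_layout()
    plt.show()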

17
Section 8.2 Confounding and Collinearity
  • Fitting separate models to each variable is not the same as fitting the multiple regression model.
  • Consider the following example: There are 225 gas wells that received fracture treatment in order to increase production. In this treatment, fracture fluid, which consists of fluid mixed with sand, is pumped into the well. The sand holds open the cracks in the rock, thus increasing the flow of gas.
  • We can use sand to predict production, or fluid to predict production. If we fit separate simple models, then sand and fluid each show up as important predictors.

18
Example (cont.)
  • We might be tempted to conclude that increasing
    the volume of fluid or the volume of sand would
    increase production.
  • There is confounding in this situation. If we
    increase the volume of fluid, then we also
    increase the volume of sand.
  • If production depends only on the volume of sand, there will still be a relationship in the data between production and fluid, and vice versa.

19
Output
  • The following output shows the regression lines using just fluid or just sand in the model. The regression equation for each is given.
  • [Output from p. 575 of the text appears here.]

20
Output
  • This output is from the multiple linear regression.
  • The equation of the line uses both variables:
  • Production = -0.729 + 0.670 Fluid + 0.148 Sand
  • [Output from p. 576 of the text appears here.]

21
Solution
  • Multiple regression provides a way to resolve the
    issue.
  • Here we fit a model with sand and fluid. From
    this, we can determine which has an effect on
    production.
  • Whether we fit the multiple model or one of the simple models, the R2 is not particularly high in any case.
  • This indicates that there are other important factors affecting production that have not been included in the models.

22
Collinearity
  • When two independent variables are very strongly
    correlated, multiple regression may not be able
    to determine which is the important one.
  • In this case, the variables are said to be
    collinear.
  • The word collinear means to lie on the same line, and when two variables are highly correlated, their scatterplot is approximately a straight line.
  • The word multicollinearity is sometimes used as
    well, meaning that multiple variables are highly
    correlated with each other.
  • When collinearity is present, the set of independent variables is sometimes said to be ill-conditioned. (A small simulation illustrating the problem appears below.)
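  • A minimal simulation sketch of the problem: when x2 is nearly a copy of x1, the individual coefficient estimates have very large standard errors, so neither coefficient is well determined.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 40
    x1 = rng.uniform(0, 10, n)
    x2 = x1 + rng.normal(0, 0.05, n)       # x2 is almost identical to x1 (collinear)
    y = 3.0 + 2.0 * x1 + rng.normal(0, 1.0, n)

    print("correlation of x1 and x2:", np.corrcoef(x1, x2)[0, 1])

    X = np.column_stack([np.ones(n), x1, x2])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - 3)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    print("estimates:", beta_hat)
    print("standard errors:", se)          # inflated for the x1 and x2 coefficients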

23
Comments
  • Sometimes two variables are so correlated that
    multiple regression cannot determine which is
    responsible for the linear relationship with y.
  • In general, there is not much that can be done
    when variables are collinear.
  • The only way to fix the situation is to collect
    more data, including some values for the
    independent variables that are not on a straight
    line.

24
Section 8.3 Model Selection
  • There are many situations in which a large number
    of independent variables have been measured, and
    we need to decide which of them to include in the
    model.
  • This is the problem of model selection, and it is
    not an easy one.
  • Good model selection rests on a basic principle known as Occam's razor:
  • The best scientific model is the simplest model that explains the observed data.
  • In terms of linear models, Occam's razor implies the principle of parsimony:
  • A model should contain the smallest number of variables necessary to fit the data.

25
Some Exceptions
  • A linear model should always contain an
    intercept, unless physical theory dictates
    otherwise.
  • If a power xn of a variable is included in the model, all lower powers x, x2, …, xn-1 should be included as well, unless physical theory dictates otherwise.
  • If a product xy of two variables is included in a
    model, then the variables x and y should be
    included separately as well, unless physical
    theory dictates otherwise.

26
Notes
  • Models that include only the variables needed to
    fit the data are called parsimonious models.
  • Adding a variable to a model can substantially
    change the coefficients of the variables already
    in the model.

27
Can a Variable Be Dropped?
  • It often happens that one has formed a model that
    contains a large number of independent variables,
    and one wishes to determine whether a given
    subset of them may be dropped from the model
    without significantly reducing the accuracy of
    the model.
  • Assume that we know that the model yi = β0 + β1x1i + … + βkxki + βk+1x(k+1)i + … + βpxpi + εi is correct. We will call this the full model.
  • We wish to test the null hypothesis H0: βk+1 = … = βp = 0.
  • If H0 is true, the model will remain correct if we drop the variables xk+1, …, xp, so we can replace the full model with the following reduced model: yi = β0 + β1x1i + … + βkxki + εi.

28
Test Statistic
  • To develop a test statistic for H0, we begin by
    computing the error sums of squares for both the
    full and reduced models.
  • We call these SSfull and SSreduced, respectively.
  • The number of degrees of freedom for SSfull is n - p - 1, and for SSreduced it is n - k - 1.
  • The test statistic is f = [(SSreduced - SSfull)/(p - k)] / [SSfull/(n - p - 1)]; its null distribution is Fp-k,n-p-1.
  • If H0 is true, then f tends to be close to 1. If H0 is false, then f tends to be larger. (A sketch of this test appears below.)
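  • A minimal sketch, on simulated data in which two of three predictors are truly unrelated to y, of this F test for dropping a subset of variables (scipy assumed available for the P-value).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    n, p, k = 40, 3, 1                      # full model: 3 predictors; reduced model: 1
    X_full = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X_full @ [2.0, 1.5, 0.0, 0.0] + rng.normal(0, 1.0, n)

    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)

    SS_full = sse(X_full)
    SS_reduced = sse(X_full[:, : k + 1])    # intercept plus the first k predictors

    f = ((SS_reduced - SS_full) / (p - k)) / (SS_full / (n - p - 1))
    p_value = stats.f.sf(f, p - k, n - p - 1)
    print("f =", f, "P-value =", p_value)   # a large P-value supports dropping x2 and x3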

29
Comments
  • This method is very useful for developing
    parsimonious models by removing unnecessary
    variables. However, the conditions under which
    it is formally correct are rarely met.
  • More often, a large model is fit, some of the
    variables are seen to have fairly large P-values,
    and the F test is used to decide whether to drop
    them from the model.

30
Best Subsets Regression
  • Assume that there are p independent variables, x1, x2, …, xp, that are available to be put in the model.
  • Let's assume that we wish to find a good model that contains exactly four independent variables.
  • We can simply fit every possible model containing four of the variables, and rank them in order of their goodness of fit, as measured by the coefficient of determination, R2.
  • The subset of four variables that yields the largest value of R2 is the best subset of size four.
  • One can repeat the process for subsets of other sizes, finding the best subsets of size 1, 2, …, p.
  • These best subsets can be examined to see which provides a good fit, while being parsimonious. (A brute-force sketch of this search appears below.)
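  • A brute-force sketch, on simulated data with five candidate predictors, of best subsets regression: every subset of a given size is fit and ranked by R2.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(9)
    n, p = 60, 5
    Z = rng.uniform(0, 10, (n, p))                      # candidate predictors
    y = 2.0 + 1.5 * Z[:, 0] - 0.8 * Z[:, 2] + rng.normal(0, 1.0, n)

    def r_squared(cols):
        X = np.column_stack([np.ones(n), Z[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

    size = 2
    best = max(combinations(range(p), size), key=r_squared)
    print("best subset of size", size, ":", best, " R^2 =", r_squared(best))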

31
Output
  • In the Minitab output, there are several columns.
  • The column Vars tells how many variables are in the model.
  • The second column is R-Sq; this is the R2 that we just discussed. Here we would always pick the full model, since that gives the largest R2.
  • The third column is Adj. R-Sq; this is an adjusted R2. This is a better measure of association, since it takes into account the number of variables in the model. Note that adjusted R2 = R2 - [k/(n - k - 1)](1 - R2), where k is the number of variables in the model.
  • The value of k for which the value of adjusted R2 is a maximum can be used to determine the number of variables in the model, and the best subset of that size can be chosen as a model. (A small computation of adjusted R2 appears below.)
  • The fourth column is C-p (Mallows' Cp); it is another way to determine the best model.
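  • A short computation of adjusted R2, using hypothetical values of R2, n, and k for illustration.

    def adjusted_r2(r2, n, k):
        """Adjusted R^2 = R^2 - [k / (n - k - 1)] * (1 - R^2)."""
        return r2 - (k / (n - k - 1)) * (1 - r2)

    # Hypothetical example: R^2 = 0.85 with k = 4 variables and n = 30 observations
    print(adjusted_r2(0.85, n=30, k=4))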

32
Stepwise Regression
  • This is the most widely used model selection technique.
  • Its main advantage over best subsets regression is that it is less computationally intensive, so it can be used in situations where there are a very large number of candidate independent variables and too many possible subsets for every one of them to be examined.
  • The user chooses two threshold P-values, αin and αout, with αin < αout.
  • The stepwise regression procedure begins with a step called a forward selection step, in which the independent variable with the smallest P-value is selected, provided that P < αin.
  • This variable is entered into the model, creating a model with a single independent variable.

33
More on Stepwise Regression
  • In the next step, the remaining variables are examined one at a time as candidates for the second variable in the model. The one with the smallest P-value is added to the model, again provided that P < αin.
  • Now, it is possible that adding the second variable to the model increased the P-value of the first variable. In the next step, called a backward elimination step, the first variable is dropped from the model if its P-value has grown to exceed the value αout.
  • The algorithm continues by alternating forward selection steps with backward elimination steps.
  • The algorithm terminates when no variables meet the criteria for being added to or dropped from the model. (A sketch of the procedure appears below.)
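  • A minimal sketch, on simulated data, of the stepwise procedure described above; the P-values come from the usual t tests on the coefficients, and the thresholds αin and αout are chosen arbitrarily for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(10)
    n, p = 80, 5
    Z = rng.uniform(0, 10, (n, p))
    y = 2.0 + 1.5 * Z[:, 0] - 0.8 * Z[:, 2] + rng.normal(0, 1.0, n)

    def coef_p_values(cols):
        """Two-sided t-test P-values for the coefficients of the variables in cols."""
        X = np.column_stack([np.ones(n), Z[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        df = n - X.shape[1]
        s2 = np.sum((y - X @ beta) ** 2) / df
        se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
        return 2 * stats.t.sf(np.abs(beta / se)[1:], df)   # skip the intercept

    alpha_in, alpha_out = 0.05, 0.10
    model = []
    while True:
        changed = False
        # Forward selection: add the candidate with the smallest P-value if P < alpha_in
        candidates = [j for j in range(p) if j not in model]
        if candidates:
            pvals = {j: coef_p_values(model + [j])[-1] for j in candidates}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:
                model.append(best)
                changed = True
        # Backward elimination: drop a variable whose P-value has grown past alpha_out
        if model:
            pvals = coef_p_values(model)
            worst = int(np.argmax(pvals))
            if pvals[worst] > alpha_out:
                model.pop(worst)
                changed = True
        if not changed:
            break

    print("selected variables:", model)   # should recover columns 0 and 2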

34
Notes on Model Selection
  • When there is little or no physical theory to
    rely on, many different models will fit the data
    about equally well.
  • The methods for choosing a model involve
    statistics, whose values depend on the data.
    Therefore, if the experiment is repeated, these
    statistics will come out differently, and
    different models may appear to be best.
  • Some or all of the independent variables in a
    selected model may not really be related to the
    dependent variable. Whenever possible,
    experiments should be repeated to test these
    apparent relationships.
  • Model selection is an art, not a science.

35
Summary
  • In this chapter, we learned about
  • multiple regression models
  • estimating the coefficients
  • checking assumptions in multiple regression
  • confounding and collinearity
  • model selection