Multiple Regression

Transcript and Presenter's Notes
1
Chapter 8
  • Multiple Regression

2
Introduction
  • The methods of simple linear regression, discussed in Chapter 7, apply when we wish to fit a linear model relating the value of a dependent variable y to the value of a single independent variable x.
  • There are many situations in which a single independent variable is not enough.
  • In situations like this, there are several independent variables, x1, x2, …, xp, that are related to a dependent variable y.

3
Section 8.1 The Multiple Regression Model
  • Assume that we have a sample of n items and that on each item we have measured a dependent variable y and p independent variables, x1, x2, …, xp.
  • The ith sampled item gives rise to the ordered set (yi, x1i, …, xpi).
  • We can then fit the multiple regression model yi = β0 + β1x1i + … + βpxpi + εi.

4
Various Multiple Linear Regression Models
  • Polynomial regression model (the independent
    variables are all powers of a single variable)
  • Quadratic model (a polynomial regression model of degree 2, possibly involving powers and products of several variables)
  • A variable that is the product of two other variables is called an interaction.
  • These models are considered linear models, even though they contain nonlinear terms in the independent variables. The reason is that they are linear in the coefficients βi. (A sketch of constructing such terms appears below.)
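  • Below is a minimal sketch, using simulated values for two hypothetical predictors x1 and x2, of how quadratic and interaction terms are formed as columns of a design matrix; the model stays linear in the coefficients even though the columns are nonlinear in x1 and x2.

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 10, size=50)   # hypothetical predictor values
    x2 = rng.uniform(0, 10, size=50)

    # Columns for y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 + error:
    # the squared terms give a quadratic model, and x1*x2 is an interaction term.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
    print(X.shape)  # (50, 6): one column per coefficient, including the intercept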

5
Estimating the Coefficients
  • In any multiple regression model, the estimates β̂0, β̂1, …, β̂p are computed by least squares, just as in simple linear regression. The equation ŷ = β̂0 + β̂1x1 + … + β̂pxp is called the least-squares equation or fitted regression equation.
  • Now define ŷi to be the y coordinate of the least-squares equation corresponding to the x values (x1i, …, xpi).
  • The residuals are the quantities ei = yi - ŷi, which are the differences between the observed y values and the y values given by the equation.
  • We want to compute β̂0, β̂1, …, β̂p so as to minimize the sum of the squared residuals. This is complicated, and we rely on computers to calculate them (see the sketch below).
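  • A minimal sketch of the least-squares computation on simulated data (not the example from the text); the coefficients are obtained with numpy's least-squares solver rather than by hand.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 40
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])  # intercept, x1, x2
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)               # simulated response

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates
    y_hat = X @ beta_hat                              # fitted values
    residuals = y - y_hat                             # e_i = y_i - y_hat_i
    print("estimates:", beta_hat)
    print("sum of squared residuals:", np.sum(residuals**2))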

6
Sums of Squares
  • Much of the analysis in multiple regression is
    based on three fundamental quantities.
  • They are regression sum of squares(SSR), the
    error sum of squares(SSE), and the total sum of
    squares(SST).
  • We defined these quantities in Chapter 7 and they
    hold here as well.
  • The analysis of variance identity is SST = SSR + SSE; a short computation of these quantities is sketched below.
  • The assumptions on the errors in Chapter 7 are
    also used here.
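  • A minimal sketch, again on simulated data, of the three sums of squares and the identity SST = SSR + SSE.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 40
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    SSE = np.sum((y - y_hat) ** 2)         # error sum of squares
    SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
    print(SST, SSR + SSE)                  # equal up to rounding error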

7
Assumptions of the Error Terms
  • Assumptions for Errors in Linear Models
  • In the simplest situation, the following
    assumptions are satisfied
  • The errors ε1, …, εn are random and independent. In particular, the magnitude of any error εi does not influence the value of the next error εi+1.
  • The errors ε1, …, εn all have mean 0.
  • The errors ε1, …, εn all have the same variance, which we denote by σ2.
  • The errors ε1, …, εn are normally distributed.

8
Mean and Variance of yi
  • In the multiple linear regression model yi = β0 + β1x1i + … + βpxpi + εi, under assumptions 1 through 4, the observations y1, …, yn are independent random variables that follow the normal distribution. The mean and variance of yi are given by μyi = β0 + β1x1i + … + βpxpi and σ2yi = σ2.
  • Each coefficient βi represents the change in the mean of y associated with an increase of one unit in the value of xi, when the other x variables are held constant.

9
Statistics
  • The three statistics most often used in multiple
    regression are the estimated error variance s2,
    the coefficient of determination R2, and the F
    statistic.
  • Since we are estimating p + 1 coefficients, the estimate of the error variance must be adjusted accordingly: s2 = SSE/(n - p - 1).
  • The estimated variance of each least-squares coefficient is a complicated calculation, and we rely on a computer to find them.
  • In simple linear regression, the coefficient of determination, r2, measures the goodness of fit of the linear model. The goodness-of-fit statistic in multiple regression, denoted by R2, is also called the coefficient of determination. The value of R2 is calculated in the same way as r2 in simple linear regression, that is, R2 = SSR/SST. (Both statistics are computed in the sketch below.)
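  • A minimal sketch, on simulated data, of the estimated error variance s2 and the coefficient of determination R2.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    SSE = np.sum((y - X @ beta_hat) ** 2)
    SST = np.sum((y - y.mean()) ** 2)

    s2 = SSE / (n - p - 1)     # p + 1 coefficients are estimated
    R2 = (SST - SSE) / SST     # equivalently SSR / SST
    print("s^2 =", s2, "R^2 =", R2)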

10
Tests of Hypothesis
  • In simple linear regression, a test of the null hypothesis β1 = 0 is almost always made. If this hypothesis is not rejected, then the linear model may not be useful.
  • The analogous test in multiple linear regression is of H0: β1 = β2 = … = βp = 0. This is a very strong hypothesis. It says that none of the independent variables has any linear relationship with the dependent variable.
  • The test statistic for this hypothesis is F = (SSR/p) / (SSE/(n - p - 1)).
  • This is an F statistic and its null distribution is Fp,n-p-1. Note that the denominator of the F statistic is s2. The subscripts p and n - p - 1 are the degrees of freedom for the F statistic.
  • Slightly different versions of the F statistic can be used to test milder null hypotheses. (A sketch of computing F and its P-value appears below.)
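  • A minimal sketch, on simulated data, of the overall F test; the P-value comes from the F distribution with p and n - p - 1 degrees of freedom (scipy is assumed to be available).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    SSE = np.sum((y - y_hat) ** 2)
    SSR = np.sum((y_hat - y.mean()) ** 2)

    F = (SSR / p) / (SSE / (n - p - 1))    # null distribution: F(p, n - p - 1)
    p_value = stats.f.sf(F, p, n - p - 1)  # upper-tail probability
    print("F =", F, "P-value =", p_value)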

11
Output
  • [Software output from pp. 562-563 of the text appears here, with some values highlighted.]

12
Interpreting Output
  • Much of the output is analogous to that of simple
    linear regression.
  • 1. The fitted regression equation is presented
    near the top of the output.
  • 2. Below that are the coefficient estimates and their estimated standard deviations.
  • 3. Next to each standard deviation is the Student's t statistic for testing the null hypothesis that the true value of the coefficient is equal to 0.
  • 4. The P-values for the tests are given in the
    next column.

13
Analysis of Variance Table
  • 5. The DF column gives the degrees of freedom. The degrees of freedom for regression is equal to the number of independent variables in the model. The degrees of freedom for residual error is the number of observations minus the number of parameters estimated. The total degrees of freedom is the sum of the degrees of freedom for regression and for error.
  • 6. The next column is SS. This column gives the sums of squares: the first is the regression sum of squares, SSR, the second is the error sum of squares, SSE, and the third is the total sum of squares, SST = SSR + SSE.
  • 7. The column MS gives the mean squares, which are the sums of squares divided by their respective degrees of freedom. Note that the mean square for error is equal to the variance estimate, s2.

14
More on the ANOVA Table
  • 8. The column labeled F presents the mean square for regression divided by the mean square for error.
  • 9. This is the F statistic that we discussed
    earlier that is used for testing the null
    hypothesis that none of the independent variables
    are related to the dependent variable.

15
Using the Output
  • From the output, we can use the fitted regression equation to predict y for future observations.
  • It is also possible to calculate the residual for an observed value of y.
  • Constructing confidence intervals for the coefficients of the independent variables is also possible from the output (see the sketch below).
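  • A minimal sketch, on simulated data, of prediction from the fitted equation and a 95% confidence interval for one coefficient, built from its estimate and estimated standard deviation.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # standard errors of the estimates

    x_new = np.array([1.0, 5.0, 3.0])                    # intercept, x1 = 5, x2 = 3
    print("predicted y:", x_new @ beta_hat)

    t_crit = stats.t.ppf(0.975, n - p - 1)               # 95% confidence level
    print("CI for the coefficient of x1:",
          beta_hat[1] - t_crit * se[1], beta_hat[1] + t_crit * se[1])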

16
Checking Assumptions
  • It is important in multiple linear regression to
    test the validity of the assumptions for errors
    in the linear model.
  • Check plots of residuals versus fitted values,
    normal probability plots of residuals, and plots
    of residuals versus the order in which the
    observations were made.
  • It is also a good idea to make plots of residuals versus each of the independent variables. If the residual plots indicate a violation of assumptions, transformations can be tried. (A plotting sketch appears below.)
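  • A minimal plotting sketch, on simulated data, of two of these diagnostics: residuals versus fitted values and a normal probability plot of the residuals (matplotlib and scipy assumed available).

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(6)
    n = 40
    X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, 2))])
    y = X @ [2.0, 1.5, -0.8] + rng.normal(0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat
    resid = y - y_hat

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(y_hat, resid)                     # look for curvature or funnel shapes
    axes[0].axhline(0, color="gray")
    axes[0].set_xlabel("fitted value")
    axes[0].set_ylabel("residual")
    stats.probplot(resid, dist="norm", plot=axes[1])  # points should follow a straight line
    plt.tight_layout()
    plt.show()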

17
Section 8.2 Confounding and Collinearity
  • Fitting separate models to each variable is not the same as fitting the multiple regression model.
  • Consider the following example: There are 225 gas wells that received fracture treatment in order to increase production. In this treatment, fracture fluid, which consists of fluid mixed with sand, is pumped into the well. The sand holds open the cracks in the rock, thus increasing the flow of gas.
  • We can use sand to predict production, or fluid to predict production. If we fit separate simple models, then sand and fluid each show up as important predictors.

18
Example (cont.)
  • We might be tempted to conclude that increasing
    the volume of fluid or the volume of sand would
    increase production.
  • There is confounding in this situation. If we
    increase the volume of fluid, then we also
    increase the volume of sand.
  • If production depends only on the volume of sand, there will still be a relationship in the data between production and fluid, and vice versa.

19
Output
  • The following output shows the regression lines using just fluid or just sand in the model. The regression equation for each is given.
  • [Output from p. 575 of the text appears here.]

20
Output
  • This output is from the multiple linear regression.
  • The equation of the line uses both variables:
  • Production = -0.729 + 0.670 Fluid + 0.148 Sand
  • [Output from p. 576 of the text appears here.]

21
Solution
  • Multiple regression provides a way to resolve the
    issue.
  • Here we fit a model with sand and fluid. From
    this, we can determine which has an effect on
    production.
  • Whether we fit the multiple model or one of the simple models, the R2 is not particularly high in any case.
  • This indicates that there are other important factors affecting production that have not been included in the models.

22
Collinearity
  • When two independent variables are very strongly
    correlated, multiple regression may not be able
    to determine which is the important one.
  • In this case, the variables are said to be
    collinear.
  • The word collinear means to lie on the same line, and when two variables are highly correlated, their scatterplot is approximately a straight line.
  • The word multicollinearity is sometimes used as
    well, meaning that multiple variables are highly
    correlated with each other.
  • When collinearity is present, the set of independent variables is sometimes said to be ill-conditioned. (A small simulation illustrating the problem appears below.)
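  • A minimal simulation sketch of the problem: when x2 is nearly a copy of x1, the individual coefficient estimates have very large standard errors, so neither coefficient is well determined.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 40
    x1 = rng.uniform(0, 10, n)
    x2 = x1 + rng.normal(0, 0.05, n)       # x2 is almost identical to x1 (collinear)
    y = 3.0 + 2.0 * x1 + rng.normal(0, 1.0, n)

    print("correlation of x1 and x2:", np.corrcoef(x1, x2)[0, 1])

    X = np.column_stack([np.ones(n), x1, x2])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - 3)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    print("estimates:", beta_hat)
    print("standard errors:", se)          # inflated for the x1 and x2 coefficients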

23
Comments
  • Sometimes two variables are so correlated that
    multiple regression cannot determine which is
    responsible for the linear relationship with y.
  • In general, there is not much that can be done
    when variables are collinear.
  • The only way to fix the situation is to collect
    more data, including some values for the
    independent variables that are not on a straight
    line.

24
Section 8.3 Model Selection
  • There are many situations in which a large number
    of independent variables have been measured, and
    we need to decide which of them to include in the
    model.
  • This is the problem of model selection, and it is
    not an easy one.
  • Good model selection rests on a basic principle known as Occam's razor:
  • The best scientific model is the simplest model that explains the observed data.
  • In terms of linear models, Occam's razor implies the principle of parsimony:
  • A model should contain the smallest number of variables necessary to fit the data.

25
Some Exceptions
  • A linear model should always contain an
    intercept, unless physical theory dictates
    otherwise.
  • If a power xn of a variable is included in the model, all lower powers x, x2, …, xn-1 should be included as well, unless physical theory dictates otherwise.
  • If a product xy of two variables is included in a
    model, then the variables x and y should be
    included separately as well, unless physical
    theory dictates otherwise.

26
Notes
  • Models that include only the variables needed to
    fit the data are called parsimonious models.
  • Adding a variable to a model can substantially
    change the coefficients of the variables already
    in the model.

27
Can a Variable Be Dropped?
  • It often happens that one has formed a model that
    contains a large number of independent variables,
    and one wishes to determine whether a given
    subset of them may be dropped from the model
    without significantly reducing the accuracy of
    the model.
  • Assume that we know that the model yi = β0 + β1x1i + … + βkxki + βk+1x(k+1)i + … + βpxpi + εi is correct. We will call this the full model.
  • We wish to test the null hypothesis H0: βk+1 = … = βp = 0.
  • If H0 is true, the model will remain correct if we drop the variables xk+1, …, xp, so we can replace the full model with the following reduced model: yi = β0 + β1x1i + … + βkxki + εi.

28
Test Statistic
  • To develop a test statistic for H0, we begin by
    computing the error sums of squares for both the
    full and reduced models.
  • We call these SSfull and SSreduced, respectively.
  • The number of degrees of freedom for SSfull is n - p - 1, and for SSreduced it is n - k - 1.
  • The test statistic is f = [(SSreduced - SSfull)/(p - k)] / [SSfull/(n - p - 1)]; its null distribution is Fp-k,n-p-1.
  • If H0 is true, then f tends to be close to 1. If H0 is false, then f tends to be larger. (A sketch of this test appears below.)
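  • A minimal sketch, on simulated data in which two of three predictors are truly unrelated to y, of this F test for dropping a subset of variables (scipy assumed available for the P-value).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    n, p, k = 40, 3, 1                      # full model: 3 predictors; reduced model: 1
    X_full = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
    y = X_full @ [2.0, 1.5, 0.0, 0.0] + rng.normal(0, 1.0, n)

    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)

    SS_full = sse(X_full)
    SS_reduced = sse(X_full[:, : k + 1])    # intercept plus the first k predictors

    f = ((SS_reduced - SS_full) / (p - k)) / (SS_full / (n - p - 1))
    p_value = stats.f.sf(f, p - k, n - p - 1)
    print("f =", f, "P-value =", p_value)   # a large P-value supports dropping x2 and x3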

29
Comments
  • This method is very useful for developing
    parsimonious models by removing unnecessary
    variables. However, the conditions under which
    it is formally correct are rarely met.
  • More often, a large model is fit, some of the
    variables are seen to have fairly large P-values,
    and the F test is used to decide whether to drop
    them from the model.

30
Best Subsets Regression
  • Assume that there are p independent variables, x1, x2, …, xp, that are available to be put in the model.
  • Let's assume that we wish to find a good model that contains exactly four independent variables.
  • We can simply fit every possible model containing four of the variables, and rank them in order of their goodness of fit, as measured by the coefficient of determination, R2.
  • The subset of four variables that yields the largest value of R2 is the best subset of size four.
  • One can repeat the process for subsets of other sizes, finding the best subsets of size 1, 2, …, p.
  • These best subsets can be examined to see which provides a good fit, while being parsimonious. (A brute-force sketch of this search appears below.)
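  • A brute-force sketch, on simulated data with five candidate predictors, of best subsets regression: every subset of a given size is fit and ranked by R2.

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(9)
    n, p = 60, 5
    Z = rng.uniform(0, 10, (n, p))                      # candidate predictors
    y = 2.0 + 1.5 * Z[:, 0] - 0.8 * Z[:, 2] + rng.normal(0, 1.0, n)

    def r_squared(cols):
        X = np.column_stack([np.ones(n), Z[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

    size = 2
    best = max(combinations(range(p), size), key=r_squared)
    print("best subset of size", size, ":", best, " R^2 =", r_squared(best))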

31
Output
  • In the Minitab output, there are several columns.
  • The column Vars tells how many variables are in the model.
  • The second column is R-Sq; this is the R2 that we just discussed. Here we would always pick the full model, since that gives the largest R2.
  • The third column is Adj. R-Sq; this is an adjusted R2. This is a better measure of association, since it takes into account the number of variables in the model. Note that adjusted R2 = R2 - [k/(n - k - 1)](1 - R2), where k is the number of variables in the model.
  • The value of k for which the value of adjusted R2 is a maximum can be used to determine the number of variables in the model, and the best subset of that size can be chosen as a model. (A small computation of adjusted R2 appears below.)
  • The fourth column is C-p (Mallows' Cp); it is another way to determine the best model.
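  • A short computation of adjusted R2, using hypothetical values of R2, n, and k for illustration.

    def adjusted_r2(r2, n, k):
        """Adjusted R^2 = R^2 - [k / (n - k - 1)] * (1 - R^2)."""
        return r2 - (k / (n - k - 1)) * (1 - r2)

    # Hypothetical example: R^2 = 0.85 with k = 4 variables and n = 30 observations
    print(adjusted_r2(0.85, n=30, k=4))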

32
Stepwise Regression
  • This is the most widely used model selection technique.
  • Its main advantage over best subsets regression is that it is less computationally intensive, so it can be used in situations where there are a very large number of candidate independent variables and too many possible subsets for every one of them to be examined.
  • The user chooses two threshold P-values, αin and αout, with αin < αout.
  • The stepwise regression procedure begins with a step called a forward selection step, in which the independent variable with the smallest P-value is selected, provided that P < αin.
  • This variable is entered into the model, creating a model with a single independent variable.

33
More on Stepwise Regression
  • In the next step, the remaining variables are examined one at a time as candidates for the second variable in the model. The one with the smallest P-value is added to the model, again provided that P < αin.
  • Now, it is possible that adding the second variable to the model increased the P-value of the first variable. In the next step, called a backward elimination step, the first variable is dropped from the model if its P-value has grown to exceed the value αout.
  • The algorithm continues by alternating forward selection steps with backward elimination steps.
  • The algorithm terminates when no variables meet the criteria for being added to or dropped from the model. (A sketch of the procedure appears below.)
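  • A minimal sketch, on simulated data, of the stepwise procedure described above; the P-values come from the usual t tests on the coefficients, and the thresholds αin and αout are chosen arbitrarily for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(10)
    n, p = 80, 5
    Z = rng.uniform(0, 10, (n, p))
    y = 2.0 + 1.5 * Z[:, 0] - 0.8 * Z[:, 2] + rng.normal(0, 1.0, n)

    def coef_p_values(cols):
        """Two-sided t-test P-values for the coefficients of the variables in cols."""
        X = np.column_stack([np.ones(n), Z[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        df = n - X.shape[1]
        s2 = np.sum((y - X @ beta) ** 2) / df
        se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
        return 2 * stats.t.sf(np.abs(beta / se)[1:], df)   # skip the intercept

    alpha_in, alpha_out = 0.05, 0.10
    model = []
    while True:
        changed = False
        # Forward selection: add the candidate with the smallest P-value if P < alpha_in
        candidates = [j for j in range(p) if j not in model]
        if candidates:
            pvals = {j: coef_p_values(model + [j])[-1] for j in candidates}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_in:
                model.append(best)
                changed = True
        # Backward elimination: drop a variable whose P-value has grown past alpha_out
        if model:
            pvals = coef_p_values(model)
            worst = int(np.argmax(pvals))
            if pvals[worst] > alpha_out:
                model.pop(worst)
                changed = True
        if not changed:
            break

    print("selected variables:", model)   # should recover columns 0 and 2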

34
Notes on Model Selection
  • When there is little or no physical theory to
    rely on, many different models will fit the data
    about equally well.
  • The methods for choosing a model involve
    statistics, whose values depend on the data.
    Therefore, if the experiment is repeated, these
    statistics will come out differently, and
    different models may appear to be best.
  • Some or all of the independent variables in a
    selected model may not really be related to the
    dependent variable. Whenever possible,
    experiments should be repeated to test these
    apparent relationships.
  • Model selection is an art, not a science.

35
Summary
  • In this chapter, we learned about
  • multiple regression models
  • estimating the coefficients
  • checking assumptions in multiple regression
  • confounding and collinearity
  • model selection