1
Chapter 11: Simple linear regression and correlation
2
Empirical models
  • Many problems in engineering and science involve
    exploring the relationships between two or more
    variables. Regression analysis is a statistical
    technique that is very useful for these types of
    problems.
  • For example, in a chemical process, suppose that
    the yield of the product is related to the
    process-operating temperature. Regression
    analysis can be used to build a model to predict
    yield at a given temperature level. This model
    can also be used for process optimization, such
    as finding the level of temperature that
    maximizes yield, or for process control purposes.

3
Empirical models (Cont.)
  • In Table 11-1, y is the purity of oxygen produced in a chemical
    distillation process, and x is the percentage of hydrocarbons present
    in the main condenser of the distillation unit.

6
Empirical models (Cont.)
  • Although no simple curve will pass exactly
    through all the points, there is a strong
    indication that the points lie scattered randomly
    around a straight line.
  • It is probably reasonable to assume that the mean of the random
    variable Y is related to x by the following straight-line relationship:
  • E(Y|x) = μ_Y|x = β₀ + β₁x
  • where the slope β₁ and intercept β₀ are called the regression
    coefficients.

7
Empirical models (Cont.)
  • We can generalize this to a probabilistic linear model by assuming that
  • the expected value of Y is a linear function of x, and
  • for a fixed value of x, the actual value of Y is determined by the mean
    value function (the linear model) plus a random error:
  • Y = β₀ + β₁x + ε
  • where ε is the random error term.
  • We will call this model the simple linear regression model, because it
    has only one independent variable or regressor.
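  • As an illustration, here is a minimal Python sketch that simulates data
    from this model. The parameter values (β₀ = 74, β₁ = 15, σ = 1.1) and
    the range of x are hypothetical, chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical parameters, for illustration only (not from the text).
beta0, beta1, sigma = 74.0, 15.0, 1.1

# Fix the regressor values x, then generate Y = beta0 + beta1*x + eps,
# where the error eps has mean 0 and variance sigma^2 (taken normal here).
x = rng.uniform(0.9, 1.5, size=20)
eps = rng.normal(0.0, sigma, size=20)
y = beta0 + beta1 * x + eps
```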

8
Empirical models (Cont.)
  • Sometimes a model will arise from a theoretical
    relationship.
  • At other times, we will have no theoretical
    knowledge of the relationship between x and y,
    and the choice of the model is based on
    inspection of a scatter diagram. We then think
    of the regression model as an empirical model.

10
Empirical models (Cont.)
  • Suppose that we can fix the value of x and
    observe the value of the random variable Y.
  • If x is fixed, the random component ε on the right-hand side of the
    model determines the properties of Y.

11
Empirical models (Cont.)
  • Suppose that the mean and variance of ε are 0 and σ², respectively.
  • Thus, the true regression model
  • μ_Y|x = β₀ + β₁x
  • is a line of mean values; that is, the height of the regression line at
    any value of x is just the expected value of Y for that x.
  • The slope β₁ can be interpreted as the change in the mean of Y for a
    unit change in x. The variability of Y at a particular value of x is
    determined by the error variance σ².
  • This implies that there is a distribution of Y-values at each x and
    that the variance of this distribution is the same at each x.

13
Empirical models (Cont.)
  • In most real-world problems, the values of
  • The intercept and slope (β₀, β₁)
  • The error variance σ²
  • will not be known, and they must be estimated
    from sample data.
  • Then this fitted regression equation or model is
    typically used in prediction of future
    observations of Y, or for estimating the mean
    response at a particular level of x.

15
Simple linear regression
  • The case of simple linear regression considers a
    single regressor or predictor x and a dependent
    or response variable Y.
  • Suppose that the true relationship between Y and
    x is a straight line and that the observation Y
    at each level of x is a random variable.
  • Gauss proposed estimating the parameters β₀ and β₁ to minimize the sum
    of the squares of the vertical deviations.

17
Simple linear regression (Cont.)
  • (1) Estimating the intercept and slope
  • The least squares estimates of the intercept and slope in the simple
    linear regression model are
  • β̂₁ = Sxy / Sxx and β̂₀ = ȳ - β̂₁x̄
  • where
  • Sxx = Σᵢ(xᵢ - x̄)² and Sxy = Σᵢ(xᵢ - x̄)(yᵢ - ȳ), with x̄ and ȳ the
    sample means.
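  • A minimal sketch of these formulas in Python (NumPy), reusable on any
    paired data such as the simulated x and y above:

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the least squares line of y on x."""
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)           # Sxx
    sxy = np.sum((x - xbar) * (y - ybar))   # Sxy
    beta1_hat = sxy / sxx
    beta0_hat = ybar - beta1_hat * xbar
    return beta0_hat, beta1_hat
```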

18
Simple linear regression (Cont.)
  • The fitted or estimated regression line is therefore
  • ŷ = β̂₀ + β̂₁x
  • Note that each pair of observations satisfies the relationship
  • yᵢ = ŷᵢ + eᵢ, i = 1, 2, …, n
  • where eᵢ = yᵢ - ŷᵢ is called the residual. It describes the error in
    the fit of the model to the ith observation yᵢ.
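  • Continuing the sketch above, the fitted values and residuals follow
    directly from these relationships:

```python
beta0_hat, beta1_hat = least_squares_fit(x, y)

y_hat = beta0_hat + beta1_hat * x   # fitted values y^_i
e = y - y_hat                       # residuals e_i = y_i - y^_i
```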

19
Simple linear regression (Cont.)
  • It is convenient to express Sxx and Sxy in the computational forms
  • Sxx = Σᵢxᵢ² - (Σᵢxᵢ)²/n and Sxy = Σᵢxᵢyᵢ - (Σᵢxᵢ)(Σᵢyᵢ)/n
  • Example 11-1.

21
Simple linear regression (Cont.)
  • (2) Estimating σ² (the variance of the error term)
  • The error sum of squares of the response variable y is
    SSE = Σᵢ(yᵢ - ŷᵢ)² = Σᵢeᵢ²
  • This can be calculated using SSE = SST - β̂₁Sxy
  • where SST (the total sum of squares of the response variable y) can be
    calculated from SST = Σᵢ(yᵢ - ȳ)²
  • An unbiased estimator of σ² is σ̂² = SSE / (n - 2)
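  • Continuing the sketch, these quantities translate directly into code:

```python
import numpy as np

n = len(y)
sst = np.sum((y - y.mean()) ** 2)   # SST, total sum of squares
sse = np.sum(e ** 2)                # SSE; algebraically equal to SST - beta1_hat * Sxy
sigma2_hat = sse / (n - 2)          # unbiased estimator of sigma^2
```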

23
Adequacy of the regression model
  • Fitting a regression model requires several
    assumptions
  • Estimation of the model parameters requires the
    assumption that the errors are uncorrelated
    random variables with mean zero and constant
    variance.
  • Tests of hypotheses and interval estimation
    require that the errors be normally distributed.
  • In addition, we assume that the order of the model is correct; that
    is, if we fit a simple linear regression model, we are assuming that
    the phenomenon actually behaves in a linear or first-order manner.

24
Adequacy of the regression model (Cont.)
  • (1) Residual analysis
  • Analysis of the residuals is frequently helpful
    in
  • Checking the assumption that the errors are
    approximately normally distributed with constant
    variance.
  • Determining whether additional terms in the model
    would be useful.
  • As an approximate check of normality, the
    experimenter can construct a normal probability
    plot of residuals.
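  • A minimal sketch of such a check, using SciPy's probplot on the
    residuals e computed earlier (assuming matplotlib and SciPy are
    available):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Normal probability plot of the residuals; points falling roughly on a
# straight line support the approximate-normality assumption.
fig, ax = plt.subplots()
stats.probplot(e, dist="norm", plot=ax)
ax.set_title("Normal probability plot of residuals")
plt.show()
```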

26
Adequacy of the regression model (Cont.)
  • Probability plotting is a graphical method for
    determining whether sample data conform to a
    hypothesized distribution based on a subjective
    visual examination of the data.
  • Normal probability plots are widely used because many statistical
    techniques are appropriate only when the population is (at least
    approximately) normal.
  • If the hypothesized distribution adequately describes the data, the
    plotted points will fall approximately along a straight line; if the
    plotted points deviate significantly from a straight line, the
    hypothesized model is not appropriate.

32
Adequacy of the regression model (Cont.)
  • It is frequently helpful to plot the residuals
  • (1) in time sequence (if known),
  • (2) against the fitted values ŷᵢ, and
  • (3) against the independent variable x.
  • If the residuals appear as in (b), the variance of the observations
    may be increasing with time or with the magnitude of yᵢ or xᵢ.
  • Plots of residuals against ŷᵢ and xᵢ that look like (c) indicate
    inequality of variance; a sketch of these plots follows.
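  • A sketch of plots (2) and (3), using the y_hat and e from the running
    example (the lettered patterns (b), (c), (d) refer to the figure on
    the slide, which is not reproduced here):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(y_hat, e)                     # residuals vs fitted values
axes[0].axhline(0.0, color="gray", lw=0.8)
axes[0].set(xlabel="fitted value", ylabel="residual")
axes[1].scatter(x, e)                         # residuals vs regressor x
axes[1].axhline(0.0, color="gray", lw=0.8)
axes[1].set(xlabel="x", ylabel="residual")
plt.show()
```

  • A structureless horizontal band around zero supports the
    constant-variance assumption; a funnel or curved pattern suggests the
    problems described above.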

34
Adequacy of the regression model (Cont.)
  • Widely used variance-stabilizing transformations include the use of
    √y, ln y, or 1/y as the response (see the sketch after this list).
  • Residual plots that look like (d) indicate model inadequacy; that is,
    higher-order terms should be added to the model, a transformation on
    the x-variable or the y-variable (or both) should be considered, or
    other regressors should be considered.
  • Example 11-7
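  • A brief sketch of applying such a transformation and refitting, under
    the assumption that the response y is strictly positive:

```python
import numpy as np

# Refit with a transformed response when the residual plots show a funnel
# pattern; which transformation works best is an empirical question.
y_log = np.log(y)                          # ln y (requires y > 0)
b0_t, b1_t = least_squares_fit(x, y_log)   # then re-examine the residual plots
```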

36
Adequacy of the regression model (Cont.)
  • (2) Coefficient of determination R²
  • It is often used to judge the adequacy of a regression model.
  • 0 ≤ R² ≤ 1
  • From Example 11-1, R² = 0.877; that is, the model accounts for 87.7%
    of the variability in the data.
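  • R² can be computed from the sums of squares already on hand
    (equivalently R² = SSR/SST = 1 - SSE/SST):

```python
r2 = 1.0 - sse / sst   # coefficient of determination, 0 <= R^2 <= 1
print(f"R^2 = {r2:.3f}")
```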

37
Adequacy of the regression model (Cont.)
  • R² does not measure the magnitude of the slope of the regression line.
  • R² does not measure the appropriateness of the model, since it can be
    artificially inflated by adding higher-order polynomial terms in x to
    the model. Even if y and x are related in a nonlinear fashion, R² will
    often be large.
  • Even though R² is large, this does not necessarily imply that the
    regression model will provide accurate predictions of future
    observations.

39
Significance of regression
  • An important part of assessing the adequacy of a linear regression
    model is testing statistical hypotheses about the model parameters and
    constructing certain confidence intervals.
  • The hypotheses of interest here are H₀: β₁ = 0 versus H₁: β₁ ≠ 0.

40
The hypothesis H₀: β₁ = 0 is accepted: there is no evidence of a linear
relationship between x and Y.
41
The hypothesis H₀: β₁ = 0 is rejected. This means either (1) that the
straight-line model is adequate, or (2) that although there is a linear
effect of x, better results could be obtained by adding higher-order
polynomial terms in x. A sketch of the usual t-test follows.
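The test statistic itself appears only on an image slide; the standard
t-statistic for H₀: β₁ = 0 is t₀ = β̂₁ / √(σ̂²/Sxx), referred to the t
distribution with n - 2 degrees of freedom. A sketch, continuing the
running example:

```python
import numpy as np
from scipy import stats

sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)          # estimated standard error of beta1_hat
t0 = beta1_hat / se_beta1                     # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)   # two-sided p-value
```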
43
Analysis of Variance Approach
  • The analysis of variance identity is as follows:
  • Σᵢ(yᵢ - ȳ)² = Σᵢ(ŷᵢ - ȳ)² + Σᵢ(yᵢ - ŷᵢ)²
  • The two components on the right-hand side measure
  • the amount of variability in yᵢ accounted for by the regression line,
    and
  • the residual variation left unexplained by the regression line.

44
Analysis of Variance Approach (Cont.)
  • SST = SSR + SSE
  • Total corrected sum of squares = regression sum of squares + error sum
    of squares.
  • SSE = SST - SSR = SST - β̂₁Sxy

45
Analysis of Variance Approach (Cont.)
  • If the null hypothesis H₀: β₁ = 0 is true, the statistic
  • F₀ = (SSR/1) / (SSE/(n - 2)) = MSR/MSE
  • follows the F(1, n-2) distribution, and we would reject H₀ if
    f₀ > f(α, 1, n-2).
  • MSR and MSE are called mean squares. (In general, a mean square is
    always computed by dividing a sum of squares by its number of degrees
    of freedom.)
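  • A sketch of the F test on the running example; in simple linear
    regression f₀ equals t₀² from the t-test above:

```python
from scipy import stats

ssr = sst - sse            # SSR, regression sum of squares
msr = ssr / 1.0            # MSR (1 degree of freedom)
mse = sse / (n - 2)        # MSE
f0 = msr / mse             # observed F statistic
p_value = stats.f.sf(f0, dfn=1, dfd=n - 2)   # reject H0 for small p-values
```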

46
Analysis of Variance Approach (Cont.)
  • Example 11-3.

47
The F distribution
  • Suppose that two independent normal populations are of interest.
  • The random variable F is defined as the ratio of two independent
    chi-square random variables (W, Y), each divided by its number of
    degrees of freedom (u, v).
  • Then the ratio
  • F = (W/u) / (Y/v)
  • is said to follow the F distribution with u degrees of freedom in the
    numerator and v degrees of freedom in the denominator.
  • It is usually abbreviated as F(u, v).
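  • An empirical check of this definition: simulate ratios of scaled
    chi-square variables and compare with the theoretical F mean v/(v - 2),
    valid for v > 2. The degrees of freedom u = 5, v = 10 are arbitrary
    choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = 5, 10
w = rng.chisquare(u, size=100_000)     # W ~ chi-square(u)
yv = rng.chisquare(v, size=100_000)    # Y ~ chi-square(v), independent of W
f_samples = (w / u) / (yv / v)         # F = (W/u) / (Y/v) ~ F(u, v)
print(f_samples.mean(), v / (v - 2))   # sample mean vs theoretical mean
```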

51
ANNOUNCEMENTS
  • Homework X
  • 11-1, 11-2, 11-3, 11-5