Title: Chapter 11 Simple linear regression and correlation
1 Chapter 11: Simple linear regression and correlation
2 Empirical models
- Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis is a statistical technique that is very useful for these types of problems.
- For example, in a chemical process, suppose that the yield of the product is related to the process-operating temperature. Regression analysis can be used to build a model to predict yield at a given temperature level. This model can also be used for process optimization, such as finding the temperature level that maximizes yield, or for process control purposes.
3 Empirical models (Cont.)
- In the data of Table 11-1, y is the purity of oxygen produced in a chemical distillation process, and x is the percentage of hydrocarbons present in the main condenser of the distillation unit.
6 Empirical models (Cont.)
- Although no simple curve will pass exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line.
- It is probably reasonable to assume that the mean of the random variable Y is related to x by the following straight-line relationship:
  E(Y|x) = μ_{Y|x} = β₀ + β₁x
- where the slope β₁ and intercept β₀ are called regression coefficients.
7 Empirical models (Cont.)
- We can generalize to a probabilistic linear model by assuming that:
- The expected value of Y is a linear function of x.
- For a fixed value of x, the actual value of Y is determined by the mean value function (the linear model) plus a random error:
  Y = β₀ + β₁x + ε
- where ε is the random error term.
- We will call this model the simple linear regression model, because it has only one independent variable, or regressor.
8 Empirical models (Cont.)
- Sometimes a model will arise from a theoretical relationship.
- At other times, we will have no theoretical knowledge of the relationship between x and y, and the choice of the model is based on inspection of a scatter diagram. We then think of the regression model as an empirical model.
10 Empirical models (Cont.)
- Suppose that we can fix the value of x and observe the value of the random variable Y.
- If x is fixed, the random component ε on the right-hand side of the model determines the properties of Y.
11 Empirical models (Cont.)
- Suppose that the mean and variance of ε are 0 and σ², respectively.
- Thus, the true regression model
  μ_{Y|x} = β₀ + β₁x
  is a line of mean values; that is, the height of the regression line at any value of x is just the expected value of Y for that x.
- The slope β₁ can be interpreted as the change in the mean of Y for a unit change in x. The variability of Y at a particular value of x is determined by the error variance σ².
- This implies that there is a distribution of Y-values at each x and that the variance of this distribution is the same at each x; a small simulation sketch below illustrates this.
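To make this concrete, here is a minimal NumPy simulation sketch of the model Y = β₀ + β₁x + ε at a few fixed x levels; the parameter values and x levels are illustrative assumptions, not values from the text.

    # Minimal sketch: simulate Y = b0 + b1*x + eps with eps ~ N(0, sigma^2).
    # All numbers below are hypothetical, chosen only for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    b0, b1, sigma = 75.0, 15.0, 1.0                  # hypothetical intercept, slope, error SD

    x = np.repeat(np.array([1.0, 1.2, 1.4]), 200)    # repeated observations at fixed x levels
    y = b0 + b1 * x + rng.normal(0.0, sigma, size=x.size)

    # At each fixed x, the Y-values are centered at b0 + b1*x and share the same spread sigma.
    for level in np.unique(x):
        ys = y[x == level]
        print(level, round(ys.mean(), 2), round(ys.std(ddof=1), 2))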
13 Empirical models (Cont.)
- In most real-world problems, the values of
  - the intercept and slope (β₀, β₁)
  - the error variance σ²
  will not be known, and they must be estimated from sample data.
- Then this fitted regression equation or model is typically used in prediction of future observations of Y, or for estimating the mean response at a particular level of x.
15 Simple linear regression
- The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.
- Suppose that the true relationship between Y and x is a straight line and that the observation Y at each level of x is a random variable.
- Gauss proposed estimating the parameters β₀ and β₁ to minimize the sum of the squares of the vertical deviations (the criterion is written out below).
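In symbols, the least-squares criterion chooses β₀ and β₁ to minimize the sum of squared vertical deviations:

    L(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2.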
17 Simple linear regression (Cont.)
- (1) Estimating the intercept and slope
- The least squares estimates of the intercept and slope in the simple linear regression model are given by the formulas below.
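The standard least-squares estimates are

    \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},

where

    S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).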
18 Simple linear regression (Cont.)
- The fitted or estimated regression line is therefore ŷ = β̂₀ + β̂₁x.
- Note that each pair of observations satisfies the relationship yᵢ = β̂₀ + β̂₁xᵢ + eᵢ, where eᵢ = yᵢ - ŷᵢ is called the residual. It describes the error in the fit of the model to the ith observation yᵢ.
19 Simple linear regression (Cont.)
- It is convenient to express these sums in the computing forms sketched below.
- Example 11-1.
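Assuming this slide shows the usual shortcut computing forms for the sums used in the slope estimate, they are

    S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}, \qquad
    S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right)}{n}.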
21 Simple linear regression (Cont.)
- (2) Estimating σ² (the variance of the error term)
- The error sum of squares of the response variable y is SSE = Σᵢ eᵢ² = Σᵢ (yᵢ - ŷᵢ)².
- This can be calculated using SSE = SST - β̂₁Sxy, where SST (the total sum of squares of the response variable y) can be calculated from SST = Σᵢ (yᵢ - ȳ)².
- An unbiased estimator of σ² is σ̂² = SSE / (n - 2). (A NumPy sketch of these calculations follows.)
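A minimal NumPy sketch of the whole calculation, using hypothetical placeholder data (not the Table 11-1 values):

    # Minimal sketch: least-squares fit "by hand" with NumPy.
    # The x and y arrays are hypothetical placeholders, not the Table 11-1 data.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    n = x.size

    Sxx = np.sum((x - x.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))

    b1 = Sxy / Sxx                     # slope estimate
    b0 = y.mean() - b1 * x.mean()      # intercept estimate

    y_hat = b0 + b1 * x                # fitted values
    e = y - y_hat                      # residuals

    SST = np.sum((y - y.mean()) ** 2)  # total sum of squares
    SSE = np.sum(e ** 2)               # error sum of squares, equals SST - b1 * Sxy
    sigma2_hat = SSE / (n - 2)         # unbiased estimator of the error variance
    print(b0, b1, sigma2_hat)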
23 Adequacy of the regression model
- Fitting a regression model requires several assumptions:
- Estimation of the model parameters requires the assumption that the errors are uncorrelated random variables with mean zero and constant variance.
- Tests of hypotheses and interval estimation require that the errors be normally distributed.
- In addition, we assume that the order of the model is correct; that is, if we fit a simple linear regression model, we are assuming that the phenomenon actually behaves in a linear or first-order manner.
24 Adequacy of the regression model (Cont.)
- (1) Residual analysis
- Analysis of the residuals is frequently helpful in:
- Checking the assumption that the errors are approximately normally distributed with constant variance.
- Determining whether additional terms in the model would be useful.
- As an approximate check of normality, the experimenter can construct a normal probability plot of the residuals (see the plotting sketch below).
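A minimal plotting sketch, assuming SciPy and Matplotlib are available; the residual values are hypothetical:

    # Minimal sketch: normal probability plot of residuals (hypothetical values).
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats

    e = np.array([0.12, -0.31, 0.05, 0.22, -0.18, 0.09, -0.04, 0.06])  # hypothetical residuals

    stats.probplot(e, dist="norm", plot=plt)  # points near a straight line suggest normality
    plt.title("Normal probability plot of residuals")
    plt.show()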
26 Adequacy of the regression model (Cont.)
- Probability plotting is a graphical method for determining whether sample data conform to a hypothesized distribution, based on a subjective visual examination of the data.
- Normal probability plots are used most often because many statistical techniques are appropriate only when the population is (at least approximately) normal.
- If the hypothesized distribution adequately describes the data, the plotted points will fall approximately along a straight line; if the plotted points deviate significantly from a straight line, the hypothesized model is not appropriate.
32 Adequacy of the regression model (Cont.)
- It is frequently helpful to plot the residuals (a plotting sketch follows this list):
- (1) in time sequence (if known),
- (2) against the fitted values ŷᵢ, and
- (3) against the independent variable x.
- If the residuals appear as in (b), the variance of the observations may be increasing with time or with the magnitude of yᵢ or xᵢ.
- Plots of residuals against yᵢ and xᵢ that look like (c) indicate inequality of variance.
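A minimal sketch of residual plots against the fitted values and against x, assuming Matplotlib is available; the arrays are hypothetical:

    # Minimal sketch: residuals vs. fitted values and residuals vs. x (hypothetical data).
    import matplotlib.pyplot as plt
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical regressor values
    y_hat = np.array([2.2, 4.1, 6.0, 7.9, 9.8, 11.7])  # hypothetical fitted values
    e = np.array([-0.1, -0.2, 0.2, -0.1, 0.3, 0.2])    # hypothetical residuals

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(y_hat, e)
    ax1.axhline(0.0)
    ax1.set_xlabel("fitted value")
    ax1.set_ylabel("residual")
    ax2.scatter(x, e)
    ax2.axhline(0.0)
    ax2.set_xlabel("x")
    ax2.set_ylabel("residual")
    plt.tight_layout()
    plt.show()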
34 Adequacy of the regression model (Cont.)
- Widely used variance-stabilizing transformations include the use of √y, ln(y), or 1/y as the response.
- Residual plots that look like (d) indicate model inadequacy; that is, higher order terms should be added to the model, a transformation on the x-variable or the y-variable (or both) should be considered, or other regressors should be considered.
- Example 11-7
36 Adequacy of the regression model (Cont.)
- (2) Coefficient of determination R²
- R² is often used to judge the adequacy of a regression model (its definition in terms of the sums of squares is recalled below).
- 0 ≤ R² ≤ 1
- From Example 11-1, R² = 0.877; that is, the model accounts for 87.7% of the variability in the data.
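For reference, R² is defined from the error and total sums of squares introduced on slide 21 (SSR, the regression sum of squares, appears in the ANOVA section below):

    R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.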
37 Adequacy of the regression model (Cont.)
- R² does not measure the magnitude of the slope of the regression line.
- R² does not measure the appropriateness of the model, since it can be artificially inflated by adding higher order polynomial terms in x to the model. Even if y and x are related in a nonlinear fashion, R² will often be large.
- Even though R² is large, this does not necessarily imply that the regression model will provide accurate predictions of future observations.
39 Significance of regression
- An important part of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing certain confidence intervals.
- The hypotheses of interest and the corresponding test statistic are sketched below.
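The standard test of significance of regression uses

    H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0,
    \qquad T_0 = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}},

and H₀ is rejected at level α if |t₀| > t_{α/2, n-2}.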
40 If the hypothesis H₀: β₁ = 0 is not rejected, this indicates that there is no linear relationship between x and Y.
41 If the hypothesis H₀: β₁ = 0 is rejected, this indicates either (1) that the straight-line model is adequate, or (2) that although there is a linear effect of x, better results could be obtained with the addition of higher order polynomial terms in x.
43 Analysis of Variance Approach
- The analysis of variance identity is written out below.
- The two components on the right-hand side measure:
- the amount of variability in yᵢ accounted for by the regression line, and
- the residual variation left unexplained by the regression line.
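The standard analysis of variance identity for simple linear regression is

    \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,

that is, SST = SSR + SSE, as the next slide states.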
44 Analysis of Variance Approach (Cont.)
- SST = SSR + SSE
- Total corrected sum of squares = regression sum of squares + error sum of squares.
- SSE = SST - SSR = SST - β̂₁Sxy
45 Analysis of Variance Approach (Cont.)
- If the null hypothesis H₀: β₁ = 0 is true, the statistic F₀ = MSR / MSE = (SSR / 1) / (SSE / (n - 2)) follows the F distribution with 1 and n - 2 degrees of freedom, and we would reject H₀ if f₀ > f_{α,1,n-2}.
- MSR and MSE are called mean squares. (In general, a mean square is always computed by dividing a sum of squares by its number of degrees of freedom.)
46 Analysis of Variance Approach (Cont.)
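A sketch of the standard ANOVA table for testing significance of regression (presumably what this slide displays):

    Source of variation   Sum of squares   Degrees of freedom   Mean square   F0
    Regression            SSR              1                    MSR           MSR/MSE
    Error                 SSE              n - 2                MSE
    Total                 SST              n - 1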
47 The F distribution
- Suppose that two independent normal populations are of interest.
- The random variable F is defined as the ratio of two independent chi-square random variables W and Y, each divided by its number of degrees of freedom, u and v respectively:
  F = (W / u) / (Y / v)
- This ratio is said to follow the F distribution with u degrees of freedom in the numerator and v degrees of freedom in the denominator. It is usually abbreviated as F_{u,v}.
51 ANNOUNCEMENTS
- Homework X
- 11-1, 11-2, 11-3, 11-5