Title: Chapter 11 Simple linear regression and correlation
1 Chapter 11: Simple linear regression and correlation
2 Empirical models
- Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis is a statistical technique that is very useful for these types of problems.
- For example, in a chemical process, suppose that the yield of the product is related to the process-operating temperature. Regression analysis can be used to build a model to predict yield at a given temperature level. This model can also be used for process optimization, such as finding the temperature level that maximizes yield, or for process control purposes.
3 Empirical models (Cont.)
- In the data of Table 11-1, y is the purity of oxygen produced in a chemical distillation process, and x is the percentage of hydrocarbons present in the main condenser of the distillation unit.
6 Empirical models (Cont.)
- Although no simple curve will pass exactly through all the points, there is a strong indication that the points lie scattered randomly around a straight line.
- It is probably reasonable to assume that the mean of the random variable Y is related to x by the following straight-line relationship:
  E(Y|x) = μ_{Y|x} = β₀ + β₁x
- where the slope β₁ and intercept β₀ are called regression coefficients.
7 Empirical models (Cont.)
- We can generalize to a probabilistic linear model by assuming that:
- The expected value of Y is a linear function of x.
- For a fixed value of x, the actual value of Y is determined by the mean value function (the linear model) plus a random error:
  Y = β₀ + β₁x + ε
- where ε is the random error term.
- We will call this model the simple linear regression model, because it has only one independent variable, or regressor.
8 Empirical models (Cont.)
- Sometimes a model will arise from a theoretical relationship.
- At other times, we will have no theoretical knowledge of the relationship between x and y, and the choice of the model is based on inspection of a scatter diagram. We then think of the regression model as an empirical model.
10 Empirical models (Cont.)
- Suppose that we can fix the value of x and observe the value of the random variable Y.
- If x is fixed, the random component ε on the right-hand side of the model determines the properties of Y.
11 Empirical models (Cont.)
- Suppose that the mean and variance of ε are 0 and σ², respectively.
- Thus, the true regression model
  μ_{Y|x} = β₀ + β₁x
  is a line of mean values; that is, the height of the regression line at any value of x is just the expected value of Y for that x.
- The slope β₁ can be interpreted as the change in the mean of Y for a unit change in x. The variability of Y at a particular value of x is determined by the error variance σ².
- This implies that there is a distribution of Y-values at each x and that the variance of this distribution is the same at each x; a small simulation sketch below illustrates this.
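To make this concrete, here is a minimal NumPy simulation sketch of the model Y = β₀ + β₁x + ε at a few fixed x levels; the parameter values and x levels are illustrative assumptions, not values from the text.

    # Minimal sketch: simulate Y = b0 + b1*x + eps with eps ~ N(0, sigma^2).
    # All numbers below are hypothetical, chosen only for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    b0, b1, sigma = 75.0, 15.0, 1.0                  # hypothetical intercept, slope, error SD

    x = np.repeat(np.array([1.0, 1.2, 1.4]), 200)    # repeated observations at fixed x levels
    y = b0 + b1 * x + rng.normal(0.0, sigma, size=x.size)

    # At each fixed x, the Y-values are centered at b0 + b1*x and share the same spread sigma.
    for level in np.unique(x):
        ys = y[x == level]
        print(level, round(ys.mean(), 2), round(ys.std(ddof=1), 2))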
13 Empirical models (Cont.)
- In most real-world problems, the values of
  - the intercept and slope (β₀, β₁)
  - the error variance σ²
  will not be known, and they must be estimated from sample data.
- Then this fitted regression equation or model is typically used in prediction of future observations of Y, or for estimating the mean response at a particular level of x.
15 Simple linear regression
- The case of simple linear regression considers a single regressor or predictor x and a dependent or response variable Y.
- Suppose that the true relationship between Y and x is a straight line and that the observation Y at each level of x is a random variable.
- Gauss proposed estimating the parameters β₀ and β₁ to minimize the sum of the squares of the vertical deviations (the criterion is written out below).
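In symbols, the least-squares criterion chooses β₀ and β₁ to minimize the sum of squared vertical deviations:

    L(\beta_0, \beta_1) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2.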
17 Simple linear regression (Cont.)
- (1) Estimating the intercept and slope
- The least squares estimates of the intercept and slope in the simple linear regression model are given by the formulas below.
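The standard least-squares estimates are

    \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},

where

    S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).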
18 Simple linear regression (Cont.)
- The fitted or estimated regression line is therefore ŷ = β̂₀ + β̂₁x.
- Note that each pair of observations satisfies the relationship yᵢ = β̂₀ + β̂₁xᵢ + eᵢ, where eᵢ = yᵢ - ŷᵢ is called the residual. It describes the error in the fit of the model to the ith observation yᵢ.
19 Simple linear regression (Cont.)
- It is convenient to express these sums in the computing forms sketched below.
- Example 11-1.
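Assuming this slide shows the usual shortcut computing forms for the sums used in the slope estimate, they are

    S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}, \qquad
    S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right)}{n}.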
21 Simple linear regression (Cont.)
- (2) Estimating σ² (the variance of the error term)
- The error sum of squares of the response variable y is SSE = Σᵢ eᵢ² = Σᵢ (yᵢ - ŷᵢ)².
- This can be calculated using SSE = SST - β̂₁Sxy, where SST (the total sum of squares of the response variable y) can be calculated from SST = Σᵢ (yᵢ - ȳ)².
- An unbiased estimator of σ² is σ̂² = SSE / (n - 2). (A NumPy sketch of these calculations follows.)
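A minimal NumPy sketch of the whole calculation, using hypothetical placeholder data (not the Table 11-1 values):

    # Minimal sketch: least-squares fit "by hand" with NumPy.
    # The x and y arrays are hypothetical placeholders, not the Table 11-1 data.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
    n = x.size

    Sxx = np.sum((x - x.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))

    b1 = Sxy / Sxx                     # slope estimate
    b0 = y.mean() - b1 * x.mean()      # intercept estimate

    y_hat = b0 + b1 * x                # fitted values
    e = y - y_hat                      # residuals

    SST = np.sum((y - y.mean()) ** 2)  # total sum of squares
    SSE = np.sum(e ** 2)               # error sum of squares, equals SST - b1 * Sxy
    sigma2_hat = SSE / (n - 2)         # unbiased estimator of the error variance
    print(b0, b1, sigma2_hat)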
23 Adequacy of the regression model
- Fitting a regression model requires several assumptions:
- Estimation of the model parameters requires the assumption that the errors are uncorrelated random variables with mean zero and constant variance.
- Tests of hypotheses and interval estimation require that the errors be normally distributed.
- In addition, we assume that the order of the model is correct; that is, if we fit a simple linear regression model, we are assuming that the phenomenon actually behaves in a linear or first-order manner.
24 Adequacy of the regression model (Cont.)
- (1) Residual analysis
- Analysis of the residuals is frequently helpful in:
- Checking the assumption that the errors are approximately normally distributed with constant variance.
- Determining whether additional terms in the model would be useful.
- As an approximate check of normality, the experimenter can construct a normal probability plot of the residuals (see the plotting sketch below).
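A minimal plotting sketch, assuming SciPy and Matplotlib are available; the residual values are hypothetical:

    # Minimal sketch: normal probability plot of residuals (hypothetical values).
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats

    e = np.array([0.12, -0.31, 0.05, 0.22, -0.18, 0.09, -0.04, 0.06])  # hypothetical residuals

    stats.probplot(e, dist="norm", plot=plt)  # points near a straight line suggest normality
    plt.title("Normal probability plot of residuals")
    plt.show()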
26 Adequacy of the regression model (Cont.)
- Probability plotting is a graphical method for determining whether sample data conform to a hypothesized distribution, based on a subjective visual examination of the data.
- Normal probability plots are used most often because many statistical techniques are appropriate only when the population is (at least approximately) normal.
- If the hypothesized distribution adequately describes the data, the plotted points will fall approximately along a straight line; if the plotted points deviate significantly from a straight line, the hypothesized model is not appropriate.
32 Adequacy of the regression model (Cont.)
- It is frequently helpful to plot the residuals (a plotting sketch follows this list):
- (1) in time sequence (if known),
- (2) against the fitted values ŷᵢ, and
- (3) against the independent variable x.
- If the residuals appear as in (b), the variance of the observations may be increasing with time or with the magnitude of yᵢ or xᵢ.
- Plots of residuals against yᵢ and xᵢ that look like (c) indicate inequality of variance.
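A minimal sketch of residual plots against the fitted values and against x, assuming Matplotlib is available; the arrays are hypothetical:

    # Minimal sketch: residuals vs. fitted values and residuals vs. x (hypothetical data).
    import matplotlib.pyplot as plt
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical regressor values
    y_hat = np.array([2.2, 4.1, 6.0, 7.9, 9.8, 11.7])  # hypothetical fitted values
    e = np.array([-0.1, -0.2, 0.2, -0.1, 0.3, 0.2])    # hypothetical residuals

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(y_hat, e)
    ax1.axhline(0.0)
    ax1.set_xlabel("fitted value")
    ax1.set_ylabel("residual")
    ax2.scatter(x, e)
    ax2.axhline(0.0)
    ax2.set_xlabel("x")
    ax2.set_ylabel("residual")
    plt.tight_layout()
    plt.show()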
34 Adequacy of the regression model (Cont.)
- Widely used variance-stabilizing transformations include the use of √y, ln(y), or 1/y as the response.
- Residual plots that look like (d) indicate model inadequacy; that is, higher order terms should be added to the model, a transformation on the x-variable or the y-variable (or both) should be considered, or other regressors should be considered.
- Example 11-7
36 Adequacy of the regression model (Cont.)
- (2) Coefficient of determination R²
- R² is often used to judge the adequacy of a regression model (its definition in terms of the sums of squares is recalled below).
- 0 ≤ R² ≤ 1
- From Example 11-1, R² = 0.877; that is, the model accounts for 87.7% of the variability in the data.
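For reference, R² is defined from the error and total sums of squares introduced on slide 21 (SSR, the regression sum of squares, appears in the ANOVA section below):

    R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.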
37 Adequacy of the regression model (Cont.)
- R² does not measure the magnitude of the slope of the regression line.
- R² does not measure the appropriateness of the model, since it can be artificially inflated by adding higher order polynomial terms in x to the model. Even if y and x are related in a nonlinear fashion, R² will often be large.
- Even though R² is large, this does not necessarily imply that the regression model will provide accurate predictions of future observations.
39 Significance of regression
- An important part of assessing the adequacy of a linear regression model is testing statistical hypotheses about the model parameters and constructing certain confidence intervals.
- The hypotheses of interest and the corresponding test statistic are sketched below.
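The standard test of significance of regression uses

    H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0,
    \qquad T_0 = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}},

and H₀ is rejected at level α if |t₀| > t_{α/2, n-2}.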
40 If the hypothesis H₀: β₁ = 0 is not rejected, this indicates that there is no linear relationship between x and Y.
41 If the hypothesis H₀: β₁ = 0 is rejected, this indicates either (1) that the straight-line model is adequate, or (2) that although there is a linear effect of x, better results could be obtained with the addition of higher order polynomial terms in x.
43 Analysis of Variance Approach
- The analysis of variance identity is written out below.
- The two components on the right-hand side measure:
- the amount of variability in yᵢ accounted for by the regression line, and
- the residual variation left unexplained by the regression line.
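The standard analysis of variance identity for simple linear regression is

    \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,

that is, SST = SSR + SSE, as the next slide states.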
44 Analysis of Variance Approach (Cont.)
- SST = SSR + SSE
- Total corrected sum of squares = regression sum of squares + error sum of squares.
- SSE = SST - SSR = SST - β̂₁Sxy
45 Analysis of Variance Approach (Cont.)
- If the null hypothesis H₀: β₁ = 0 is true, the statistic F₀ = MSR / MSE = (SSR / 1) / (SSE / (n - 2)) follows the F distribution with 1 and n - 2 degrees of freedom, and we would reject H₀ if f₀ > f_{α,1,n-2}.
- MSR and MSE are called mean squares. (In general, a mean square is always computed by dividing a sum of squares by its number of degrees of freedom.)
46 Analysis of Variance Approach (Cont.)
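A sketch of the standard ANOVA table for testing significance of regression (presumably what this slide displays):

    Source of variation   Sum of squares   Degrees of freedom   Mean square   F0
    Regression            SSR              1                    MSR           MSR/MSE
    Error                 SSE              n - 2                MSE
    Total                 SST              n - 1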
47 The F distribution
- Suppose that two independent normal populations are of interest.
- The random variable F is defined as the ratio of two independent chi-square random variables W and Y, each divided by its number of degrees of freedom, u and v respectively:
  F = (W / u) / (Y / v)
- This ratio is said to follow the F distribution with u degrees of freedom in the numerator and v degrees of freedom in the denominator. It is usually abbreviated as F_{u,v}.
51 ANNOUNCEMENTS
- Homework X
- 11-1, 11-2, 11-3, 11-5