Title: Simple Linear Regression and Correlation: Inferential Method
1Chapter 13
- Simple Linear Regression and Correlation
Inferential Method
213.1 Simple Linear Regression Model
- Deterministic Model and Probabilistic Model
- Deterministic Model: The value of y is completely
determined by the value of an independent
variable x: y = f(x).
- Probabilistic Model: The variables of interest (y
and x) are not deterministically related. The
equation of the additive probabilistic model is
y = (deterministic function of x) + (random deviation) = f(x) + e
3The Simple Linear Regression Model
- The simple linear regression model assumes that
there is a line with y-intercept α and slope β,
called the true or population regression line.
When a value of the independent variable x is
fixed and an observation on the dependent
variable y is made,
y = α + βx + e
- Without the random deviation e, all observed (x,
y) points would fall exactly on the population
regression line. The inclusion of e in the model
equation recognizes that points will deviate from
the line.
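The model equation can be simulated directly. The sketch below is an illustration only and is not part of the slides: the values chosen for alpha, beta, and sigma are arbitrary, the function name simulate_y is hypothetical, and the random deviation e is assumed to be normal with mean 0 and standard deviation sigma.

```python
import random

def simulate_y(x, alpha, beta, sigma, rng):
    """Draw one observation y = alpha + beta*x + e, with e ~ Normal(0, sigma)."""
    e = rng.gauss(0.0, sigma)  # random deviation from the population line
    return alpha + beta * x + e

rng = random.Random(1)               # fixed seed so the run is repeatable
alpha, beta, sigma = 2.0, 0.5, 1.0   # illustrative values only

# Two observations at the same fixed x differ only through e.
for x in (10, 10, 20):
    y = simulate_y(x, alpha, beta, sigma, rng)
    print(f"x = {x:2d}  line height = {alpha + beta * x:.2f}  observed y = {y:.2f}")

# With sigma = 0 there is no random deviation, so the point lies on the line.
on_line = simulate_y(10, alpha, beta, 0.0, rng)
```

Setting sigma to 0 reproduces the deterministic case: every simulated point then sits exactly at height alpha + beta*x.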
4Two observations resulting from the simple linear
regression model
5Basic Assumptions of the Simple Linear Regression
Model
- The distribution of e at any particular x value
has mean value 0. That is, µ_e = 0.
- The standard deviation of e (which describes the
spread of its distribution) is the same for any
particular value of x. This standard deviation is
denoted by σ.
- The distribution of e at any particular x value
is normal.
- The random deviations e_1, e_2, ..., e_n associated
with different observations are independent of
one another.
6The distribution of y values in repeated sampling
- For any fixed x value, y itself has a normal
distribution, with
µ_y (mean y value for fixed x) = (height of the
population regression line above x) = α + βx
and
σ_y (standard deviation of y for a fixed x) = σ
- The slope β of the population regression line is
the average change in y associated with a 1-unit
increase in x.
- The y-intercept α is the height of the population
line when x = 0.
- The value of σ determines the extent to which (x,
y) observations deviate from the population line.
- When σ is small, most observations are quite
close to the line.
- When σ is large, there are likely to be some
substantial deviations.
7Simple Linear Regression Model
8Example Stand on Your Head to Lose Weight?
- Amateur wrestlers who are overweight near the end
of the weight certification period, but just
barely so, have been known to stand on their
heads for a minute or two, get on their feet,
step back on the scale, and establish that they
are in the desired weight class. Using a
headstand as the method of last resort has become a
fairly common practice in amateur wrestling. Does
this really work?
- Data were collected in an experiment where weight
loss was recorded for each wrestler after
exercising for 15 min and then doing a headstand
for 1 min 45 sec.
- Based on the data, it was concluded that there
was in fact a demonstrable weight loss that was
greater than that for a control group that
exercised for 15 min but did not do the headstand.
Continued on next page
9Example Stand on Your Head to Lose Weight?
- Let y = weight loss (in pounds) and x = body
weight before exercise and headstand (in pounds).
- The author concluded that a simple linear
regression model was a reasonable way to relate y
and x. Suppose the actual model equation has α =
0, β = 0.001, and σ = 0.09.
- If the distribution of the random errors at
any fixed weight (x value) is normal, then the
variable y = weight loss is normally distributed
with
µ_y = 0 + 0.001x, σ_y = 0.09.
- What is the expected weight loss (in pounds) for
a 190-lb wrestler?
µ_y = 0.001(190) = 0.19 lb
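The calculation for the 190-lb wrestler can be checked with a few lines of Python. This is a sketch under the example's stated model values (intercept 0, slope 0.001, standard deviation 0.09). The probability of a positive weight loss is an extra illustration that the slides do not ask for, and the name mean_weight_loss is hypothetical.

```python
from statistics import NormalDist

alpha, beta, sigma = 0.0, 0.001, 0.09   # model values given in the example

def mean_weight_loss(x):
    """Height of the population regression line at body weight x (pounds)."""
    return alpha + beta * x

mu_190 = mean_weight_loss(190)
print(f"expected weight loss at 190 lb: {mu_190:.2f} lb")

# Under the normality assumption, y at x = 190 is Normal(mu_190, sigma).
y_dist = NormalDist(mu=mu_190, sigma=sigma)
p_any_loss = 1 - y_dist.cdf(0.0)        # P(y > 0): some weight is lost
print(f"P(weight loss > 0) = {p_any_loss:.3f}")
```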
10Some commonly encountered patterns in scatterplots
In practice, the judgment of whether the simple
linear regression model is appropriate must be based
on how the data were collected and on a
scatterplot of the data.
Figure (a): Pattern consistent with the simple
linear regression model
Figure (b): Pattern consistent with a nonlinear
probabilistic model
Figure (c): Pattern suggesting that variability
in y changes with x
12Estimating the Population Regression Line
- The estimated regression line is just the
least-squares line, ŷ = a + bx.
- Let x* denote a specified value of the predictor
variable x. Then a + bx* has two different
interpretations:
- It is a point estimate of the mean y value when
x = x*, and
- It is a point prediction of an individual y value
to be observed when x = x*.
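A minimal sketch of the least-squares computation follows, using the usual formulas b = S_xy / S_xx and a = ȳ - b·x̄. The data values and the function name least_squares_line are hypothetical and serve only to show the mechanics; they do not come from any example in these slides.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # estimated slope
    a = y_bar - b * x_bar    # estimated intercept
    return a, b

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(xs, ys)

# The same number serves as the point estimate of the mean y value
# and as the point prediction of a single y value at x = x_star.
x_star = 3.5
y_hat = a + b * x_star
print(f"estimated line: y-hat = {a:.3f} + {b:.3f} x")
print(f"at x* = {x_star}: {y_hat:.3f}")
```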
13Example Mother's Age and Baby's Birth Weight
- Medical researchers have noted that adolescent
females are much more likely to deliver
low-weight babies than are adult females. An
article gives the data in the table on the right
with
- x = maternal age (in years) and
- y = birth weight of baby (in grams)
- (a) Find the equation of the estimated regression
line.
- (b) Find the point estimate of the average birth
weight of babies born to 18-year-old mothers.
- (c) Predict the birth weight of a baby to be born
to a particular 18-year-old mother.
14Solution to Example Mother's Age and Baby's
Birth Weight
- (a) The equation of the estimated linear
regression line is the least-squares line,
ŷ = a + bx.
Excel solution on next slides
15The estimated regression line is just the
least-squares line, and therefore we can use
Excel (as in Chapter 5) to find the estimated
regression line: Data → Data Analysis → Regression
16Input x and y ranges. (y range comes first in the
Regression dialog box.)
17In the Excel output, we find a and b in the
Coefficients column: a = -1163.45 and b =
245.15. The estimated regression line is then
ŷ = -1163.45 + 245.15x
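With the coefficients reported in the Excel output (a = -1163.45, b = 245.15), parts (b) and (c) of the example reduce to evaluating the estimated line at x = 18. The short sketch below does that; the function name y_hat is a label chosen here for illustration.

```python
a, b = -1163.45, 245.15      # coefficients reported in the Excel output

def y_hat(x):
    """Estimated regression line: birth weight (g) at maternal age x (years)."""
    return a + b * x

estimate_18 = y_hat(18)
print(f"y-hat at x = 18: {estimate_18:.2f} g")   # prints 3249.25 g
```

The same value, about 3249 g, is both the point estimate of the average birth weight for 18-year-old mothers in part (b) and the point prediction for one particular 18-year-old mother in part (c).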
19Some Remarks about s_e
- In simple linear regression, estimation of α and
β results in a loss of 2 degrees of freedom,
leaving n - 2 as the number of degrees of freedom
for SSResid, s_e^2, and s_e.
- The coefficient of determination
r^2 = 1 - SSResid/SSTo
can be interpreted as the proportion of observed
y variation that can be explained by the model
relationship.
- s_e is the magnitude of a typical sample
deviation (residual) from the least-squares line.
The smaller the value of s_e, the closer the points
in the sample fall to the line and the better the
line does in predicting y from x.
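These quantities can be computed as sketched below, assuming the usual definitions SSResid = Σ(y - ŷ)^2, SSTo = Σ(y - ȳ)^2, s_e = sqrt(SSResid / (n - 2)), and r^2 = 1 - SSResid/SSTo. The data and the function name fit_summary are hypothetical and used only for illustration.

```python
import math

def fit_summary(xs, ys):
    """Least-squares fit together with SSResid, SSTo, s_e, and r^2."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_to = sum((y - y_bar) ** 2 for y in ys)
    s_e = math.sqrt(ss_resid / (n - 2))   # n - 2 degrees of freedom
    r2 = 1 - ss_resid / ss_to             # coefficient of determination
    return {"a": a, "b": b, "ss_resid": ss_resid,
            "ss_to": ss_to, "s_e": s_e, "r2": r2}

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
summary = fit_summary(xs, ys)
print(f"s_e = {summary['s_e']:.3f}, r^2 = {summary['r2']:.4f}")
```

For these made-up points s_e is small relative to the y values and r^2 is close to 1, which is the pattern the remarks above describe for a line that predicts well.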
20Example Woodpecker Hole Depth
- Woodpeckers are a valuable forest asset. An
article reported on a study of how woodpeckers
behaved when provided with polystyrene cylinders
as an alternative roost and nest cavity substrate
at different ambient temperatures. (See data on
next slide.)
- Let
- x = ambient temperature (°C) and
- y = cavity depth (in centimeters)
- (a) Find the estimated linear regression line.
- (b) Does the model appear to be useful for
estimation and prediction?
The scatterplot shows a negative linear
relationship between x and y.
Solution: The estimated linear regression line is
read from the Excel output (data on the next slide,
Excel output on the slide after next).
21Data for Example Woodpecker Hole Depth
22- r^2 = 0.767 indicates that 76.7% of the observed
variation in cavity depth y can be attributed to
the probabilistic linear relationship with
ambient temperature.
- The estimated standard deviation s_e = 2.33 is the
magnitude of a typical sample deviation from the
least-squares line, which is reasonably small
compared to the y values. So the model appears to be
useful.
2313.2 Inference about β (the slope of the
population regression line)
- The slope β in the simple linear regression model
is the average or expected change in y associated
with a 1-unit increase in x.
- The value of β is almost always unknown; it has
to be estimated from the slope b of the
least-squares line.
- The value of the statistic b may vary from sample
to sample, so how accurately does b estimate β?
- We need some facts about the sampling
distribution of b:
- Where is the curve centered relative to β?
- How much does the curve spread out about its
center?
24Properties of the Sampling Distribution of b
- When the four basic assumptions of the simple
linear regression model are satisfied, the
following conditions are met:
- The mean value of b is β. That is, µ_b = β, so the
sampling distribution of b is always centered at
the value of β.
- The standard deviation of the statistic b is
σ_b = σ / sqrt(S_xx), where S_xx = Σ(x - x̄)^2.
- The statistic b has a normal distribution (a
consequence of the model assumption that the
random deviation e is normally distributed).
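These properties can be checked by simulation. The sketch below repeatedly draws samples from a simple linear regression model and compares the mean and standard deviation of the fitted slopes with β and with σ / sqrt(S_xx), where S_xx = Σ(x - x̄)^2. The model values are borrowed from the headstand example; the fixed x values, the seed, and the number of repetitions are assumptions made here for illustration.

```python
import math
import random
import statistics

def slope(xs, ys):
    """Least-squares slope b = S_xy / S_xx."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

rng = random.Random(13)                  # fixed seed for repeatability
alpha, beta, sigma = 0.0, 0.001, 0.09    # values from the headstand example
xs = list(range(150, 250, 10))           # hypothetical fixed body weights (lb)

# Draw many samples from the model and record the slope of each fitted line.
slopes = []
for _ in range(5000):
    ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]
    slopes.append(slope(xs, ys))

x_bar = sum(xs) / len(xs)
s_xx = sum((x - x_bar) ** 2 for x in xs)
sigma_b = sigma / math.sqrt(s_xx)        # theoretical standard deviation of b

mean_b = statistics.mean(slopes)
sd_b = statistics.stdev(slopes)
print(f"mean of b: {mean_b:.5f}  (beta = {beta})")
print(f"sd of b:   {sd_b:.5f}  (sigma / sqrt(S_xx) = {sigma_b:.5f})")
```

Spreading the x values farther apart increases S_xx and therefore shrinks the standard deviation of b.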
25The estimated standard deviation of b
The estimated standard deviation of the statistic b is
s_b = s_e / sqrt(S_xx).
When the four basic assumptions of the simple
linear regression model are satisfied, the
probability distribution of the standardized
variable
t = (b - β) / s_b
is the t distribution with df = n - 2.
26Confidence Interval for β
- When the four basic assumptions of the simple
linear regression model are satisfied, a
confidence interval for β, the slope of the
population regression line, has the form
b ± (t critical value)(s_b)
where the t critical value is based on df = n - 2.
27Example Athletic Performance and Cardiovascular
Fitness
- Is cardiovascular fitness (as measured by time to
exhaustion from running on a treadmill) related
to an athlete's performance in a 20-km ski race?
- Let x = treadmill time to exhaustion (in
minutes) and
- y = 20-km ski time (in minutes).
- Construct a 95% confidence interval for β, the
slope of the population regression line.
- Solution: The slope β is the average change in
ski time associated with a 1-minute increase in
treadmill time.
- Assumption: The distribution of errors at
any given x is approximately normal.
- A t critical value based on df = n - 2 = 11 -
2 = 9 is 2.26 from Appendix Table 3.
Continued on next slide
28From the Excel output, b = -2.3335 and s_b =
.591. The 95% confidence interval for β is
b ± (t critical value)(s_b) = -2.3335 ±
(2.26)(.591) = -2.3335 ± 1.336 = (-3.671, -.999).
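The interval arithmetic can be reproduced as sketched below. The inputs are the rounded values quoted on the slide, and the function name slope_confidence_interval is hypothetical. Carried out without intermediate rounding, the endpoints come out near -3.669 and -0.998, within a few thousandths of the interval shown on the slide; the small gap presumably reflects rounding in the slide's inputs.

```python
def slope_confidence_interval(b, s_b, t_crit):
    """Return (lower, upper) for the interval b ± t_crit * s_b."""
    half_width = t_crit * s_b
    return b - half_width, b + half_width

# Values from the ski-time example; 2.26 is the tabled t critical
# value for 95% confidence with df = 9.
lower, upper = slope_confidence_interval(b=-2.3335, s_b=0.591, t_crit=2.26)
print(f"95% CI for beta: ({lower:.3f}, {upper:.3f})")
```

Both endpoints are negative, so the interval suggests that longer treadmill times go with shorter ski times on average.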
29Hypothesis Tests Concerning β
- Null hypothesis: H0: β = hypothesized value
- Test statistic: t = (b - hypothesized value) / s_b
(The test is based on df = n - 2.)
- Alternative hypothesis and P-value:
- Ha: β > hypothesized value: area to the right
of the computed t under the appropriate t
curve
- Ha: β < hypothesized value: area to the left of
the computed t under the appropriate t curve
- Ha: β ≠ hypothesized value: 2(area to the
right of t) if t > 0, or 2(area to the left of t)
if t < 0
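The P-value rules above can be sketched in code. Python's standard library has no t distribution, so this illustration integrates the t density numerically with Simpson's rule; that choice, the function names, and the labels "greater", "less", and "two-sided" are assumptions made here rather than anything prescribed by the slides. The usage lines apply the procedure to the ski-time values (b = -2.3335, s_b = .591, n = 11) with a hypothesized value of 0, a test the slides themselves do not carry out.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail_area(t, df, steps=2000):
    """P(T > t), using Simpson's rule on the density between 0 and |t|."""
    h = abs(t) / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3                 # area between 0 and |t|
    return 0.5 - central if t >= 0 else 0.5 + central

def slope_test(b, hypothesized, s_b, n, alternative):
    """Return (t, P-value) for a test about the slope, based on df = n - 2."""
    t = (b - hypothesized) / s_b
    df = n - 2
    if alternative == "greater":            # Ha: beta > hypothesized value
        p = upper_tail_area(t, df)
    elif alternative == "less":             # Ha: beta < hypothesized value
        p = 1 - upper_tail_area(t, df)
    else:                                   # Ha: beta != hypothesized value
        p = 2 * upper_tail_area(abs(t), df)
    return t, p

t_stat, p_val = slope_test(b=-2.3335, hypothesized=0.0, s_b=0.591,
                           n=11, alternative="two-sided")
print(f"t = {t_stat:.2f}, two-sided P-value = {p_val:.4f}")
```

Doubling the tail area beyond |t| covers both branches of the two-sided rule, because the t curve is symmetric about 0.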
30Model Utility Test for Simple Linear Regression
- The model utility test for simple linear
regression is the test of
H0: β = 0 versus Ha: β ≠ 0
- The null hypothesis specifies that there is no
useful linear relationship between x and y,
whereas the alternative hypothesis specifies that
there is a useful linear relationship between x
and y.
- If H0 is rejected, we conclude that the simple
linear regression model is useful for predicting
y.
- The test procedure in the previous box (with
hypothesized value = 0) is used to carry out the
model utility test; in particular, the test
statistic is the t ratio t = b / s_b.
31Example University Graduation Rates
- The data on the right present the six-year
graduation rate (%), student-related expenditure
per full-time student, and median SAT score for a
random sample of 15 primarily undergraduate
public universities and colleges in the US with
enrollments between 10,000 and 20,000 students.
32- Part (a) of Example University Graduation Rates
- Is there a useful linear relation between
graduation rate (y) and median SAT score (x)?
- Conduct a model utility test using α =
.05.
- Solution: By the definition of slope, β = the
true average change in y (graduation rate)
associated with an increase of 1 point in x
(median SAT score).
- H0: β = 0, Ha: β ≠ 0
- Significance level: α = 0.05.
- Assumption: Assuming that the distribution of
errors at any given x value is approximately
normal, the assumptions of the simple linear
regression model are appropriate.
Excel output on next slide
33- Excel output: b = 0.132, a = 91.31, r^2 = 0.576,
s_b = 0.031
34Solution to Part (a) of Example University
Graduation Rates
- Because r^2 = .576, about 57.6% of the
observed variation in graduation rates can be
explained by the simple linear regression model.
(The correlation coefficient is r = 0.76.) It
appears that there is a useful linear relation
between x and y, but confirmation requires a
formal model utility test.
- H0: β = 0, Ha: β ≠ 0
- Significance level: α = 0.05
- Test statistic: t = b / s_b = 0.132 / 0.031 ≈ 4.26
(from the rounded Excel values), with df = n - 2 = 13
- P-value = 2(.001) = .002 < α (Ha: β ≠ 0
requires a two-tailed test.)
- Conclusion: Since P-value < α, we reject H0. We
conclude that there is a useful linear
relationship between graduation rate and median
SAT score.
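The decision can be cross-checked from the rounded Excel values. The sketch below uses the equivalent critical-value form of the test instead of a P-value: reject H0 when |t| exceeds the tabled two-tailed t critical value. This is a swapped-in but equivalent decision rule, and the value 2.160 (df = 13, two-tailed α = .05) is taken from a standard t table, not from the slides.

```python
b, s_b, n = 0.132, 0.031, 15     # rounded values from the Excel output
alpha_level = 0.05
t_crit = 2.160                   # tabled two-tailed critical value, df = 13

t_ratio = b / s_b                # test statistic for H0: beta = 0
df = n - 2
reject_h0 = abs(t_ratio) > t_crit
print(f"t = {t_ratio:.2f}, df = {df}, reject H0: {reject_h0}")
# prints: t = 4.26, df = 13, reject H0: True
```

The t ratio is well beyond the critical value, which agrees with the conclusion reached from the P-value.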
35Exercise Part (b) of Example University
Graduation Rates
- Is there a useful linear relation between
graduation rate (y) and expenditure per full-time
student (x)?
- Let β = the true average change in y
(graduation rate, in %) associated with an
increase of $1 in x (expenditure per full-time
student).
- Conduct a model utility test using
α = .05.
Answer: P-value = .092 > α. We fail to reject H0:
β = 0. There is no convincing evidence of a
linear relationship between graduation rate and
expenditure per full-time student.