Chapter 4 Describing the Relation Between Two Variables
When collecting data, sometimes we measure more than one variable on each individual. Bivariate data is data consisting of the values of two different response variables that are obtained from the same population element. There are three possible combinations of variables that may be measured:
- Both variables are qualitative
- One variable is qualitative and the other is quantitative
- Both variables are quantitative

When two values are measured on each population unit, we denote the data as ordered pairs (x, y). In some examples we will see that x is the input variable and y is the response variable.
Both Variables are Qualitative

We often arrange the data in a cross-tabulation or contingency table. These tables count all possible combinations of the levels of the two variables. After counting, we can also present the table with percentages listed in each of the respective categories. Below is an example of a contingency table.
One Qualitative Variable and One Quantitative Variable

In this case, we separate out our results and group them according to the qualitative variable. So essentially we have separate samples, labeled by the qualitative variable. We can use this information to draw dot-plots and boxplots, and to compute five-number summaries, means, and standard deviations.
Two Quantitative Variables

In many statistical problems we are given data that consist of a pair with an input / explanatory / independent x-variable and an output / response / dependent y-variable. We often want to know if there is a linear relationship between these variables. Chapter 4 deals with this topic.
Section 4.1 Scatter Diagrams and Correlation

When we have two quantitative variables, we ask whether we can use the value of one variable to predict the value of the other variable.

Suppose I am interested in knowing if the amount of time spent studying for exam one is related to the score I receive on exam one. Then the time spent studying is the predictor variable and the score on the exam is the response variable. The response variable is the variable whose value can be explained by, or determined by, the value of the predictor variable.
The first step in identifying if there is a relationship between the variables is to draw a picture. A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The predictor variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis. Do not connect the points when drawing a scatter diagram.
The correlation coefficient is a measure of the strength of the linear relation between two quantitative variables. We use ρ (rho) to represent the population correlation coefficient and r to represent the sample correlation coefficient. We will only discuss in detail the sample correlation coefficient, r:

r = Σ(x − x̄)(y − ȳ) / [(n − 1) · sx · sy]

or equivalently,

r = [Σxy − (Σx)(Σy)/n] / √{[Σx² − (Σx)²/n] · [Σy² − (Σy)²/n]}
Using the previous data set, we can check whether there is a linear relationship between x and y. Calculate r.

   x      y      xy     x²      y²
   2     1.4     2.8      4     1.96
   4     1.8     7.2     16     3.24
   8     2.1    16.8     64     4.41
   8     2.3    18.4     64     5.29
   9     2.6    23.4     81     6.76
Total:
  31    10.2    68.6    229    21.66      (n = 5)
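This computation can be checked with a short Python sketch; the x- and y-columns below are the ones reconstructed from the table's totals:

```python
import math

# x- and y-columns reconstructed from the table's totals
x = [2, 4, 8, 8, 9]
y = [1.4, 1.8, 2.1, 2.3, 2.6]
n = len(x)

sum_x = sum(x)                                 # 31
sum_y = sum(y)                                 # 10.2
sum_xy = sum(a * b for a, b in zip(x, y))      # 68.6
sum_x2 = sum(a * a for a in x)                 # 229
sum_y2 = sum(b * b for b in y)                 # 21.66

# Computational formula for the sample correlation coefficient r
num = sum_xy - sum_x * sum_y / n
den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = num / den
print(round(r, 3))  # 0.957
```

An r this close to 1 indicates a strong positive linear relation between x and y.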
Properties of the Correlation Coefficient
- The correlation coefficient is always between −1 and 1, inclusive. That is, −1 ≤ r ≤ 1.
- If r = 1, there is a perfect positive linear relation between the two variables.
- If r = −1, there is a perfect negative linear relation between the two variables.
- The closer r is to 1, the stronger the evidence of positive association between the two variables.
- The closer r is to −1, the stronger the evidence of negative association between the two variables.
- If r is close to 0, there is evidence of no linear relation between the two variables. Because the correlation coefficient is a measure of the strength of linear relation, r close to 0 does not imply no relation, just no linear relation.
- The correlation coefficient is a unitless measure of association, so the units of measure for x and y play no role in the interpretation of r.
- Linear correlation does not mean causation. CORRELATION DOES NOT IMPLY CAUSATION! Only designed experiments can establish causation.
- r is sensitive to an extreme data point (recall z-scores).
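Two of these properties are easy to see numerically. The sketch below (with made-up data, for illustration only) shows that rescaling x leaves r unchanged, while a single extreme point drags r toward zero:

```python
def corr(x, y):
    # Sample correlation via the deviation form of the formula
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical, nearly linear data
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.9]

r_orig = corr(x, y)                       # strong positive correlation
r_scaled = corr([60 * a for a in x], y)   # unitless: same r after x -> 60x
r_outlier = corr(x + [6], y + [0.0])      # one extreme point weakens r
print(r_orig, r_scaled, r_outlier)
```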
Let's take a look at some scatter diagrams and see what we think about the linear association between two variables.

r = 1: perfect positive correlation
r = 0.9: strong positive correlation
r = 0.4: weak positive correlation
r = −1: perfect negative correlation
r = −0.9: strong negative correlation
r = −0.4: weak negative correlation
r close to zero: no linear correlation
Section 4.2 Least-Squares Regression

We have looked at scatter diagrams and correlation. We now know how to find the strength of the linear association between x and y. If the data show a linear relationship between x and y, can we find an equation to represent this relationship? We want to find the line that "best" describes the relation between the two variables. What does "best" mean?

The line that best describes the relation between two variables is the one that makes the residuals/errors as small as possible. The difference between the observed value of y and the predicted value of y is the residual/error:

Residual = e = y − ŷ
The most popular technique for making the residuals as small as possible is the method of least squares.

The Least-Squares Regression Criterion: the least-squares regression line is the one that minimizes the sum of the squared errors. It is the line that minimizes the squares of the vertical distances between the observed values of y and those predicted by the line, ŷ.
Example: Set the cruise control on your car at 50 mph and let y = distance and x = time.

y = mx + b   or, in regression notation,   y = b0 + b1x
Each individual point can be represented as y = β0 + β1x + ε, where ε represents the error/residual.

Note: not every point falls on the line; ε accounts for this.
Example: Let x = the number of cars that fit into the garage, and let y = the cost of the house (in thousands of dollars). Find the least-squares regression line for the following data.

x:   0    1    1    2
y:  80  140  180  220
We know the line that minimizes the sum of the squared residuals is ŷ = b0 + b1x, but how do we determine the values of b0 and b1?

Equations for b1 and b0:

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²  =  [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n]

b0 = ȳ − b1x̄

b1 is the slope of the least-squares regression line. This value represents the expected increase in y for a one-unit increase in x. b0 is the y-intercept of the least-squares regression line. This represents the expected value of y when x equals zero.
b0 may or may not have a practical interpretation. It will only have meaning if the following two conditions are met:
- A value of zero for the predictor variable makes sense.
- There are observed values of the predictor variable near zero.

Extrapolation vs. Interpolation: The second condition above is especially important because statisticians do not use the regression model to make predictions outside the scope of the model (extrapolation). In other words, statisticians do not recommend using the regression model to make predictions for values of the predictor variable that are much larger or much smaller than those observed, because we cannot be certain of the behavior of the data for which we have no observations. Interpolation is making predictions within the scope of our data.
Calculating b0 and b1 for the above example (n = 4):

   x      y     xy     x²
   0     80      0      0
   1    140    140      1
   1    180    180      1
   2    220    440      4
Total:
   4    620    760      6

b1 = [760 − (4)(620)/4] / [6 − 4²/4] = 140/2 = 70
b0 = 620/4 − 70 · (4/4) = 155 − 70 = 85
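The hand calculation can be double-checked with a short Python sketch (the y-values 80, 140, 180, 220 are the ones reconstructed from the slide's totals and residual table):

```python
# Garage example: x = cars the garage holds, y = house cost ($1000s);
# y-values reconstructed from the slide's totals and residual table
x = [0, 1, 1, 2]
y = [80, 140, 180, 220]
n = len(x)

sum_x, sum_y = sum(x), sum(y)              # 4, 620
sum_xy = sum(a * b for a, b in zip(x, y))  # 760
sum_x2 = sum(a * a for a in x)             # 6

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # slope
b0 = sum_y / n - b1 * (sum_x / n)                              # intercept
print(b1, b0)  # 70.0 85.0
```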
Least-squares regression line: ŷ = 85 + 70x

Calculate each residual and the sum of the squared residuals:

   y      ŷ     e = y − ŷ     e²
  80     85        −5         25
 140    155       −15        225
 180    155        25        625
 220    225        −5         25
Total:                       900

This is the smallest possible value for the sum of squares of the residuals.
What if we tried another model? Say, ŷ = 80 + 75x:

   y      ŷ     e = y − ŷ     e²
  80     80         0          0
 140    155       −15        225
 180    155        25        625
 220    230       −10        100
Total:                       950

Least-squares regression line, ŷ = 85 + 70x:

   y      ŷ     e = y − ŷ     e²
  80     85        −5         25
 140    155       −15        225
 180    155        25        625
 220    225        −5         25
Total:                       900

950 > 900, so the alternative line has a larger sum of squared residuals.
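A small sketch comparing the two candidate lines confirms that the least-squares line attains the smaller sum of squared residuals:

```python
# Garage example data (reconstructed from the slide's tables)
x = [0, 1, 1, 2]
y = [80, 140, 180, 220]

def sse(b0, b1):
    # Sum of squared residuals for the line yhat = b0 + b1 * x
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(sse(85, 70))  # 900 -- the least-squares line
print(sse(80, 75))  # 950 -- the alternative line does worse
```

Any other choice of b0 and b1 gives a sum of squared residuals of at least 900.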
Now that we have the best line in terms of minimizing the sum of the squares of the errors, we want to determine if this line is any "good" at describing the relation between x and y. How much of the variability in the output, y, can be attributed to the input, x?

Section 4.3 Diagnostics on the Least-Squares Regression Line
The coefficient of determination, R², measures the percentage of total variation in the response variable that is explained by the least-squares regression line.

The coefficient of determination is a number between 0 and 1. If R² = 0, the least-squares regression line has no explanatory power. If R² = 1, the least-squares regression line explains 100% of the variation in the response variable.
Deviations:
- The deviation between the observed value of the response variable, y, and the mean value of the response variable, ȳ, is called the total deviation: total deviation = y − ȳ.
- The deviation between the predicted value of the response variable, ŷ, and the mean value of the response variable, ȳ, is called the explained deviation: explained deviation = ŷ − ȳ.
- The deviation between the observed value of the response variable, y, and the predicted value of the response variable, ŷ, is called the unexplained deviation: unexplained deviation = y − ŷ.

Total Deviation = Unexplained Deviation + Explained Deviation
(y − ȳ) = (y − ŷ) + (ŷ − ȳ)
Let us look at this in terms of our example.
It is also true that: Total Variation = Unexplained Variation + Explained Variation. (Note: variation = sum of (deviations)².) In other words,

SSTO = SSE + SSReg

where SSTO (also written SSyy) is the total sum of squares, SSE is the sum of squared errors, and SSReg is the regression sum of squares.
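The identity can be verified numerically for the garage example (using the data values reconstructed from the slide's tables):

```python
# Garage example data (reconstructed from the slide's tables)
x = [0, 1, 1, 2]
y = [80, 140, 180, 220]
n = len(y)

ybar = sum(y) / n                     # 155.0
yhat = [85 + 70 * xi for xi in x]     # least-squares predictions

ssto = sum((yi - ybar) ** 2 for yi in y)              # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained
ssreg = sum((yh - ybar) ** 2 for yh in yhat)          # explained
print(ssto, sse, ssreg)  # 10700.0 900 9800.0
```

Indeed, 10700 = 900 + 9800.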
Knowing this information, we can find the value of R², the coefficient of determination, in three equivalent ways:

R² = Explained Variation / Total Variation = SSReg / SSTO
R² = 1 − SSE / SSTO
R² = r²  (the square of the sample correlation coefficient)

Caution: squaring the linear correlation coefficient to obtain the coefficient of determination works only for least squares with the simple linear regression model. It does not work in general.
In the example above, how much of the variability in the cost of the house can be attributed to the number of cars that fit into the garage? Using SSTO = 10700 and SSE = 900, R² = 1 − 900/10700 ≈ 0.916, so about 91.6% of the variability in house cost is explained by the least-squares regression on garage size.
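A short sketch answering the closing question (again using the reconstructed garage data):

```python
# Garage example data (reconstructed from the slide's tables)
x = [0, 1, 1, 2]
y = [80, 140, 180, 220]
n = len(y)

ybar = sum(y) / n
yhat = [85 + 70 * xi for xi in x]     # least-squares line yhat = 85 + 70x

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
ssto = sum((yi - ybar) ** 2 for yi in y)
r_sq = 1 - sse / ssto
print(round(r_sq, 3))  # 0.916 -> about 91.6% of the variability explained
```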