Title: Linear Regression
1Linear Regression
- Least Squares Method
- the Meaning of r2
2We are given the following ordered pairs
(1.2,1), (1.3,1.6), (1.7,2.7), (2,2), (3,1.8),
(3,3), (3.8,3.3), (4,4.2). They are shown in the
scatterplot below
3Now we show the line . .
This is the mean of the y values.
4The line segments show the deviations from the
mean for each data point.
5The squares of the deviations are shown
geometrically. Squaring has the consequence of
making each difference positive. The greater the
variation in y, the larger are the squares. If
the y values are close together, the squares will
be small.
6This is the geometric representation of the sum
of the squares from the previous slide.
7Now the best fit line is shown.
8The directed distance is called the
residual. For each point this is the difference
between the actual y value and the predicted y
value. As with deviations, some residuals are
positive, some are negative. Together they add
to zero.
9This graph gives a geometric representation of
the squares of the residuals. As with the
squares of the deviations this produces all
positive quantities.
10This is the geometric representation of the sum
of the squares of the residuals. This quantity
is minimized in the least squares method of
linear regression. We use the line that produces
the smallest sum of the squares of the residuals.
11We now see both the squares of the deviations
from the mean (green squares) and squares of
residuals (red squares).
12This geometric representation of the sum of the
squares of the residuals (in red) shows that this
quantity is a portion of the total sum of the
squares of the deviations (imagine the entire
green square). All of the variation in y is
represented by the larger green square, and the
part that is not explainable by the regression
equation is in red.
The green square is called SST, the sum of the
squares, total, about the mean. The red square
is called SSE, the sum of the squares of the
error about the line.
SST
SSE
13If we now measure the quantities SSE and SST, we
can make a useful calculation, the coefficient of
determination, or r2. Recall that SST is the
total sum of the squares of the deviations about
the mean value of y, and SSE is the sum of the
squares of the error (residuals) about the line.
n.b. This r2 is the square of the correlation
coefficient.
14In our example, SST7.6 and SSE 2.44.
Therefore,
This means that 68 of the variation in y is
explained by the regression line. The meaning of
r2 is extremely important in Statistics.