Title: Correlation
1. The correlation coefficient, r, tells us about
the strength (scatter) and direction of the linear
relationship between two quantitative variables.
In addition, we would like to have a numerical
description (a model) of how both variables vary
together. For instance, is one variable
increasing faster than the other? We would also
like to make predictions based on that
numerical description. The relationship above
looks linear...
But which line best describes our data?
2. The regression line
- The least-squares regression line is the unique
line such that the sum of the squares of the
vertical distances of the data points to the line
is the smallest possible.
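The least-squares idea can be checked numerically. The sketch below uses a small made-up data set (not from the slides) and one common closed form for the coefficients: slope = r * sd(y) / sd(x), and intercept = mean(y) - slope * mean(x). It then compares these with the output of lm. The helper name sse is an assumption introduced only for this illustration.

```r
# Illustrative data only (not from the slides).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

# Closed-form least-squares estimates.
b1 <- cor(x, y) * sd(y) / sd(x)   # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Sum of squared vertical distances from the points to a given line.
sse <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

fit <- lm(y ~ x)
ls_coef <- unname(coef(fit))      # c(intercept, slope) as found by lm

sse_best  <- sse(b0, b1)          # the least-squares line
sse_other <- sse(b0 + 0.5, b1)    # a shifted line, for comparison
```

Here ls_coef should agree with c(b0, b1), and sse_best should be smaller than the value for any other line you try.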
3. These equations are available in R through
the function lm(y ~ x) ("lm" stands for "linear
model"). Try lm on the manatee data (manatee.csv).
4. The equation completely describes the regression
line. To plot the regression line you only need
to choose two x values, put them into the
prediction equation, calculate y, and draw the
line that goes through those two points... or let
R do it for you with the abline function:
abline(lm(y ~ x)). Hint: The regression line
always passes through the point (mean of x, mean of y).
The points you use for drawing the regression
line are computed from the equation.
$0.125 \times 450 - 41.4 = 14.85$ and $0.125 \times 700 - 41.4 = 46.1$.
So plot the points (450, 14.85) and (700, 46.1).
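The same arithmetic can be done in R. This is a minimal sketch using only the slope and intercept quoted above. The variable names are assumptions chosen for illustration.

```r
# Coefficients as given on the slide.
b0 <- -41.4    # intercept
b1 <- 0.125    # slope

# Two x values chosen for drawing the line.
x_pts <- c(450, 700)
y_pts <- b0 + b1 * x_pts

print(data.frame(x = x_pts, y = y_pts))
```

On an existing scatterplot, lines(x_pts, y_pts) would then draw the segment through these two points.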
5. The distinction between explanatory and response
variables is crucial in regression. If you
exchange y for x in calculating the regression
line, you will get a different line. Regression
examines the distance of all points from the line
in the y direction only.
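A quick way to see this is to fit both regressions and compare them. The sketch below uses made-up data (not from the slides). It also checks a standard identity: the product of the two fitted slopes equals the squared correlation, so the two lines coincide only when r is exactly 1 or -1.

```r
# Illustrative data only (not from the slides).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.5, 3.7, 2.9, 5.8, 4.6, 7.9, 6.2, 8.8)

fit_yx <- lm(y ~ x)   # y is the response: vertical distances are minimized
fit_xy <- lm(x ~ y)   # x is the response: horizontal distances are minimized

slope_yx <- unname(coef(fit_yx)[2])   # slope of y on x
slope_xy <- unname(coef(fit_xy)[2])   # slope of x on y

# Drawn in the same (x, y) plane, the second line has slope 1 / slope_xy.
slope_xy_in_xy_plane <- 1 / slope_xy

r <- cor(x, y)
```

Comparing slope_yx with slope_xy_in_xy_plane shows two different lines through the same point of means.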
6. [Figure: manatee deaths plotted against powerboats registered (in 1000s)]
There is a positive linear relationship between
the number of powerboats registered and the
number of manatee deaths.
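The least-squares regression line has the
equation $\hat{y} = 0.125x - 41.4$, where x is the
number of powerboats registered (in 1000s) and
$\hat{y}$ is the predicted number of manatee deaths.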
Thus if we were to limit the number of powerboat
registrations to 500,000 (x = 500), how many
manatee deaths could we expect?
Roughly 21 manatees ($0.125 \times 500 - 41.4 = 21.1$).
Do this in R using the predict function (see help(predict)).
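A minimal sketch of the predict call follows. The layout of manatee.csv is not shown on the slides, so the code uses placeholder data and hypothetical column names (boats, deaths). The placeholder values are constructed to scatter around the line quoted earlier, so the fit reproduces it. They are not the real manatee counts.

```r
# Placeholder data, NOT the contents of manatee.csv: six points scattered
# around the slide's line so that the least-squares fit reproduces it.
boats  <- c(450, 500, 550, 600, 650, 700)            # registrations, in 1000s
deaths <- 0.125 * boats - 41.4 + c(1, -2, 1, 1, -2, 1)
toy <- data.frame(boats = boats, deaths = deaths)

fit <- lm(deaths ~ boats, data = toy)

# predict() takes new explanatory values in a data frame whose column
# name matches the one used in the model formula.
pred_500 <- unname(predict(fit, newdata = data.frame(boats = 500)))
print(pred_500)
```

With the real data, replace toy with the data frame read from manatee.csv and use that file's actual column names in the formula and in newdata.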
7.
- The least-squares regression line of y on x is
the line that minimizes the sum of the squares of
the vertical distances of the data points to the
line.
- The equation of the least-squares line is usually
written as $\hat{y} = b_0 + b_1 x$, where
  - $\hat{y}$ is the predicted value of y
  - $b_0$ is the intercept (the predicted value of y when x = 0)
  - $b_1$ is the slope of the prediction line
- The correlation coefficient, r, is related to the
least-squares regression line as follows: the square of r
($r^2$) is equal to the fraction of the variation in
the values of the response variable y that is
explained by the least-squares regression of y on
x. (See next slide.)
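The relationship between r and the explained variation can be verified directly. The sketch below uses made-up data (not from the slides) and computes the fraction in three ways: from cor, from the sums of squares, and from summary of the fitted model. The variable names are assumptions for illustration.

```r
# Illustrative data only (not from the slides).
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.5, 3.7, 2.9, 5.8, 4.6, 7.9, 6.2, 8.8)

fit <- lm(y ~ x)

ss_total    <- sum((y - mean(y))^2)   # total variation in y
ss_residual <- sum(resid(fit)^2)      # variation left unexplained by the line

r_squared_from_cor <- cor(x, y)^2
r_squared_from_ss  <- 1 - ss_residual / ss_total
r_squared_from_lm  <- summary(fit)$r.squared
```

All three quantities should agree up to rounding error.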
8. Here are two plots of height (response) against
age (explanatory) of some children. Notice how
$r^2$ relates to the variation in heights...
$r = 0.994$, $r^2 = 0.988$
$r = 0.921$, $r^2 = 0.848$
9. Homework
- Read pages 8-10 in the Reading Problems 2.1 on
Linear Regression. Note the R functions used here:
  - model1 <- lm(y ~ x)
  - plot(x, y); abline(model1)
  - plot(model1)
  - coef(model1); resid(model1); fitted(model1)
  - plot(fitted(model1), resid(model1))
- Read at least one of the online sources for
simple linear regression (I like the second one):
  - http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
  - http://www.statisticalpractice.com/
  - http://onlinestatbook.com/rvls/
  - http://www.sportsci.org/resource/stats/index.html
10. Homework (cont.)
- FPG (mg/ml): fasting plasma glucose (measured at
home). HbA: measured in the doctor's office. Can
you predict FPG from HbA? Plot the data, compute the
correlation coefficient, compute and plot the
regression line, and get a residual plot. Are
there any unusual cases? Influential points?
Outliers?