Title: Scatterplots, Correlation, and Simple Regression
1. Scatterplots, Correlation, and Simple Regression
2. Scatterplot
- A pictorial depiction of the relationship between variables.
- A two-dimensional surface on which the X and Y scores of all the objects in your study are represented, with each object's value on X and value on Y appearing as a single point.
3. Two Interests
- Slope: the magnitude of the relationship between the independent variable and the dependent variable (how much change in one yields how much change in the other).
- Correlation: the predictive power of one variable on another. This is a measure of association (but not a PRE measure of association).
4. Correlation
5. Correlation analysis asks
- How good a predictor is the independent variable of the dependent variable?
- How good a predictor of income is education?
- How accurate is our prediction of the effect of education on income?
- How close are the dots to the line?
- It is a measure of association.
6. Computing the Correlation Coefficient, Pearson's r
- Assess how much X and Y move together (covariance) out of the amount they move individually (variance). See the formula below.
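One standard way to write this, with the covariance in the numerator and the two individual variations in the denominator:

$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$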
7. Computing r in STATA
- corr var1 var2 ... var100
- The output looks like this:
. corr unemplyd mdnincm flood age65 black
(obs=427)

             | unemplyd  mdnincm    flood    age65    black
-------------+---------------------------------------------
    unemplyd |  1.0000
     mdnincm | -0.4960   1.0000
       flood |  0.0827   0.0083   1.0000
       age65 |  0.0319  -0.1634  -0.0272   1.0000
       black |  0.5037  -0.3065   0.0703  -0.1038   1.0000
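To try the command yourself, here is a minimal sketch using Stata's built-in auto dataset (an illustrative stand-in, not the slides' flood data):

. sysuse auto, clear
. corr price mpg weight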
8. Slope of the Regression Line
- We want to know the magnitude of the relationship.
- How much change in y is generated by a 1-unit change in x? The linear form below makes this concrete.
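In the bivariate linear model, the slope is the coefficient on x:

$\hat{y} = a + bx$

so a 1-unit increase in x changes the predicted y by b units, whatever the starting value of x.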
9. Where do we Draw the Line?
10. Minimize the sum of the distances between the points and the line
[Figure: five data points whose vertical distances from the line are -.25, 2, 2, -3.5, and -.25]
- Problem: they all add up to zero!
- Solution: square the distances, or take the absolute value.
11. Which Is Better?
- Ordinary Least Squares (OLS) regression minimizes the sum of squared distances.
- Desirable mathematical properties (continuity).
- Least Absolute Deviation (LAD or LAV) regression minimizes the sum of absolute distances.
- Resistant to outliers.
- We will return to this question, but the answer in most cases will be OLS, so we'll start there (the two criteria are compared below).
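Written side by side, the two criteria differ only in how they penalize the distances:

OLS: $\min_{a,b} \sum_i (y_i - a - b x_i)^2$
LAD: $\min_{a,b} \sum_i |y_i - a - b x_i|$

Squaring yields a smooth, differentiable objective (the continuity property above); absolute values grow more slowly with large errors, which is why LAD resists outliers.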
12. What do we want Regression to Do?
- Regression summarizes the relationship between two interval-level variables.
- We want to explain y.
- We could just use the mean of income: $\bar{y} = 52.43$.
- The mean just tells us, on average, what a variable does.
13. Expected Value
- We might be able to explain y better by using a different approach.
- We think the value of y we observe is conditional on the value of x.
- Take the mean of y at each value of x.
- We essentially have a frequency distribution for the values y can take on for each value of x.
14. Problem
- Even with many cases, we often do not observe multiple cases at the same value of x. No average.
15. Solution
- Rather than use an observed frequency distribution, we can treat the values y can take on as a random variable.
- We get a probability distribution for the values y can possibly take on for a given value of x.
- We can find the mean of that probability distribution (the expected value of $y_i$).
16. $E(Y \mid x_i)$
The one time we observe a given x, the y we observe is likely to be close to the mean of its probability distribution.
17. The Moral of the Story
- The regression equation gives us a line of predicted values.
- When we interpret regression, it only makes sense to interpret our predictions as $E(y \mid x_i)$, as written out below.
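Putting the pieces together in the notation used here:

$\hat{y}_i = E(y \mid x_i) = a + b x_i$

The fitted value at $x_i$ is the mean of the conditional probability distribution of y at that value of x, not a claim about any single observation.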
18. So how do we get this equation?
- Recall we said we wanted to minimize the distance between the points and the line.
[Figure: the same five points, with distances of -.25, 2, 2, -3.5, and -.25 from the line]
19. How do we Symbolize this?
[Figure: the same five points and distances]
- We want the line that minimizes the sum of squared errors. Formally:

$\min_{a,b} \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
20. How do we get this minimum?
There is a mathematical procedure that finds the respective values of a and b that minimize the sum of squared errors.
21. The Procedure is the Partial Derivative
- If you know calculus, this makes sense.
- If you don't, just know that, like subtraction gives differences/distances and addition gives sums, partial derivatives help us find minimum and maximum values.
- Just like we can say that $x_1 - x_2$ gives us the distance between two points (no matter what $x_1$ and $x_2$ are), we can use the partial derivative procedure to find the values of a and b that minimize the sum of squared errors, even if we don't know x and y.
- Those values of a and b that minimize this function (equal to the squared errors) are the regression slope and intercept.
22. Here we go
Begin with the model ($y_i = a + b x_i + e_i$, eq. 1) and the prediction equation ($\hat{y}_i = a + b x_i$, eq. 2).
Subtract $\hat{y}_i$ from both sides to define the difference between y and $\hat{y}$:
$y_i - \hat{y}_i = a + b x_i + e_i - \hat{y}_i$
Substitute eq. 2 in for $\hat{y}_i$:
$y_i - \hat{y}_i = a + b x_i + e_i - (a + b x_i)$
The $a + bx$ terms cancel out, and we confirm that the error term is the only difference between y and $\hat{y}$:
$y_i - \hat{y}_i = e_i$
We are interested in minimizing the squared errors:
$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2$
23. Now to minimize it for a
Begin with our equation:
$S = \sum (y_i - a - b x_i)^2$
Treat b as a constant and take the partial derivative with respect to a:
$\frac{\partial S}{\partial a} = \sum 2 (y_i - a - b x_i)(-1)$
Set the derivative to 0 to find the minimum:
$\sum 2 (y_i - a - b x_i)(-1) = 0$
Remove constants from the summation term:
$-2 \sum (y_i - a - b x_i) = 0$
Divide both sides by -1 and 2:
$\sum (y_i - a - b x_i) = 0$
Distribute the summation sign (note $\sum a = na$):
$\sum y_i - na - \sum b x_i = 0$
Pull the b constant out of the summation:
$\sum y_i - na - b \sum x_i = 0$
Solve for $\sum y_i$ by adding the other terms to both sides:
$\sum y_i = na + b \sum x_i$
24. Now we minimize it for b
Begin with our equation:
$S = \sum (y_i - a - b x_i)^2$
Treat a as a constant and take the partial derivative with respect to b:
$\frac{\partial S}{\partial b} = \sum 2 (y_i - a - b x_i)(-x_i)$
Set the derivative to 0 to find the minimum:
$\sum 2 (y_i - a - b x_i)(-x_i) = 0$
Remove constants from the summation term:
$2 \sum (y_i - a - b x_i)(-x_i) = 0$
Divide both sides by 2:
$\sum (y_i - a - b x_i)(-x_i) = 0$
Distribute the $-x_i$:
$\sum (-x_i y_i + a x_i + b x_i^2) = 0$
Distribute the summation sign:
$-\sum x_i y_i + \sum a x_i + \sum b x_i^2 = 0$
Pull the constants from the summation terms:
$-\sum x_i y_i + a \sum x_i + b \sum x_i^2 = 0$
Solve for $\sum x_i y_i$:
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
25. Normal Equations
- These have yielded the normal equations.
- For a: $\sum y_i = na + b \sum x_i$
- For b: $\sum x_i y_i = a \sum x_i + b \sum x_i^2$
- By solving the normal equations for a and b, we get the least squares coefficients a and b.
26. For a
Start with the normal equation:
$\sum y_i = na + b \sum x_i$
Subtract the $b \sum x_i$ term:
$\sum y_i - b \sum x_i = na$
Divide by n (does anything look familiar? These are the means of y and x):
$a = \frac{\sum y_i}{n} - b \frac{\sum x_i}{n} = \bar{y} - b \bar{x}$
Once we know b, a will be easy to get.
27. Start with the Normal Eq. for b
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
Substitute the equation in for a:
$\sum x_i y_i = (\bar{y} - b \bar{x}) \sum x_i + b \sum x_i^2$
Modify the a equation (write the means as sums over n):
$\sum x_i y_i = \frac{\sum y_i - b \sum x_i}{n} \sum x_i + b \sum x_i^2$
Distribute the $\sum x_i$ term into the a equation:
$\sum x_i y_i = \frac{\sum x_i \sum y_i}{n} - \frac{b (\sum x_i)^2}{n} + b \sum x_i^2$
Multiply everything by n:
$n \sum x_i y_i = \sum x_i \sum y_i - b (\sum x_i)^2 + n b \sum x_i^2$
Subtract the $\sum x_i \sum y_i$ term from both sides:
$n \sum x_i y_i - \sum x_i \sum y_i = n b \sum x_i^2 - b (\sum x_i)^2$
Remove b from the right-hand side (reverse distribution):
$n \sum x_i y_i - \sum x_i \sum y_i = b \left( n \sum x_i^2 - (\sum x_i)^2 \right)$
Use division to solve for b:
$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$
28. With a little more manipulation
Recall that we found:
$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b \bar{x}$
These are the least squares estimators for a and b. Let's give these equations a shot.
29. (No transcript: figure slide)
30. From the data: $\sum (x_i - \bar{x})(y_i - \bar{y}) = 54.73$ and $\sum (x_i - \bar{x})^2 = 4.19$, so $b = 54.73 / 4.19 = 13.06$.
31. Calculating a and b
- b = 54.73 / 4.19; b = 13.06
- $a = \bar{y} - b\bar{x}$
- a = 235.7 - b(11.2)
- = 235.7 - 13.06(11.2)
- = 235.7 - 146.27
- = 89.43
32. Remember: e is the difference between y and $\hat{y}$. It is not directly in the prediction equation, $\hat{y} = a + bx$.
33. Now, how to do this in STATA
- Type: regress robberies poverty

. regress robberies poverty

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341           Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002           R-squared     =  0.1283
-------------+------------------------------           Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636           Root MSE      =   23.23

------------------------------------------------------------------------------
   robberies |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   13.06206   11.35007     1.15   0.279    -12.61358    38.73771
       _cons |   89.40434   127.4166     0.70   0.501    -198.8321    377.6408
------------------------------------------------------------------------------
34. What does the line mean?
- The regression line gives our predicted value of y for every value of x.
- Or, we can get STATA to give us a predicted value of y for every value of x we observe.
- It's easy: after your regression, just type
- predict newvar
- If we type predict yhat after running the regression of robberies on poverty, we get a new variable named yhat that contains the predicted value of y for each observation (see the sketch below).
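Here is that workflow end to end on Stata's built-in auto data (the variable name fit is an arbitrary choice for illustration):

. sysuse auto, clear
. regress price mpg
. predict fit                 // fit holds the predicted price for each car
. list price mpg fit in 1/5

After predict, fit plays exactly the role yhat plays in the robberies example.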
35. Predicted Values (Stata list output)
       year  poverty  robberies      yhat
1 1984 11.6 205 240.9243
2 1985 11.4 209 238.3118
3 1986 10.9 225 231.7808
4 1987 10.7 213 229.1684
5 1988 10.4 221 225.2498
6 1989 10.3 233 223.9436
7 1990 10.7 257 229.1684
8 1991 11.5 273 239.6181
9 1992 11.9 264 244.8429
10 1993 12.3 256 250.0677
11 1994 11.6 238 240.9243
36. Then we can create a scatterplot that shows both the observed values and the predicted values (aka fitted values).
- Type: scatter yhat robberies poverty, c(l)
37. Getting Predictions by Hand
- For 1984, X = 11.6, and the regression equation gives:
- $\hat{Y}$ = 89.43 + (13.06 × 11.6)
- $\hat{Y}$ = 89.43 + 151.50 = 240.93 (the yhat table shows 240.9243, using the unrounded coefficients)
- For 1993, X = 12.3 (% of families below poverty). What value would we predict for Y (robberies per 100,000)?
- $\hat{Y}$ = 89.43 + (13.06 × 12.3)
- $\hat{Y}$ = 89.43 + 160.64 = 250.07
38. What You Should Know and Be Able to Do
- Explain the least squares principle
- Calculate a and b by hand
- Calculate predicted values by hand
- Use STATA's -regress- command to estimate a and b
- Use STATA's -predict- command to get predicted values for each observed value of y
- Graph the predicted values using -scatter-
39. We Have the Best-Fitting Line
- Given certain assumptions, which we will discuss shortly, the Ordinary Least Squares estimator is the Best Linear Unbiased Estimator: OLS is BLUE.
- It may be the best-fitting line, but is it a good-fitting line?
- How close are the dots to the line?
- We need a measure of association (like correlation).
40. How Good Does It Get?
[Figure: two scatterplots fit by the same line]
In both instances, the regression equation is the same. How are these results different from each other?
41. Two Ways to Analyze Fit
- Root Mean Square Error (aka RMSE or standard error of the regression)
- Proportional Reduction of Error (aka PRE or r²)
- Both are based on $\sum e_i^2$
42. Terminology
- We may call $\sum (y_i - \bar{y})^2$ the Total Sum of Squares (TSS)
- We call $\sum (y_i - \hat{y}_i)^2$ the Residual Sum of Squares (RSS)
- We call TSS - RSS the Regression Sum of Squares (RegSS)
43. Root Mean Square Error
- We want to answer the question: on average, how far is each observed y from the regression line?
- Similar to the question answered by measures of dispersion: on average, how far is each observed y from the mean of y?
- We essentially compute a standard deviation for how far the observed points are from the line, as written out below.
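The formula mirrors the standard deviation, except the deviations are taken around the line and the divisor is the residual degrees of freedom (n - 2 with one independent variable, which is what Stata's Root MSE uses):

$RMSE = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$

Check against the robberies output: $\sqrt{4856.904 / 9} \approx 23.23$, the reported Root MSE.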
44. RMSE Tells Us
- In units of the dependent variable, how much dispersion there is around the regression line.
- Advantages:
- In units of the D.V.
- Doesn't inflate with added variables
- Compares within models with the same D.V.
- Disadvantage:
- No standard scale for what is a good fit
- Can't compare across different D.V.s
45. Proportional Reduction of Error (P.R.E.)
- Before we knew the regression relationship, our best prediction of y was the mean of y.
- If you run a regression with no independent variable, you get a constant which is equal to the mean.
- How big are the residuals using the mean alone?
- Compare that with the residuals using the regression.
46. P.R.E.: Proportional Reduction of Error
- The error we make in prediction knowing nothing about the independent variable is the distance between each observed value and the mean: $\sum (y_i - \bar{y})^2$.
- The error we make in prediction knowing the independent variable is the distance between each observed value and the regression line (its predicted value): $\sum (y_i - \hat{y}_i)^2$. Comparing the two gives the PRE measure below.
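In the TSS/RSS terminology from above, the proportional reduction of error is:

$r^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

Check against the robberies output: $714.732 / 5571.636 \approx 0.1283$, the reported R-squared.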
47. Also Notable About r²
- r is the correlation coefficient.
- $r^2$ is a measure of association for regression. It gives the proportional reduction of error from using the regression.
- $r^2$ gives the percentage of the variation in y that is explained.
- In the bivariate regression context, squaring the correlation coefficient gives $r^2$ (and vice versa: $\sqrt{r^2} = |r|$).
48. Final Notes on r²
- Adding useless variables to the model will still make $r^2$ go up. This makes it look like adding more and more variables makes the model better and better.
- For this reason, $r^2$ is not the favorite measure of fit.
- An alternative, adjusted $r^2$, accounts for the number of variables in the model and adjusts your $r^2$ accordingly, as shown below.
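The usual adjustment (the one Stata reports), with k the number of independent variables:

$\bar{r}^2 = 1 - \frac{RSS / (n - k - 1)}{TSS / (n - 1)}$

Check against the robberies output: $1 - (4856.904/9) / (5571.636/10) \approx 0.0314$, the reported Adj R-squared.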
49. Calculating RMSE and r²
- You can calculate RMSE and $r^2$ by hand.
- We've already computed $\sum (y_i - \bar{y})^2$ before.
- To get $\sum (y_i - \hat{y}_i)^2$, you just calculate each predicted value of y by hand, and then subtract.
- Then plug and chug.
- You can get the predicted values of y using STATA's predict command. You can get the difference between y and the mean of y in STATA too; one way to do it is sketched below.
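A minimal sketch of that by-hand computation in Stata, again on the built-in auto data (the names e2 and dev2 are arbitrary choices for illustration):

. sysuse auto, clear
. regress price mpg
. predict e, resid                   // residual: y minus yhat
. gen e2 = e^2                       // squared residuals, for RSS
. quietly summarize price
. gen dev2 = (price - r(mean))^2     // squared deviations from the mean, for TSS
. quietly summarize e2
. display "RSS = " r(sum)
. quietly summarize dev2
. display "TSS = " r(sum)

From there, RMSE = sqrt(RSS/(n-2)) and r² = 1 - RSS/TSS follow by plug and chug.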
50. Calculating RMSE and r²
- You can get the working parts from the STATA output:

. regress robberies poverty

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341           Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002           R-squared     =  0.1283
-------------+------------------------------           Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636           Root MSE      =   23.23

------------------------------------------------------------------------------
   robberies |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   13.06206   11.35007     1.15   0.279    -12.61358    38.73771
       _cons |   89.40434   127.4166     0.70   0.501    -198.8321    377.6408
------------------------------------------------------------------------------

RegSS = Model SS = 714.732341
RSS = Residual SS = 4856.90402
TSS = Total SS = 5571.63636
51. Calculating RMSE and r²
- Or, STATA just gives them to you: R-squared (0.1283) and Root MSE (23.23) are reported in the upper-right block of the same output.

. regress robberies poverty

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341           Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002           R-squared     =  0.1283
-------------+------------------------------           Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636           Root MSE      =   23.23