Title: Scatterplots, Correlation, and Simple Regression
1. Scatterplots, Correlation, and Simple Regression
2. Scatterplot
- A pictorial depiction of the relationship between variables.
- A two-dimensional surface on which the X and Y scores of all the objects in your study are represented, with each object's value on X and value on Y appearing as a single point.
3. Two Interests
- Slope: the magnitude of the relationship between the independent variable and the dependent variable (how much change in one yields how much change in the other).
- Correlation: the predictive power of one variable on another. This is a measure of association (but not a PRE measure of association).
4. Correlation
5. Correlation analysis asks
- How good a predictor is the independent variable of the dependent variable?
- How good a predictor of income is education?
- How accurate is our prediction of the effect of education on income?
- How close are the dots to the line?
- It is a measure of association.
6. Computing the Correlation Coefficient, Pearson's r
- Assess how much X and Y move together (covariance) out of the amount they move individually (variance). See the formula below.
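One standard way to write this, with the covariance in the numerator and the two individual variations in the denominator:

$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$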
7. Computing r in STATA
- corr var1 var2 ... var100
- The output looks like this:
. corr unemplyd mdnincm flood age65 black
(obs=427)

             | unemplyd  mdnincm    flood    age65    black
-------------+---------------------------------------------
    unemplyd |  1.0000
     mdnincm | -0.4960   1.0000
       flood |  0.0827   0.0083   1.0000
       age65 |  0.0319  -0.1634  -0.0272   1.0000
       black |  0.5037  -0.3065   0.0703  -0.1038   1.0000
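To try the command yourself, here is a minimal sketch using Stata's built-in auto dataset (an illustrative stand-in, not the slides' flood data):

. sysuse auto, clear
. corr price mpg weight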
8. Slope of the Regression Line
- We want to know the magnitude of the relationship.
- How much change in y is generated by a 1-unit change in x? The linear form below makes this concrete.
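In the bivariate linear model, the slope is the coefficient on x:

$\hat{y} = a + bx$

so a 1-unit increase in x changes the predicted y by b units, whatever the starting value of x.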
9. Where do we Draw the Line?
10. Minimize the sum of the distances between the points and the line
[Figure: five data points whose vertical distances from the line are -.25, 2, 2, -3.5, and -.25]
- Problem: they all add up to zero!
- Solution: square the distances, or take the absolute value.
11. Which Is Better?
- Ordinary Least Squares (OLS) regression minimizes the sum of squared distances.
- Desirable mathematical properties (continuity).
- Least Absolute Deviation (LAD or LAV) regression minimizes the sum of absolute distances.
- Resistant to outliers.
- We will return to this question, but the answer in most cases will be OLS, so we'll start there (the two criteria are compared below).
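Written side by side, the two criteria differ only in how they penalize the distances:

OLS: $\min_{a,b} \sum_i (y_i - a - b x_i)^2$
LAD: $\min_{a,b} \sum_i |y_i - a - b x_i|$

Squaring yields a smooth, differentiable objective (the continuity property above); absolute values grow more slowly with large errors, which is why LAD resists outliers.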
12. What do we want Regression to Do?
- Regression summarizes the relationship between two interval-level variables.
- We want to explain y.
- We could just use the mean of income: $\bar{y} = 52.43$.
- The mean just tells us, on average, what a variable does.
13. Expected Value
- We might be able to explain y better by using a different approach.
- We think the value of y we observe is conditional on the value of x.
- Take the mean of y at each value of x.
- We essentially have a frequency distribution for the values y can take on for each value of x.
14. Problem
- Even with many cases, we often do not observe multiple cases at the same value of x. No average.
15. Solution
- Rather than use an observed frequency distribution, we can treat the values y can take on as a random variable.
- We get a probability distribution for the values y can possibly take on for a given value of x.
- We can find the mean of that probability distribution (the expected value of $y_i$).
16. $E(Y \mid x_i)$
The one time we observe a given x, the y we observe is likely to be close to the mean of its probability distribution.
17. The Moral of the Story
- The regression equation gives us a line of predicted values.
- When we interpret regression, it only makes sense to interpret our predictions as $E(y \mid x_i)$, as written out below.
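Putting the pieces together in the notation used here:

$\hat{y}_i = E(y \mid x_i) = a + b x_i$

The fitted value at $x_i$ is the mean of the conditional probability distribution of y at that value of x, not a claim about any single observation.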
18. So how do we get this equation?
- Recall we said we wanted to minimize the distance between the points and the line.
[Figure: the same five points, with distances of -.25, 2, 2, -3.5, and -.25 from the line]
19. How do we Symbolize this?
[Figure: the same five points and distances]
- We want the line that minimizes the sum of squared errors. Formally:

$\min_{a,b} \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
20. How do we get this minimum?
There is a mathematical procedure that finds the respective values of a and b that minimize the sum of squared errors.
21. The Procedure is the Partial Derivative
- If you know calculus, this makes sense.
- If you don't, just know that, like subtraction gives differences/distances and addition gives sums, partial derivatives help us find minimum and maximum values.
- Just like we can say that $x_1 - x_2$ gives us the distance between two points (no matter what $x_1$ and $x_2$ are), we can use the partial derivative procedure to find the values of a and b that minimize the sum of squared errors, even if we don't know x and y.
- Those values of a and b that minimize this function (equal to the squared errors) are the regression slope and intercept.
22. Here we go
Begin with the model ($y_i = a + b x_i + e_i$, eq. 1) and the prediction equation ($\hat{y}_i = a + b x_i$, eq. 2).
Subtract $\hat{y}_i$ from both sides to define the difference between y and $\hat{y}$:
$y_i - \hat{y}_i = a + b x_i + e_i - \hat{y}_i$
Substitute eq. 2 in for $\hat{y}_i$:
$y_i - \hat{y}_i = a + b x_i + e_i - (a + b x_i)$
The $a + bx$ terms cancel out, and we confirm that the error term is the only difference between y and $\hat{y}$:
$y_i - \hat{y}_i = e_i$
We are interested in minimizing the squared errors:
$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2$
23. Now to minimize it for a
Begin with our equation:
$S = \sum (y_i - a - b x_i)^2$
Treat b as a constant and take the partial derivative with respect to a:
$\frac{\partial S}{\partial a} = \sum 2 (y_i - a - b x_i)(-1)$
Set the derivative to 0 to find the minimum:
$\sum 2 (y_i - a - b x_i)(-1) = 0$
Remove constants from the summation term:
$-2 \sum (y_i - a - b x_i) = 0$
Divide both sides by -1 and 2:
$\sum (y_i - a - b x_i) = 0$
Distribute the summation sign (note $\sum a = na$):
$\sum y_i - na - \sum b x_i = 0$
Pull the b constant out of the summation:
$\sum y_i - na - b \sum x_i = 0$
Solve for $\sum y_i$ by adding the other terms to both sides:
$\sum y_i = na + b \sum x_i$
24. Now we minimize it for b
Begin with our equation:
$S = \sum (y_i - a - b x_i)^2$
Treat a as a constant and take the partial derivative with respect to b:
$\frac{\partial S}{\partial b} = \sum 2 (y_i - a - b x_i)(-x_i)$
Set the derivative to 0 to find the minimum:
$\sum 2 (y_i - a - b x_i)(-x_i) = 0$
Remove constants from the summation term:
$2 \sum (y_i - a - b x_i)(-x_i) = 0$
Divide both sides by 2:
$\sum (y_i - a - b x_i)(-x_i) = 0$
Distribute the $-x_i$:
$\sum (-x_i y_i + a x_i + b x_i^2) = 0$
Distribute the summation sign:
$-\sum x_i y_i + \sum a x_i + \sum b x_i^2 = 0$
Pull the constants from the summation terms:
$-\sum x_i y_i + a \sum x_i + b \sum x_i^2 = 0$
Solve for $\sum x_i y_i$:
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
25. Normal Equations
- These have yielded the normal equations.
- For a: $\sum y_i = na + b \sum x_i$
- For b: $\sum x_i y_i = a \sum x_i + b \sum x_i^2$
- By solving the normal equations for a and b, we get the least squares coefficients a and b.
26. For a
Start with the normal equation:
$\sum y_i = na + b \sum x_i$
Subtract the $b \sum x_i$ term:
$\sum y_i - b \sum x_i = na$
Divide by n (does anything look familiar? These are the means of y and x):
$a = \frac{\sum y_i}{n} - b \frac{\sum x_i}{n} = \bar{y} - b \bar{x}$
Once we know b, a will be easy to get.
27. Start with the Normal Eq. for b
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
Substitute the equation in for a:
$\sum x_i y_i = (\bar{y} - b \bar{x}) \sum x_i + b \sum x_i^2$
Modify the a equation (write the means as sums over n):
$\sum x_i y_i = \frac{\sum y_i - b \sum x_i}{n} \sum x_i + b \sum x_i^2$
Distribute the $\sum x_i$ term into the a equation:
$\sum x_i y_i = \frac{\sum x_i \sum y_i}{n} - \frac{b (\sum x_i)^2}{n} + b \sum x_i^2$
Multiply everything by n:
$n \sum x_i y_i = \sum x_i \sum y_i - b (\sum x_i)^2 + n b \sum x_i^2$
Subtract the $\sum x_i \sum y_i$ term from both sides:
$n \sum x_i y_i - \sum x_i \sum y_i = n b \sum x_i^2 - b (\sum x_i)^2$
Remove b from the right-hand side (reverse distribution):
$n \sum x_i y_i - \sum x_i \sum y_i = b \left( n \sum x_i^2 - (\sum x_i)^2 \right)$
Use division to solve for b:
$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$
28. With a little more manipulation
Recall that we found:
$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b \bar{x}$
These are the least squares estimators for a and b. Let's give these equations a shot.
29. (No transcript: figure slide)
30. From the data: $\sum (x_i - \bar{x})(y_i - \bar{y}) = 54.73$ and $\sum (x_i - \bar{x})^2 = 4.19$, so $b = 54.73 / 4.19 = 13.06$.
31. Calculating a and b
- b = 54.73 / 4.19; b = 13.06
- $a = \bar{y} - b\bar{x}$
- a = 235.7 - b(11.2)
- = 235.7 - 13.06(11.2)
- = 235.7 - 146.27
- = 89.43
32. Remember: e is the difference between y and $\hat{y}$. It is not directly in the prediction equation, $\hat{y} = a + bx$.
33. Now, how to do this in STATA
- Type: regress robberies poverty

. regress robberies poverty

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341           Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002           R-squared     =  0.1283
-------------+------------------------------           Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636           Root MSE      =   23.23

------------------------------------------------------------------------------
   robberies |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   13.06206   11.35007     1.15   0.279    -12.61358    38.73771
       _cons |   89.40434   127.4166     0.70   0.501    -198.8321    377.6408
------------------------------------------------------------------------------
34. What does the line mean?
- The regression line gives our predicted value of y for every value of x.
- Or, we can get STATA to give us a predicted value of y for every value of x we observe.
- It's easy: after your regression, just type
- predict newvar
- If we type predict yhat after running the regression of robberies on poverty, we get a new variable named yhat that contains the predicted value of y for each observation (see the sketch below).
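Here is that workflow end to end on Stata's built-in auto data (the variable name fit is an arbitrary choice for illustration):

. sysuse auto, clear
. regress price mpg
. predict fit                 // fit holds the predicted price for each car
. list price mpg fit in 1/5

After predict, fit plays exactly the role yhat plays in the robberies example.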
35. Predicted Values (Stata list output)
       year  poverty  robberies      yhat
1 1984 11.6 205 240.9243
2 1985 11.4 209 238.3118
3 1986 10.9 225 231.7808
4 1987 10.7 213 229.1684
5 1988 10.4 221 225.2498
6 1989 10.3 233 223.9436
7 1990 10.7 257 229.1684
8 1991 11.5 273 239.6181
9 1992 11.9 264 244.8429
10 1993 12.3 256 250.0677
11 1994 11.6 238 240.9243
36. Then we can create a scatterplot that shows both the observed values and the predicted values (aka fitted values).
- Type: scatter yhat robberies poverty, c(l)
37. Getting Predictions by Hand
- For 1984, X = 11.6, and the regression equation gives:
- $\hat{Y}$ = 89.43 + (13.06 × 11.6)
- $\hat{Y}$ = 89.43 + 151.50 = 240.93 (the yhat table shows 240.9243, using the unrounded coefficients)
- For 1993, X = 12.3 (% of families below poverty). What value would we predict for Y (robberies per 100,000)?
- $\hat{Y}$ = 89.43 + (13.06 × 12.3)
- $\hat{Y}$ = 89.43 + 160.64 = 250.07
38. What You Should Know and Be Able to Do
- Explain the least squares principle
- Calculate a and b by hand
- Calculate predicted values by hand
- Use STATA's -regress- command to estimate a and b
- Use STATA's -predict- command to get predicted values for each observed value of y
- Graph the predicted values using -scatter-
39. We Have the Best-Fitting Line
- Given certain assumptions, which we will discuss shortly, the Ordinary Least Squares estimator is the Best Linear Unbiased Estimator: OLS is BLUE.
- It may be the best-fitting line, but is it a good-fitting line?
- How close are the dots to the line?
- We need a measure of association (like correlation).
40. How Good Does It Get?
[Figure: two scatterplots fit by the same line]
In both instances, the regression equation is the same. How are these results different from each other?
41. Two Ways to Analyze Fit
- Root Mean Square Error (aka RMSE or standard error of the regression)
- Proportional Reduction of Error (aka PRE or r²)
- Both are based on $\sum e_i^2$
42. Terminology
- We may call $\sum (y_i - \bar{y})^2$ the Total Sum of Squares (TSS)
- We call $\sum (y_i - \hat{y}_i)^2$ the Residual Sum of Squares (RSS)
- We call TSS - RSS the Regression Sum of Squares (RegSS)
43. Root Mean Square Error
- We want to answer the question: on average, how far is each observed y from the regression line?
- Similar to the question answered by measures of dispersion: on average, how far is each observed y from the mean of y?
- We essentially compute a standard deviation for how far the observed points are from the line, as written out below.
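The formula mirrors the standard deviation, except the deviations are taken around the line and the divisor is the residual degrees of freedom (n - 2 with one independent variable, which is what Stata's Root MSE uses):

$RMSE = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$

Check against the robberies output: $\sqrt{4856.904 / 9} \approx 23.23$, the reported Root MSE.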
44. RMSE Tells Us
- In units of the dependent variable, how much dispersion there is around the regression line.
- Advantages:
- In units of the D.V.
- Doesn't inflate with added variables
- Compares within models with the same D.V.
- Disadvantage:
- No standard scale for what is a good fit
- Can't compare across different D.V.s
45. Proportional Reduction of Error (P.R.E.)
- Before we knew the regression relationship, our best prediction of y was the mean of y.
- If you run a regression with no independent variable, you get a constant which is equal to the mean.
- How big are the residuals using the mean alone?
- Compare that with the residuals using the regression.
46. P.R.E.: Proportional Reduction of Error
- The error we make in prediction knowing nothing about the independent variable is the distance between each observed value and the mean: $\sum (y_i - \bar{y})^2$.
- The error we make in prediction knowing the independent variable is the distance between each observed value and the regression line (its predicted value): $\sum (y_i - \hat{y}_i)^2$. Comparing the two gives the PRE measure below.
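In the TSS/RSS terminology from above, the proportional reduction of error is:

$r^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

Check against the robberies output: $714.732 / 5571.636 \approx 0.1283$, the reported R-squared.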
47. Also Notable About r²
- r is the correlation coefficient.
- $r^2$ is a measure of association for regression. It gives the proportional reduction of error from using the regression.
- $r^2$ gives the percentage of the variation in y that is explained.
- In the bivariate regression context, squaring the correlation coefficient gives $r^2$ (and vice versa: $\sqrt{r^2} = |r|$).
48. Final Notes on r²
- Adding useless variables to the model will still make $r^2$ go up. This makes it look like adding more and more variables makes the model better and better.
- For this reason, $r^2$ is not the favorite measure of fit.
- An alternative, adjusted $r^2$, accounts for the number of variables in the model and adjusts your $r^2$ accordingly, as shown below.
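The usual adjustment (the one Stata reports), with k the number of independent variables:

$\bar{r}^2 = 1 - \frac{RSS / (n - k - 1)}{TSS / (n - 1)}$

Check against the robberies output: $1 - (4856.904/9) / (5571.636/10) \approx 0.0314$, the reported Adj R-squared.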
49. Calculating RMSE and r²
- You can calculate RMSE and $r^2$ by hand.
- We've already computed $\sum (y_i - \bar{y})^2$ before.
- To get $\sum (y_i - \hat{y}_i)^2$, you just calculate each predicted value of y by hand, and then subtract.
- Then plug and chug.
- You can get the predicted values of y using STATA's predict command. You can get the difference between y and the mean of y in STATA too; one way to do it is sketched below.
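A minimal sketch of that by-hand computation in Stata, again on the built-in auto data (the names e2 and dev2 are arbitrary choices for illustration):

. sysuse auto, clear
. regress price mpg
. predict e, resid                   // residual: y minus yhat
. gen e2 = e^2                       // squared residuals, for RSS
. quietly summarize price
. gen dev2 = (price - r(mean))^2     // squared deviations from the mean, for TSS
. quietly summarize e2
. display "RSS = " r(sum)
. quietly summarize dev2
. display "TSS = " r(sum)

From there, RMSE = sqrt(RSS/(n-2)) and r² = 1 - RSS/TSS follow by plug and chug.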
50. Calculating RMSE and r²
- You can get the working parts from the STATA output:

. regress robberies poverty

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341           Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002           R-squared     =  0.1283
-------------+------------------------------           Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636           Root MSE      =   23.23

------------------------------------------------------------------------------
   robberies |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   13.06206   11.35007     1.15   0.279    -12.61358    38.73771
       _cons |   89.40434   127.4166     0.70   0.501    -198.8321    377.6408
------------------------------------------------------------------------------

RegSS = Model SS = 714.732341
RSS = Residual SS = 4856.90402
TSS = Total SS = 5571.63636
51. Calculating RMSE and r²
- Or, STATA just gives them to you: R-squared (0.1283) and Root MSE (23.23) are reported in the upper-right block of the same output.

. regress robberies poverty

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341           Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002           R-squared     =  0.1283
-------------+------------------------------           Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636           Root MSE      =   23.23