Scatterplots, Correlation, and Simple Regression - PowerPoint PPT Presentation
Provided by: DAN3179

Transcript and Presenter's Notes

1
Scatterplots, Correlation, and Simple Regression
  • Lecture 2
  • POLS 7014

2
Scatterplot
  • A pictorial depiction of the relationship
    between two variables.
  • A two-dimensional surface on which the X and Y
    scores of all the objects in your study are
    represented, with each object's value on X and
    value on Y appearing as a single point.

3
Two Interests
  • Slope: the magnitude of the relationship between
    the independent variable and the dependent
    variable (how much change in one yields how much
    change in the other).
  • Correlation: the predictive power of one variable
    on another. This is a Measure of Association
    (but not a PRE Measure of Association).

4
Correlation
5
Correlation analysis asks
  • How good a predictor is the independent variable
    of the dependent variable?
  • How good a predictor of income is education?
  • How accurate is our prediction of the effect of
    education on income?
  • How close are the dots to the line?
  • It is a Measure of Association

6
Computing the Correlation Coefficient, Pearson's r
  • Assess how much X and Y move together
    (covariance) relative to the amount they move
    individually (variance):
  • r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² Σ(y − ȳ)²]
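The same computation can be sketched outside STATA; a minimal Python version (the data are the poverty/robberies series used later in this lecture):

```python
from math import sqrt

# Poverty (%) and robberies (per 100,000), 1984-1994 -- the example
# data used later in this lecture.
x = [11.6, 11.4, 10.9, 10.7, 10.4, 10.3, 10.7, 11.5, 11.9, 12.3, 11.6]
y = [205, 209, 225, 213, 221, 233, 257, 273, 264, 256, 238]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# How much x and y move together, out of how much they move individually.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 4))  # about 0.358; its square matches the R-squared seen later
```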

7
Computing r in STATA
  • corr var1 var2 ... var100
  • Output looks like this

. corr unemplyd mdnincm flood age65 black
(obs=427)

             unemplyd  mdnincm    flood    age65    black
---------------------------------------------------------
unemplyd       1.0000
mdnincm       -0.4960   1.0000
flood          0.0827   0.0083   1.0000
age65          0.0319  -0.1634  -0.0272   1.0000
black          0.5037  -0.3065   0.0703  -0.1038   1.0000
8
Slope of the Regression Line
  • We want to know the magnitude of the relationship
  • How much change in y is generated by a 1-unit
    change in x ?

9
Where do we Draw the Line?
10
Minimize the sum of the distances between the
points and the line

(figure: scatterplot with the vertical distances from
each point to the line labeled −.25, 2, 2, −3.5, −.25)

Problem: they all add up to zero! Solution:
square the distances, or take the absolute value.
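A quick numeric check of the problem, using the five distances from the figure (a Python sketch):

```python
# The five signed distances from the figure above.
distances = [-0.25, 2, 2, -3.5, -0.25]

plain_sum = sum(distances)                     # cancels to zero
squared_sum = sum(d ** 2 for d in distances)   # squaring keeps magnitudes
abs_sum = sum(abs(d) for d in distances)       # so does the absolute value

print(plain_sum, squared_sum, abs_sum)  # 0.0 20.375 8.0
```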
11
Which Is Better?
  • Ordinary Least Squares (OLS) regression minimizes
    the sum of squared errors, Σe²
  • Desirable mathematical properties (continuity)
  • Least Absolute Deviation (LAD or LAV) regression
    minimizes the sum of absolute errors, Σ|e|
  • Resistant to outliers
  • We will return to this question, but the answer
    in most cases will be OLS, so we'll start there.

12
What do we want Regression to Do?
  • Regression summarizes the relationship between
    two interval-level variables
  • We want to explain y
  • We could just use the mean of income:
  • ȳ = 52.43
  • The mean just tells us, on average, what a
    variable does

13
Expected Value
  • We might be able to explain y better by using a
    different approach.
  • We think the value of y we observe is
    conditional on the value of x.
  • Take the mean of y at each value of x
  • We essentially have a frequency distribution for
    the values y can take on for each value of x

14
Problem
  • Even with many cases, we often do not observe
    multiple cases at the same value of x. No average.

15
Solution
  • Rather than use an observed frequency
    distribution, we can treat the values y can take
    on as a random variable.
  • We get a probability distribution for the values
    y can possibly take on for a given value of x
  • We can find the mean of that probability
    distribution (the expected value of yi )

16
E(Y | xi)
The one time we observe y at a given x, it is likely
to be close to the mean of its probability distribution
17
The Moral of the Story
  • The regression equation gives us a line of
    predicted values
  • When we interpret regression, it only makes sense
    to interpret our predictions as E(y | xi)

18
So how do we get this equation?
  • Recall we said we wanted to minimize the distance
    between the points and the line

(figure: the same scatterplot, with point-to-line
distances −.25, 2, 2, −3.5, −.25)
19
How do we Symbolize this?

(figure: the same scatterplot with point-to-line
distances)

  • We want the line that minimizes the sum of
    squared errors. Formally: find a and b that
    minimize Σe² = Σ(y − a − bx)²

20
How do we get this minimum?
There is a mathematical procedure that finds the
respective values of a and b that minimize
Σe² = Σ(y − a − bx)²
21
The procedure is the Partial Derivative
  • If you know calculus, this makes sense
  • If you don't, just know that like subtraction
    gives differences/distances, and addition gives
    sums, partial derivatives help us find minimum
    and maximum values.
  • Just like we can say that x1 − x2 gives us the
    distance between two points (no matter what x1
    and x2 are), we can use the partial derivative
    procedure to find the values of a and b that
    minimize Σe² = Σ(y − a − bx)², even if we
    don't know x and y
  • Those values of a and b that minimize this
    function (equal to the sum of squared errors) are
    the regression slope and intercept
22
Here we go
Start with the regression model and the prediction
equation:
y = a + bx + e          (eq. 1)
ŷ = a + bx              (eq. 2)
Subtract ŷ from both sides to define the
difference between y and ŷ:
y − ŷ = a + bx + e − ŷ
Substitute eq. 2 in for ŷ:
y − ŷ = a + bx + e − (a + bx)
The a + bx terms cancel out and we confirm
that the error term is the only difference
between y and ŷ:
y − ŷ = e
We are interested in minimizing the squared
errors:
Σe² = Σ(y − ŷ)²
23
Now to minimize it for a
Begin with our equation:
Σe² = Σ(y − a − bx)²
Treat b as a constant and take the partial derivative
with respect to a:
∂Σe²/∂a = Σ 2(y − a − bx)(−1)
Set the derivative to 0 to find the minimum:
Σ 2(y − a − bx)(−1) = 0
Remove constants from the summation term:
−2 Σ(y − a − bx) = 0
Divide both sides by −1 and 2:
Σ(y − a − bx) = 0
Distribute the summation sign (Σa = na):
Σy − na − Σbx = 0
Pull the constant b out of the summation:
Σy − na − bΣx = 0
Solve for Σy by adding the other terms to both sides:
Σy = na + bΣx
24
Now we minimize it for b
Begin with our equation:
Σe² = Σ(y − a − bx)²
Treat a as a constant and take the partial
derivative with respect to b:
∂Σe²/∂b = Σ 2(y − a − bx)(−x)
Set the derivative to 0 to find the minimum:
Σ 2(y − a − bx)(−x) = 0
Remove constants from the summation term and
divide both sides by 2:
Σ(y − a − bx)(−x) = 0
Distribute the −x:
Σ(−xy + ax + bx²) = 0
Distribute the summation sign:
−Σxy + Σax + Σbx² = 0
Pull constants from the summation terms:
−Σxy + aΣx + bΣx² = 0
Solve for Σxy:
Σxy = aΣx + bΣx²
25
Normal Equations
  • These have yielded the normal equations
  • For a: Σy = na + bΣx
  • For b: Σxy = aΣx + bΣx²
  • By solving the normal equations for a and b, we
    get the least squares coefficients a and b

26
For a
Start with the normal eq.: Σy = na + bΣx
Subtract the bΣx term: Σy − bΣx = na
Divide by n: ȳ − bx̄ = a. Does anything look
familiar? (Those are the means of y and x.)
Once we know b, a will be easy to get:
a = ȳ − bx̄
27
Start with the Normal Eq. for b:
Σxy = aΣx + bΣx²
Substitute the equation in for a (a = ȳ − bx̄):
Σxy = (ȳ − bx̄)Σx + bΣx²
Rewrite the a equation in terms of sums
(ȳ = Σy/n, x̄ = Σx/n):
Σxy = (Σy/n − bΣx/n)Σx + bΣx²
Distribute the Σx term into the a equation:
Σxy = ΣxΣy/n − b(Σx)²/n + bΣx²
Multiply everything by n:
nΣxy = ΣxΣy − b(Σx)² + nbΣx²
Subtract the ΣxΣy term from both sides:
nΣxy − ΣxΣy = nbΣx² − b(Σx)²
Remove b from the right-hand side (reverse
distribution):
nΣxy − ΣxΣy = b[nΣx² − (Σx)²]
Use division to solve for b:
b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
28
With a little more manipulation, this becomes
b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
Recall that we found
a = ȳ − bx̄
These are the least squares estimators for a and
b. Let's give these equations a shot.
29
(No Transcript)
30
Σ(x − x̄)(y − ȳ) = 54.73    Σ(x − x̄)² = 4.19
b = 54.73 / 4.19 = 13.06
31
Calculating a and b
  • b = 54.73 / 4.19
  • b = 13.06
  • a = ȳ − bx̄
  • a = 235.7 − b(11.2)
  •   = 235.7 − 13.06(11.2)
  •   = 235.7 − 146.27
  •   = 89.43
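The same hand calculation can be checked with a short Python sketch of the least-squares estimators (data from the yhat table later in the deck):

```python
# Poverty (x) and robberies (y) from the lecture's data table.
x = [11.6, 11.4, 10.9, 10.7, 10.4, 10.3, 10.7, 11.5, 11.9, 12.3, 11.6]
y = [205, 209, 225, 213, 221, 233, 257, 273, 264, 256, 238]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);  a = ybar - b * xbar
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

print(round(b, 5), round(a, 5))  # compare with STATA's 13.06206 and 89.40434
```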

32
Remember: e is the difference
between y and ŷ. It does not appear
directly in the prediction equation ŷ = a + bx.
33
Now, how to do this in STATA
  • Type: regress robberies poverty

. regress robberies poverty

      Source |       SS       df       MS            Number of obs =      11
-------------+------------------------------        F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341        Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002        R-squared     =  0.1283
-------------+------------------------------        Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636        Root MSE      =   23.23

------------------------------------------------------------------------------
   robberies |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   13.06206   11.35007     1.15   0.279   -12.61358    38.73771
       _cons |   89.40434   127.4166     0.70   0.501   -198.8321    377.6408
------------------------------------------------------------------------------
34
What does the line mean?
  • The regression line gives our predicted value of
    y for every value of x
  • Or,
  • We can get STATA to give us a predicted value of
    y for every value of x we observe.
  • It's easy: after your regression, just type
  • predict newvar
  • If we type predict yhat after running the
    regression of robberies on poverty, we get a new
    variable named yhat that contains the predicted
    value of y for each observation

35
        year   poverty   robberies       yhat
  1.    1984      11.6         205   240.9243
  2.    1985      11.4         209   238.3118
  3.    1986      10.9         225   231.7808
  4.    1987      10.7         213   229.1684
  5.    1988      10.4         221   225.2498
  6.    1989      10.3         233   223.9436
  7.    1990      10.7         257   229.1684
  8.    1991      11.5         273   239.6181
  9.    1992      11.9         264   244.8429
 10.    1993      12.3         256   250.0677
 11.    1994      11.6         238   240.9243
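What -predict- does can be reproduced by hand; a Python sketch using the coefficients from the STATA output:

```python
# yhat = a + b * x for each observed x, using the estimated coefficients.
a, b = 89.40434, 13.06206  # _cons and poverty coefficients from STATA

poverty = [11.6, 11.4, 10.9, 10.7, 10.4, 10.3, 10.7, 11.5, 11.9, 12.3, 11.6]
yhat = [a + b * x for x in poverty]

# Compare with the yhat column above (tiny differences are rounding).
print(round(yhat[0], 4))   # 1984: the table shows 240.9243
print(round(yhat[9], 4))   # 1993: the table shows 250.0677
```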
36
  • Then we can create a scatterplot that shows both
    the observed values and the predicted values (aka
    fitted values)

Type: scatter yhat robberies poverty, c(l)
37
Getting Predictions by Hand
  • For 1984, X = 11.6, and the regression equation is
  • Ŷ = 89.43 + (13.06)(11.6)
  • Ŷ = 89.43 + 151.5 = 240.9
  • For 1993, X = 12.3 (% of families below poverty).
    What value would we predict for Y (robberies per
    100,000)?
  • Ŷ = 89.43 + (13.06)(12.3)
  • Ŷ = 89.43 + 160.6 = 250.07

38
What you should know and be able to do
  • Explain the least squares principle
  • Calculate a and b by hand
  • Calculate predicted values by hand
  • Use STATA's -regress- command to estimate a and b
  • Use STATA's -predict- command to get predicted
    values for each observed value of y
  • Graph the predicted values using -scatter-

39
We have the best-fitting line
  • Given certain assumptions, which we will discuss
    shortly, the Ordinary Least Squares estimator is
    the Best Linear Unbiased Estimator
  • OLS is BLUE
  • It may be the best-fitting line, but is it a
    good-fitting line?
  • How close are the dots to the line?
  • We need a measure of association (like
    correlation)

40
How Good does it get?
In both instances, the regression equation is the
same. How are these results different from each
other?
41
Two ways to analyze fit
  • Root Mean Square Error (aka RMSE or Standard
    Error of the Regression)
  • Proportional Reduction of Error (aka PRE or r²)
  • Both are based on the squared errors, Σe²

42
Terminology
  • We may call Σ(y − ȳ)² the Total Sum of Squares
    (TSS)
  • We call Σ(y − ŷ)² the Residual Sum of Squares
    (RSS)
  • We call TSS − RSS the Regression Sum of Squares
    (RegSS)

43
Root Mean Square Error
  • We want to answer the question, "On average, how
    far is each observed y from the regression line?"
  • Similar to the question answered by measures of
    dispersion: "On average, how far is each observed
    y from the mean of y?"
  • We essentially compute a standard deviation for
    how far observed points are from the line:
    RMSE = √(Σ(y − ŷ)² / (n − 2)) in the bivariate case
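A Python sketch of that computation, using the residuals from the robberies regression (coefficients taken from the STATA output):

```python
from math import sqrt

x = [11.6, 11.4, 10.9, 10.7, 10.4, 10.3, 10.7, 11.5, 11.9, 12.3, 11.6]
y = [205, 209, 225, 213, 221, 233, 257, 273, 264, 256, 238]
a, b = 89.40434, 13.06206  # coefficients from the STATA output

# Residuals: how far each observed y is from the regression line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)   # Residual Sum of Squares
rmse = sqrt(rss / (len(y) - 2))        # n - 2 degrees of freedom (bivariate)

print(round(rss, 2), round(rmse, 2))   # compare: STATA reports 4856.90 and 23.23
```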

44
RMSE tells us
  • In units of the Dependent Variable, how much
    dispersion there is around the regression line
  • Advantages
  • In units of the D.V.
  • Doesn't inflate with added variables
  • Compares within models with the same D.V.
  • Disadvantages
  • No standard scale for what is a good fit
  • Can't compare across different D.V.s

45
Proportional Reduction of Error (P.R.E.)
  • Before we knew the regression relationship, our
    best prediction of y was the mean of y
  • If you run a regression with no independent
    variable, you get a constant which is equal to
    the mean.
  • How big are the residuals using the mean alone?
  • Compare that with residuals using regression

46
P.R.E.: Proportional Reduction of Error
  • The error we make in prediction knowing nothing
    about the independent variable is the distance
    between each observed value and the mean:
    Σ(y − ȳ)²
  • The error we make in prediction knowing the
    independent variable is the distance between each
    observed value and the regression line (its
    predicted value): Σ(y − ŷ)²

47
Also Notable about r²
  • r is the correlation coefficient
  • r² is a measure of association for regression. It
    gives the proportional reduction of error from
    using the regression: r² = (TSS − RSS) / TSS
  • r² is the proportion of the variation in y
    explained by the regression.
  • In the bi-variate regression context, r² is the
    square of Pearson's r (and vice-versa)
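A Python sketch of r² as proportional reduction of error, on the same data:

```python
x = [11.6, 11.4, 10.9, 10.7, 10.4, 10.3, 10.7, 11.5, 11.9, 12.3, 11.6]
y = [205, 209, 225, 213, 221, 233, 257, 273, 264, 256, 238]
a, b = 89.40434, 13.06206  # coefficients from the STATA output

y_bar = sum(y) / len(y)
tss = sum((yi - y_bar) ** 2 for yi in y)                      # error using the mean
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # error using the line

r2 = (tss - rss) / tss   # proportional reduction of error
print(round(r2, 4))      # compare: STATA reports R-squared = 0.1283
```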

48
Final notes on r²
  • Adding even useless variables to the model will
    make r² go up. It makes it look like adding more
    and more variables makes the model better and
    better
  • For this reason r² is not the favorite measure of
    fit
  • An alternative, adjusted r², accounts for the
    number of variables in the model and adjusts
    your r² accordingly.
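The adjustment can be sketched directly; the formula below is the standard one (an assumption here, but it reproduces the STATA output for this model):

```python
# adj r^2 = 1 - (1 - r^2) * (n - 1) / (n - k - 1)
n, k = 11, 1      # observations and number of regressors in this model
r2 = 0.1283       # R-squared from the STATA output

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))   # compare: STATA reports Adj R-squared = 0.0314
```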

49
Calculating RMSE and r²
  • You can calculate Σ(y − ȳ)² and Σ(y − ŷ)²
    by hand
  • We've already done Σ(y − ȳ)² before
  • To get Σ(y − ŷ)², you just calculate each
    predicted value of y by hand, and then subtract
  • Then plug and chug
  • You can get the predicted values of y using
    STATA's -predict- command. You can get the
    difference between y and the mean of y in STATA
    too.

50
Calculating RMSE and r²
  • You can get the working parts from the STATA
    output.

. regress robberies poverty

      Source |       SS       df       MS            Number of obs =      11
-------------+------------------------------        F(  1,     9) =    1.32
       Model |  714.732341     1  714.732341        Prob > F      =  0.2795
    Residual |  4856.90402     9  539.656002        R-squared     =  0.1283
-------------+------------------------------        Adj R-squared =  0.0314
       Total |  5571.63636    10  557.163636        Root MSE      =   23.23

------------------------------------------------------------------------------
   robberies |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   13.06206   11.35007     1.15   0.279   -12.61358    38.73771
       _cons |   89.40434   127.4166     0.70   0.501   -198.8321    377.6408
------------------------------------------------------------------------------

RegSS = Model SS = 714.732341
RSS = Residual SS = 4856.90402
TSS = Total SS = 5571.63636
51
Calculating RMSE and r²
  • Or, STATA just gives them to you: the same output
    reports R-squared = 0.1283 and Root MSE = 23.23
    directly.