Regression - PowerPoint PPT Presentation

Slides: 35
Provided by: patric53


Transcript and Presenter's Notes

Title: Regression

Regression & Correlation (1)
  1. A relationship between 2 variables X and Y
  2. The relationship seen as a straight line
  3. Two problems
  4. How can we tell if our regression line is useful?
  5. Test of hypothesis about the slope, $\beta_1$
  6. Correlation
  7. Useful features of r
  8. Test of hypothesis about $\rho$
  9. Examples

A relationship between two variables X & Y
  • We often have pairs of scores for a given set of
    cases. For example, we might have
  • number of years of education and annual income, or
  • IQ and GPA, or
  • income and number of books in the household
  • More generally, we have any X and Y, and our
    question is, does knowing something about X tell
    us anything about Y?

A relationship between two variables X & Y
  • Does knowing something about X tell us anything
    about Y?
  • For example, knowing how many years of education
    a person has, could you usefully estimate their
    annual income, or the number of cigarettes they
    smoke in a year?

A relationship between two variables X & Y
  • Often, the answer to that question is yes:
    there is a relationship between the X and Y
    scores you have measured.
  • On average, as number of years of education goes
    up (across a set of people), number of cigarettes
    smoked per year goes down.

A relationship between two variables X & Y
  • In the graph on the next slide, we see two things:
  • Y goes down as X goes up.
  • At each value of X, there is some variability in
    Y, but substantially less than there is in Y overall.

[Figure: graph of Y (cigarettes per year) against X (years of education). Note that the range of the Y values at a given value of X is small compared to the whole range of Y in the data set.]

The relationship seen as a straight line
  • The relationship between an X and a Y can be
    described using the equation for a straight line.
  • $Y = \beta_0 + \beta_1 X + \varepsilon$
  • where $\beta_0$ is the Y-intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error
  • Note: this is the (theoretical) population
    equation relating Y to X

Two problems
  • $Y = \beta_0 + \beta_1 X + \varepsilon$
  • In principle, this equation would let us predict
    the value of Y for a given X without error IF
  • A. X were the only variable that influenced Y
  • Usually, it isn't
  • B. We knew the population values of $\beta_0$ and $\beta_1$
  • Usually, we don't

Two problems
  • Be sure to distinguish between
  • A. Actual values of Y in the population.
  • B. Values of Y we would predict using
  • $Y = \beta_0 + \beta_1 X + \varepsilon$
  • if we had the population values for $\beta_0$ and $\beta_1$.
  • C. Values of Y we predict on the basis of the X-Y
    relationship in our sample data
  • $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

Why no $\varepsilon$ here?

Two problems
  • When we predict Y on the basis of X for a given
    case, two things can cause the predicted values
    to be different from the values we would find if
    we actually measured Y for that case
  • 1. We don't know the population values of $\beta_0$ and
    $\beta_1$, only the sample estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.
  • Note that if we did know $\beta_0$ and $\beta_1$, this source of
    error would disappear.

Two problems
  • 2. In the population, Y is not uniquely
    determined by X. As a result, for each value of
    X, there is a distribution of Y values.
  • Relative to our predicted Y for a given value of
    X, the observed values of Y will sometimes be
    higher and sometimes be lower.
  • These errors are random: over the long term,
    they will cancel each other out.
  • But even if we knew $\beta_0$ and $\beta_1$, this source of
    error would still exist.

Two problems
  • In other words:
  • We don't have population values for the slope and
    the intercept of the line relating X to Y. That's
    one problem.
  • Even if we had population values for the slope
    and the intercept, the equation relating X to Y
    would still not perfectly predict Y. That's the
    other problem.

How can we tell if our regression line is useful?
  • The line is useful if the predicted values of Y
    are close to the observed values of Y (in the
    sample).
  • We use our sample X and Y values to compute the
    regression line, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.
  • We then use this line to predict the same Y
    values, and compare our predicted values with the
    observed values in the sample data. If the
    prediction is good, we can then use the
    regression line to predict Y for values of X not
    in our sample.

How can we tell if our regression line is useful?

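  • $(Y_i - \hat{Y}_i) = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$ (since $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$)
  • Therefore, the sum of the squared deviations of
    predicted Y values from actual Y values is
  • $SSE = \sum [Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)]^2$
  • Now $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least squares estimators
    of $\beta_0$ and $\beta_1$, giving a smaller SSE than any other
    values would.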
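
The least squares property can be checked numerically. The sketch below is only an illustration: the five (X, Y) pairs are hypothetical toy data, not values from the slides, and the function names `sse` and `least_squares` are invented for this example. It computes the estimates from the sums of squares and then shows that nudging either estimate makes SSE larger.

```python
# Hypothetical toy data (NOT from the slides), used only for illustration.
xs = [8, 10, 12, 14, 16]   # e.g. years of education
ys = [30, 26, 21, 18, 15]  # e.g. some outcome score


def sse(b0, b1, xs, ys):
    """Sum of squared deviations of observed Y from the line b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Return (intercept, slope) using slope = SSxy / SSxx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0_hat, b1_hat = least_squares(xs, ys)
best = sse(b0_hat, b1_hat, xs, ys)
print("intercept:", b0_hat, "slope:", b1_hat, "SSE:", best)

# Moving either estimate away from its least squares value increases SSE.
for d0, d1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    worse = sse(b0_hat + d0, b1_hat + d1, xs, ys)
    print("shift", (d0, d1), "-> SSE", worse, "larger:", worse > best)
```

Any other line through the same points could be substituted in the loop; none should give a smaller SSE than the fitted one.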

When there is no relation between X and Y, the
best estimator of the Y value for any case is the
mean, $\bar{Y}$.
Notice that the slope of this line is zero!

How can we tell if our regression line is useful?
  • If X is completely unrelated to Y, the best
    estimate we could make of Y would be the mean, $\bar{Y}$,
    for any value of X.
  • We find out whether our regression line is useful
    by asking whether its slope is different from 0.
  • $H_0: \beta_1 = 0$

Why not $\hat{\beta}_1$?

How can we tell if our regression line is useful?
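  • To test that null hypothesis, we use the fact
    that $\hat{\beta}_1$ is one slope taken from the sampling
    distribution of $\hat{\beta}_1$.
  • $\hat{\beta}_1 = \frac{SS_{XY}}{SS_{XX}}$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
  • where $SS_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}$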
How can we tell if our regression line is useful?
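  • $SS_{XX} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$
  • ($n$ = sample size)
  • For the sampling distribution of $\hat{\beta}_1$:
  • the mean is $\beta_1$, and the standard deviation is $\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{XX}}}$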

How can we tell if our regression line is useful?
  • We estimate $\sigma_{\hat{\beta}_1}$ by $s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{XX}}}$
  • where $s = \sqrt{\frac{SSE}{n-2}}$

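Test of hypothesis about the slope, $\beta_1$
  • Since $\sigma$ is unknown, we use t to test $H_0$
  • One-tailed test: $H_0: \beta_1 = 0$ versus $H_A: \beta_1 < 0$ (or $\beta_1 > 0$)
  • Two-tailed test: $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$
  • Test statistic: $t = \frac{\hat{\beta}_1 - 0}{s_{\hat{\beta}_1}}$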

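Test of hypothesis about the slope, $\beta_1$
  • Rejection region
  • One-tailed test: $t_{obt} < -t_{\alpha}$ (or $t_{obt} > t_{\alpha}$ when $H_A: \beta_1 > 0$)
  • Two-tailed test: $|t_{obt}| > t_{\alpha/2}$
  • $t_{crit}$ is based on $n - 2$ degrees of freedom.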
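
The slope test can be sketched end to end in a few lines. This is an illustration under stated assumptions: the data are hypothetical toy values (not from the slides), the function name `slope_t_test` is invented here, and the critical value is simply read from a standard t table for a two-tailed test with 3 degrees of freedom at $\alpha = .05$.

```python
import math


def slope_t_test(xs, ys):
    """Return (slope estimate, its standard error, t) for H0: beta1 = 0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    # SSE: squared deviations of observed Y from the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))      # estimate of sigma
    s_b1 = s / math.sqrt(ss_xx)       # estimated standard error of the slope
    t = (b1 - 0) / s_b1
    return b1, s_b1, t


# Hypothetical toy data (NOT from the slides).
xs = [8, 10, 12, 14, 16]
ys = [30, 26, 21, 18, 15]

b1_hat, s_b1, t_obt = slope_t_test(xs, ys)

# Two-tailed critical value for df = n - 2 = 3, alpha = .05 (from a t table).
t_crit = 3.182
reject_h0 = abs(t_obt) > t_crit
print("slope:", b1_hat, "SE:", s_b1, "t:", t_obt, "reject H0:", reject_h0)
```

Passing the critical value in from a table keeps the sketch dependency-free; a statistics library could supply it instead.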

Correlation
  • The Pearson correlation coefficient r is a
    numerical, descriptive measure of the strength
    and direction of the relationship between two
    variables X and Y.
  • $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}}$
  • r gives much the same information as $\hat{\beta}_1$. However,
    r is scale-less and $-1 \le r \le 1$.
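
What "scale-less" means can be seen by changing the units of X. In the sketch below (hypothetical toy data and invented helper names `pearson_r` and `slope`, none of them from the slides), X is expressed first in years and then in months: the slope changes by a factor of 12, while r stays the same.

```python
import math


def _sums_of_squares(xs, ys):
    """Return (SSxy, SSxx, SSyy) from the definitional formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)
    return ss_xy, ss_xx, ss_yy


def pearson_r(xs, ys):
    ss_xy, ss_xx, ss_yy = _sums_of_squares(xs, ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def slope(xs, ys):
    ss_xy, ss_xx, _ = _sums_of_squares(xs, ys)
    return ss_xy / ss_xx


# Hypothetical toy data (NOT from the slides).
xs_years = [8, 10, 12, 14, 16]
ys = [30, 26, 21, 18, 15]
xs_months = [12 * x for x in xs_years]  # same information, different units

r_years, r_months = pearson_r(xs_years, ys), pearson_r(xs_months, ys)
slope_years, slope_months = slope(xs_years, ys), slope(xs_months, ys)

print("r (years):", r_years, " r (months):", r_months)
print("slope (years):", slope_years, " slope (months):", slope_months)
```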


Useful features of r
  • r indexes the X-Y relationship
  • $r > 0$ means Y increases as X increases
  • $r < 0$ means Y decreases as X increases
  • $r = 0$ means there is no linear relationship between X and Y
  • r is the sample correlation coefficient. We can
    use it to estimate rho ($\rho$), the population
    correlation coefficient, and use r to test $H_0: \rho = 0$

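Test of hypothesis about $\rho$
  • One-tailed test: $H_0: \rho = 0$ versus $H_A: \rho < 0$ (or $\rho > 0$)
  • Two-tailed test: $H_0: \rho = 0$ versus $H_A: \rho \neq 0$
  • Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
  • $t_{crit}$ has $n - 2$ degrees of freedom.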

Example 1
  • $H_0: \rho = 0$
  • $H_A: \rho \neq 0$
  • Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
  • $t_{crit} = t(5, \alpha/2 = .025) = 2.571$.

Example 1 Sum formulas
  • First, calculations involving X
  • $\sum X = 74$, $(\sum X)^2 = 5476$, $\sum X^2 = 922$
  • Then, analogous calculations involving Y
  • $\sum Y = 82$, $(\sum Y)^2 = 6724$, $\sum Y^2 = 1076$
  • Then, calculations involving X and Y
  • $\sum XY = 976$

Example 1 Sums of squares formulas
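  • $SS_{XY} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}$
  • $SS_{XX} = \sum (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}$
  • $SS_{YY} = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$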
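
Each sum of squares has two algebraically equivalent forms: a definitional form built from deviations about the means, and a computational form built from raw sums. A minimal sketch on hypothetical toy data (not from the slides; the function names are invented for this example) shows the two agreeing:

```python
def ss_definitional(us, vs):
    """Sum of cross-products of deviations from the means."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))


def ss_computational(us, vs):
    """Same quantity from raw sums: sum(UV) - sum(U) * sum(V) / n."""
    n = len(us)
    return sum(u * v for u, v in zip(us, vs)) - sum(us) * sum(vs) / n


# Hypothetical toy data (NOT from the slides).
xs = [8, 10, 12, 14, 16]
ys = [30, 26, 21, 18, 15]

# Passing the same list twice gives SSxx or SSyy; mixed lists give SSxy.
for label, us, vs in [("SSxy", xs, ys), ("SSxx", xs, xs), ("SSyy", ys, ys)]:
    print(label, ss_definitional(us, vs), ss_computational(us, vs))
```

The computational form is the one the worked examples use, since it needs only the running sums.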

Example 1 calculate r
  • $SS_{XY} = 109.143$
  • $SS_{XX} = 139.71$
  • $SS_{YY} = 115.429$
  • $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}} = .859$

Example 1 do t-test
  • $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
  • $t = \frac{.859 - 0}{\sqrt{\frac{1 - .738}{5}}} = \frac{.859}{.229} = 3.751$
  • Reject $H_0$: a significant correlation exists.
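
The arithmetic of Example 1 can be reproduced directly from the sums given on the slides. One assumption is made here: the sample size is taken to be n = 7, inferred from the 5 degrees of freedom (n - 2 = 5). Carrying full precision gives a t of roughly 3.76 rather than 3.751; the small difference appears to come from rounding intermediate values, and the conclusion is unchanged.

```python
import math

# Sums from the Example 1 slides.
sum_x, sum_x2 = 74, 922
sum_y, sum_y2 = 82, 1076
sum_xy = 976
n = 7  # ASSUMPTION: inferred from df = n - 2 = 5

# Computational formulas for the sums of squares.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

r = ss_xy / math.sqrt(ss_xx * ss_yy)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_obt = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

t_crit = 2.571  # two-tailed, df = 5, alpha/2 = .025 (value from the slide)
reject_h0 = abs(t_obt) > t_crit

print("SSxy:", ss_xy, "SSxx:", ss_xx, "SSyy:", ss_yy)
print("r:", r, "t:", t_obt, "reject H0:", reject_h0)
```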

Example 2
Note: these are the Greek letter rho ($\rho$), NOT the
English letter p
  • $H_0: \rho = 0$
  • $H_A: \rho > 0$
  • Test statistic: $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
  • $t_{crit} = t(df = 7 - 2 = 5, \alpha = .05) = 2.015$

Example 2 Sum formulas
  • First, calculations involving X
  • $\sum X = 4.2$, $(\sum X)^2 = 17.64$, $\sum X^2 = 2.86$
  • Then, analogous calculations involving Y
  • $\sum Y = 32$, $(\sum Y)^2 = 1024$, $\sum Y^2 = 161.5$
  • Then, calculations involving X and Y
  • $\sum XY = 21.35$

Example 2 calculate r
  • $SS_{XY} = 21.35 - \frac{(4.2)(32)}{7} = 2.15$
  • $SS_{XX} = 2.86 - \frac{17.64}{7} = .34$

Example 2 calculate r
  • $SS_{YY} = 161.5 - \frac{1024}{7} = 15.2143$
  • $r = \frac{SS_{XY}}{\sqrt{SS_{XX} SS_{YY}}} = .945$

Example 2 do t-test
  • $t = \frac{r - \rho}{\sqrt{\frac{1 - r^2}{n - 2}}}$
  • $t = \frac{.945 - 0}{\sqrt{\frac{1 - .893}{5}}} = \frac{.945}{.146} = 6.48$
  • Reject $H_0$: a significant correlation exists.
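
Example 2 also offers a way to see why r and the sample slope carry much the same information: in simple regression, the t statistic for $H_0: \rho = 0$ and the t statistic for $H_0: \beta_1 = 0$ are the same number. The sketch below works only from the sums on the slides (n = 7, as stated in the degrees of freedom) and uses the identity $SSE = SS_{YY} - \hat{\beta}_1 SS_{XY}$, which is not on the slides but follows from the least squares formulas.

```python
import math

# Sums from the Example 2 slides.
n = 7
sum_x, sum_x2 = 4.2, 2.86
sum_y, sum_y2 = 32, 161.5
sum_xy = 21.35

ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

# Route 1: t from the correlation coefficient.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
t_from_r = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))

# Route 2: t from the slope estimate and its standard error.
b1_hat = ss_xy / ss_xx
sse = ss_yy - b1_hat * ss_xy          # identity for the least squares line
s = math.sqrt(sse / (n - 2))
t_from_slope = (b1_hat - 0) / (s / math.sqrt(ss_xx))

t_crit = 2.015  # one-tailed, df = 5, alpha = .05 (value from the slide)
reject_h0 = t_from_r > t_crit

print("r:", r, "t from r:", t_from_r, "t from slope:", t_from_slope)
print("reject H0:", reject_h0)
```

Both routes give a t of about 6.48, matching the slide, so testing the slope and testing the correlation lead to the same decision here.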
