Regression and correlation methods - PowerPoint PPT Presentation

About This Presentation
Title:

Regression and correlation methods

Description:

Chapter 11, Regression and correlation methods. Goals: to relate (associate) a continuous random variable, preferably normally distributed, to other variables.

Slides: 62
Provided by: ABD139
Learn more at: https://sites.pitt.edu

Transcript and Presenter's Notes

Title: Regression and correlation methods


1
Chapter 11
  • Regression and correlation methods

2
(No Transcript)
3
Goals
  • To relate (associate) a continuous random
    variable, preferably normally distributed, to
    other variables

4
Terminology
  • Dependent variable (Y)
  • The variable which is supposed to depend on
    others, e.g., birthweight
  • Independent variable, explanatory variable, or
    predictor (x)
  • The variables which are used to predict the
    dependent variable, or explain the variation in
    the dependent variable, e.g., estriol levels

5
Assumptions
  • Dependent variable
  • Continuous, preferably normally distributed
  • Has a linear association with the predictors
  • Independent variable
  • Fixed (not random)

6
Simple Linear Regression Model
  • Let Y be the dependent variable and x be the
    lone covariate. Then linear regression assumes
    that the true relationship between Y and x is
    given by
  • E(Y|x) = α + βx   (1)

7
Simple Linear Regression Model
  • (1) can be written as
  • Y = α + βx + e,   (2)
  • where
  • e is an error term with mean 0 and variance σ².

8
[Figure: scatterplot with fitted line; the vertical deviations e are the error terms]
9
Implication
  • If there were a perfect linear relationship, every
    subject with the same value of x would have a
    common value of Y.
  • Deterministic relationship
  • The error term takes into account the
    inter-patient variability.
  • σ² = Var(Y) = Var(e).

10
Parameters
  • α is the intercept of the line.
  • β is the slope of the line, referred to as the
    regression coefficient.
  • β < 0 indicates a negative linear association
    (the higher the x, the smaller the Y).
  • β = 0 indicates no linear relationship.
  • β > 0 indicates a positive linear association
    (the higher the x, the larger the Y).
  • β is the amount of change in Y for a unit change
    in x.

11
Data
Estriol (mg/24hr)   Birthweight (g/100)
x1 = 7              y1 = 25
x2 = 9              y2 = 25
x3 = 9              y3 = 25
x4 = 12             y4 = 27
...                 ...
12
Goal
  • How do we estimate α, β, and σ²?
  • Fitting regression lines
  • How do we draw inference? Is the relationship we
    see just due to chance?
  • Inference about regression parameters

13
Fitting Regression Line
  • Least squares method

14
Least square method
  • Idea
  • Estimate α and β so that the observations
    are closest to the line
  • Impossible to satisfy for every point at once
  • Implementation
  • Estimate α and β so that the sum of squared
    deviations is minimized.

15
Least squares method
  • Minimize
  • Σ(yi - a - bxi)²

Least squares estimate of β:
b = [Σxiyi - (Σxi)(Σyi)/n] / [Σxi² - (Σxi)²/n] = Lxy/Lxx
Least squares estimate of α:
a = (Σyi - bΣxi)/n
Estimated regression line: ŷ = a + bx
16
Example 11.3
  • Estimate the regression line for the birthweight
    data in Table 11.1, i.e.
  • Estimate the intercept a and slope b
  • We do the following calculations (see the
    corresponding Excel file)

17
Regression analysis for the data in Table 11.1
  • Sum of products 17500 (1)
  • Sum of X 534 (2)
  • Sum of Y 992 (3)
  • Sum of squared x 9876 (4)
  • Corrected Sum of products (1) - (2)(3)/n
    Lxy412 (5)
  • Corrected Sum of products (4) - (2)(2)/n
    Lxx677.4194 (6)
  • Regression coefficient (5)/(6) bLxy/Lxx0.6
    0819 (7)
  • Intercept (3) - (7)(2)/n a21.52343
  • Estimated Regression Line Birthweight (g/100)
    21.52 0.61 Estriol (mg/24hr)
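As a check, the calculations above can be reproduced from the slide's summary sums; a minimal sketch in Python, assuming n = 31 pairs as in Table 11.1:

```python
# Least-squares fit for Example 11.3 from the slide's summary sums.
# Assumes n = 31 observations, as in Table 11.1.
n = 31
sum_xy = 17500    # (1) sum of products
sum_x = 534       # (2) sum of x, estriol (mg/24hr)
sum_y = 992       # (3) sum of y, birthweight (g/100)
sum_x2 = 9876     # (4) sum of squared x

Lxy = sum_xy - sum_x * sum_y / n    # corrected sum of products
Lxx = sum_x2 - sum_x ** 2 / n       # corrected sum of squares of x
b = Lxy / Lxx                       # slope, about 0.60819
a = (sum_y - b * sum_x) / n         # intercept, about 21.52343

print(f"b = {b:.5f}, a = {a:.5f}")
```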

18
Regression Analysis Interpretation
  • There is a positive association (statistically
    significant or not, we will test later) between
    birthweight and estriol levels.
  • For each 1 mg/24hr increase in estriol level, the
    predicted birthweight of the newborn increases by
    about 61 g.

19
Prediction
  • The predicted value of Y for a given value of x
    is ŷ = a + bx

20
Prediction
  • What is the estimated (predicted) birthweight if
    a pregnant woman has an estriol level of 15
    mg/24hr?

ŷ = 21.52 + 0.61 × 15 = 30.65 (g/100) = 3065 g
21
Calibration
  • If low birthweight is defined as < 2500 g, at
    what estriol level would the newborn be expected
    to be low birthweight?
  • That is, to what value of estriol level does a
    predicted birthweight of 2500 g correspond?

22
Calibration
Solving 25 = a + bx for x with the unrounded estimates
gives x ≈ 5.72. Women having an estriol level of 5.72
mg/24hr or lower are expected to have low-birthweight
newborns.
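Prediction and calibration are just the fitted line evaluated forward and inverted; a short sketch using the unrounded estimates from Example 11.3:

```python
# Prediction and inverse prediction (calibration) with the fitted line
# from Example 11.3; a and b are the unrounded slide estimates.
a, b = 21.52343, 0.60819

def predict(x):
    """Predicted birthweight (g/100) for estriol level x (mg/24hr)."""
    return a + b * x

def calibrate(y):
    """Estriol level whose predicted birthweight (g/100) equals y."""
    return (y - a) / b

print(predict(15))    # about 30.65 g/100, i.e. about 3065 g
print(calibrate(25))  # about 5.72 mg/24hr (the 2500 g threshold)
```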
23
Goodness of fit of a regression line
  • How good is x in predicting Y?

Estriol (mg/24hr)  Birthweight (g/100)  Predicted (g/100)  Residual
x1 = 7             y1 = 25              25.78              r1 = -0.78
x2 = 9             y2 = 25              26.99              r2 = -1.99
x3 = 9             y3 = 25              26.99              r3 = -1.99
x4 = 12            y4 = 27              28.82              r4 = -1.82
...                ...                  ...                ...
24
Goodness of fit of a regression line
  • Residual sum of squares (Res. SS) = Σ(yi - ŷi)²

Summary measure of distance between the observed
and predicted values. The smaller the Res. SS, the
better the regression line is in predicting Y.
25
Total variation in observed Y
  • Total sum of squares (Total SS) = Σ(yi - ȳ)²

Summary measure of variation in Y
26
Total variation in predicted Y
  • Regression sum of squares (Reg. SS) = Σ(ŷi - ȳ)²

Summary measure of variation in predicted Y
27
Goodness of fit of a regression line
28
Goodness of fit of a regression line
  • It can be shown that
  • Total SS = Reg. SS + Res. SS
  • The smaller the residual SS, the closer the total
    and regression sums of squares are, and the better
    the regression is

29
Coefficient of determination R2
R² = Reg. SS / Total SS is the proportion of total
variation in Y explained by the regression on x. R²
lies between 0 and 1. R² = 1 implies a perfect fit
(all the points are on the line).
30
F-test
  • Another way of formally assessing how good the
    regression of Y on x is, is through the F-test.
  • The F-test compares the Reg. SS to the Res. SS
  • A larger F indicates a better regression fit

31
F-test
  • Test H0: β = 0 vs. Ha: β ≠ 0
  • Test statistic: F = (Reg. SS / 1) / (Res. SS / (n - 2)),
    which follows an F distribution with 1 and n - 2
    d.f. under H0
  • Reject H0 if F > F(1, n-2, 1-α)

32
Summary of Goodness of regression fit
  • We need to compute three quantities
  • Total SS
  • Reg. SS
  • Res. SS
  • Total SS = Lyy
  • Reg. SS = bLxy
  • Res. SS = Total SS - Reg. SS

33
Example 11.12
  • Total SS = 674
  • Reg. SS = 250.57
  • R² = 0.37 => 37% of the variation in birthweight
    is explained by the regression on estriol level
  • F = 17.16
  • p-value = P(F1,29 > 17.16) = 0.0003
  • H0 is rejected => The slope of the regression
    line is significantly different from zero,
    implying a statistically significant linear
    relationship between estriol level and
    birthweight
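The R² and F statistic above follow directly from the two sums of squares; a sketch, assuming n = 31 as on the earlier slides:

```python
# Goodness-of-fit summary for Example 11.12 from the slide's values.
n = 31
total_ss = 674.0
reg_ss = 250.57
res_ss = total_ss - reg_ss             # about 423.43

r_squared = reg_ss / total_ss          # about 0.37
f_stat = reg_ss / (res_ss / (n - 2))   # Reg. MS / Res. MS, 1 and n-2 d.f.

print(f"R2 = {r_squared:.2f}, F = {f_stat:.2f}")
# p-value = P(F(1, 29) > 17.16), about 0.0003 as on the slide
```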

34
T-test
  • Same hypothesis can be tested using a t-test.

35
T-test
  • Test statistic: t = b / SE(b), where SE(b) = s/√Lxx
    and s² = Res. SS/(n - 2); under H0: β = 0, t follows
    a t distribution with n - 2 d.f.
36
T-test
P-value = 2 × Pr(t(n-2) > |t|)
100(1-α)% CI for β: b ± t(n-2, 1-α/2) × SE(b)
37
Example 11.12
  • Is the regression coefficient (slope) for the
    estriol level significantly different from
    zero?
  • s² = 14.6, s = 3.82
  • SE(b) = 0.15, t = 4.14
  • p = 0.00027
  • 95% CI for the regression coefficient: (0.31, 0.91)
  • H0: β = 0 is rejected => The slope of the
    regression line is significantly different from
    zero, implying a statistically significant linear
    relationship between estriol level and
    birthweight
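These numbers can be reproduced from quantities on the earlier slides; a sketch, where t_crit = 2.045 is the t(29, 0.975) percentile:

```python
import math

# t-test for H0: beta = 0 in Example 11.12, from earlier slide values.
n = 31
b = 0.60819                      # slope estimate
Lxx = 677.4194                   # corrected sum of squares of x
res_ss = 674.0 - 250.57          # Total SS - Reg. SS

s2 = res_ss / (n - 2)            # about 14.6
se_b = math.sqrt(s2 / Lxx)       # about 0.15
t = b / se_b                     # about 4.14; note t**2 = F = 17.16

t_crit = 2.045                   # t(29, 0.975) percentile
ci = (b - t_crit * se_b, b + t_crit * se_b)   # about (0.31, 0.91)
print(f"t = {t:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```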

38
Correlation
  • Correlation refers to a quantitative measure of
    the strength of the linear relationship between
    two variables
  • Regression, on the other hand, is used for
    prediction
  • No distinction between dependent and independent
    variables is made when assessing the correlation

39
Correlation Example 11.14
40
Correlation
41
Correlation coefficient
  • Population correlation coefficient
    ρ = Cov(X, Y)/(σx·σy) (see section
    5.4.2 in my notes)
  • If X and Y could be measured on everyone in the
    population, we could have calculated ρ.

42
Interpretation of ρ
  • ρ lies between -1 and 1,
  • ρ = 0 implies no linear relationship,
  • ρ = -1 implies perfect negative linear
    relationship,
  • ρ = 1 implies perfect positive linear
    relationship.

43
Sample correlation coefficient
  • Unfortunately, we cannot measure X and Y on
    everyone in the population.
  • We estimate ρ from the sample data as follows:
  • r = Lxy / √(Lxx·Lyy)

44
Interpretation of r
  • r lies between -1 and 1,
  • r = 0 implies no linear relationship,
  • r = -1 implies perfect negative linear
    relationship,
  • r = 1 implies perfect positive linear
    relationship,
  • The closer |r| is to 1, the stronger the linear
    relationship is.

45
Sample correlation coefficient
r = 1
46
Sample correlation coefficient
r = -1
47
Sample correlation coefficient
r = 0
48
Sample correlation coefficient
r = 0.988
49
Sample correlation coefficient
r = 0.49
50
Sample correlation coefficient
r = -0.37
51
Correlation Example 11.14
  • Sum of products = 5156.2   (1)
  • Sum of X = 1872   (2)
  • Sum of Y = 32.3   (3)
  • Sum of squared X = 294,320   (4)
  • Sum of squared Y = 93.11   (5)
  • Corrected sum of products: Lxy = (1) - (2)(3)/n
    = 117.4   (6)
  • Corrected sum of squares of X: Lxx = (4) - (2)²/n
    = 2288   (7)
  • Corrected sum of squares of Y: Lyy = (5) - (3)²/n
    = 6.17   (8)
  • Sample correlation coefficient: r = (6)/√[(7)(8)]
    = 0.988
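The same arithmetic in Python; n = 12 is inferred from the corrected sums on the slide (294320 - 1872²/12 = 2288):

```python
import math

# Sample correlation for Example 11.14 from the slide's summary sums.
n = 12  # inferred from the corrected sums of squares
sum_xy, sum_x, sum_y = 5156.2, 1872, 32.3
sum_x2, sum_y2 = 294320, 93.11

Lxy = sum_xy - sum_x * sum_y / n   # corrected sum of products, 117.4
Lxx = sum_x2 - sum_x ** 2 / n      # 2288
Lyy = sum_y2 - sum_y ** 2 / n      # about 6.17
r = Lxy / math.sqrt(Lxx * Lyy)     # about 0.988
print(f"r = {r:.3f}")
```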

52
Correlation Example 11.14
  • Since r = 0.988, there exists a nearly perfect
    positive correlation between mean FEV and
    height. The taller a person is, the higher the FEV
    level.
  • Had we done a regression of one of the variables
    (FEV or height) on the other, the R² would have
    been R² = r² = 0.977. This implies that about 98%
    of the variation in one variable is explained by
    the other.

53
Correlation Example 11.24
  • The sample correlation coefficient between
    estriol levels and birth weights is
    calculated as r = 0.61, implying a moderately
    strong positive linear relationship. The higher
    the estriol levels, the higher the birth weights.
  • Remember, R² = 0.37 (slide 33), which is equal
    to r² = (0.61)².

54
Statistical Significance of Correlation
  • If r is close to 1, such as 0.988, one would
    believe that there is a strong linear
    relationship between the two variables. That is,
    there is no reason to believe that this strong
    association happened just by chance (sampling
    variation).

55
Statistical Significance of Correlation
  • But if r = 0.23, what conclusion would you draw
    about the relationship? Is it possible that in
    truth there was no correlation (ρ = 0), and the
    sample by chance only shows some
    correlation between the two variables?

56
Significance test for correlation coefficient
  • Test the hypothesis H0: ρ = 0 vs. Ha:
    ρ ≠ 0.
  • Under the assumption that both variables are
    normally distributed, the test statistic is
    t = r√(n - 2)/√(1 - r²), which follows a t
    distribution with (n - 2) d.f. under H0.
  • Calculate the two-sided p-value from a t
    distribution with (n - 2) d.f.

57
Correlation Example 11.24
  • The sample correlation coefficient between
    estriol levels and birth weights is
    calculated as r = 0.61.
  • Is the correlation significant? (Is the
    correlation coefficient significantly different
    from zero?)

58
Correlation Example 11.24
  • Since the p-value is very small, we reject the
    null hypothesis.
  • The correlation is statistically significant
    (p = 0.0003). => We have enough evidence to
    conclude that the correlation coefficient is
    significantly different from zero.
  • Did you notice that the t-statistic (t = 4.14)
    and p-value (0.00027) for testing H0: ρ = 0 are
    exactly the same as those calculated for
    H0: β = 0 on slide 37?
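The equivalence with the slope test can be checked numerically; a sketch using the unrounded r = √R² (R² = 250.57/674 from slide 33) so the result matches the slope t exactly:

```python
import math

# t-test for H0: rho = 0 on the estriol data (n = 31), using the
# unrounded r = sqrt(R^2) so the result matches the slope test.
n = 31
r = math.sqrt(250.57 / 674.0)                    # about 0.61
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # about 4.14
print(f"r = {r:.2f}, t = {t:.2f}")
```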

59
Significance test for correlation coefficient
  • Test the hypothesis H0: ρ = ρ0 vs. Ha:
    ρ ≠ ρ0.
  • Let z = ½·ln[(1 + r)/(1 - r)] (Fisher's Z
    transformation),

60
Significance test for correlation coefficient
  • Then under H0, z is approximately normal with mean
    z0 = ½·ln[(1 + ρ0)/(1 - ρ0)] and variance
    1/(n - 3), so λ = (z - z0)·√(n - 3) ~ N(0, 1)
  • The p-value for the test can then be calculated
    from a standard normal distribution
  • We will mainly use this result to find confidence
    intervals for ρ

61
Confidence Interval for ρ
A 100(1-α)% CI for ρ: compute z ± z(1-α/2)/√(n - 3),
then transform the endpoints back via
ρ = (e^(2z) - 1)/(e^(2z) + 1).
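A sketch of a Fisher's-Z confidence interval, illustrated with the estriol data (r = 0.61, n = 31; 1.96 is the N(0, 1) 97.5th percentile):

```python
import math

# 95% confidence interval for rho via Fisher's Z transformation,
# illustrated with the estriol data (r = 0.61, n = 31).
r, n = 0.61, 31
z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's Z
half = 1.96 / math.sqrt(n - 3)          # z has variance 1/(n - 3)
z_lo, z_hi = z - half, z + half

# invert: rho = (e^(2z) - 1)/(e^(2z) + 1) = tanh(z)
ci = (math.tanh(z_lo), math.tanh(z_hi))
print(f"95% CI for rho = ({ci[0]:.2f}, {ci[1]:.2f})")  # about (0.33, 0.79)
```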