Title: Regression and correlation methods
1 Chapter 11
- Regression and correlation methods
2 (No Transcript)
3 Goals
- To relate (associate) a continuous random variable, preferably normally distributed, to other variables
4 Terminology
- Dependent variable (Y)
- The variable which is supposed to depend on the others, e.g., birthweight
- Independent variable, explanatory variable, or predictor (x)
- The variables which are used to predict the dependent variable, or to explain the variation in the dependent variable, e.g., estriol levels
5 Assumptions
- Dependent variable
- Continuous, preferably normally distributed
- Has a linear association with the predictors
- Independent variable
- Fixed (not random)
6 Simple Linear Regression Model
- Let Y be the dependent variable and x be the lone covariate. A linear regression assumes that the true relationship between Y and x is given by
- E(Y|x) = α + βx    (1)
7 Simple Linear Regression Model
- (1) can be written as
- Y = α + βx + e,    (2)
- where
- e is an error term with mean 0 and variance σ².
8 [Figure: scatterplot with fitted regression line; e marks the vertical deviation of a point from the line]
9 Implication
- If there were a perfect linear relationship, every subject with the same value of x would have a common value of Y (a deterministic relationship).
- The error term takes into account the inter-patient variability.
- σ² = Var(Y) = Var(e).
10 Parameters
- α is the intercept of the line.
- β is the slope of the line, referred to as the regression coefficient.
- β < 0 indicates a negative linear association (the higher the x, the smaller the Y).
- β = 0 indicates no linear relationship.
- β > 0 indicates a positive linear association (the higher the x, the larger the Y).
- β is the amount of change in Y for a unit change in x.
11 Data
Estriol (mg/24hr)   Birthweight (g/100)
x1 = 7              y1 = 25
x2 = 9              y2 = 25
x3 = 9              y3 = 25
x4 = 12             y4 = 27
...                 ...
12 Goal
- How to estimate α, β, and σ²? (Fitting regression lines)
- How to draw inference? Is the relationship we see just due to chance? (Inference about regression parameters)
13 Fitting Regression Line
14 Least square method
- Idea: estimate α and β so that the observations are as close to the line as possible
- Making every observation fall exactly on the line is impossible
- Implementation: estimate α and β so that the sum of squared deviations from the line is minimized.
15 Least square method
Least square estimate of β:
b = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]
Least square estimate of α:
a = (Σyi − bΣxi)/n
Estimated regression line: ŷ = a + bx
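A minimal sketch of these least-squares formulas, applied to the first four estriol/birthweight pairs shown in the Table 11.1 excerpt (the full table has more observations, so these toy estimates will not match Example 11.3):

```python
def least_squares(x, y):
    """Return (a, b): intercept and slope minimizing the sum of squared deviations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)  # b = Lxy / Lxx
    a = (sy - b * sx) / n                          # a = (Σyi − bΣxi)/n
    return a, b

x = [7, 9, 9, 12]     # estriol (mg/24hr), first four rows of Table 11.1
y = [25, 25, 25, 27]  # birthweight (g/100)
a, b = least_squares(x, y)
print(a, b)           # roughly a ≈ 21.51, b ≈ 0.431 for this 4-point subset
```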
16 Example 11.3
- Estimate the regression line for the birthweight data in Table 11.1, i.e.,
- Estimate the intercept a and slope b
- We do the following calculations (see the corresponding Excel file)
17 Regression analysis for the data in Table 11.1
- Sum of products = 17500 (1)
- Sum of X = 534 (2)
- Sum of Y = 992 (3)
- Sum of squared X = 9876 (4)
- Corrected sum of products = (1) − (2)(3)/n: Lxy = 412 (5)
- Corrected sum of squares of X = (4) − (2)²/n: Lxx = 677.4194 (6)
- Regression coefficient = (5)/(6): b = Lxy/Lxx = 0.60819 (7)
- Intercept = [(3) − (7)(2)]/n: a = 21.52343
- Estimated regression line: Birthweight (g/100) = 21.52 + 0.61 × Estriol (mg/24hr)
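The Excel-style calculation above can be reproduced from the summary sums alone; the sketch below assumes n = 31 mother/newborn pairs, as used in the intercept arithmetic:

```python
# Reproduce Example 11.3 from the summary sums (1)-(4) on the slide.
n = 31
sum_xy = 17500  # (1) sum of products
sum_x = 534     # (2)
sum_y = 992     # (3)
sum_x2 = 9876   # (4)

Lxy = sum_xy - sum_x * sum_y / n  # corrected sum of products
Lxx = sum_x2 - sum_x ** 2 / n     # corrected sum of squares of X
b = Lxy / Lxx                     # slope
a = (sum_y - b * sum_x) / n       # intercept
print(f"Lxy={Lxy:.1f}, Lxx={Lxx:.4f}, b={b:.5f}, a={a:.5f}")
```

The printed values agree with the slide: Lxy = 412, Lxx ≈ 677.42, b ≈ 0.608, a ≈ 21.52.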
18 Regression Analysis Interpretation
- There is a positive association (statistically significant or not, we will test later) between birthweight and estriol levels.
- For each 1 mg/24hr increase in estriol level, the predicted birthweight of the newborn increases by 61 g.
19 Prediction
- The predicted value of Y for a given value of x is
- ŷ = a + bx
20 Prediction
- What is the estimated (predicted) birthweight if a pregnant woman has an estriol level of 15 mg/24hr?
- ŷ = 21.52 + 0.61 × 15 ≈ 30.65 (g/100) = 3065 g
21 Calibration
- If low birthweight is defined as < 2500 g, at what estriol level would the newborn be of low birthweight?
- That is, to what value of the estriol level does a predicted birthweight of 2500 g correspond?
22 Calibration
- Solving ŷ = a + bx for x gives x = (ŷ − a)/b.
- For ŷ = 25 (g/100): x = (25 − a)/b = 5.72 (using the unrounded estimates of a and b).
- Women having an estriol level of 5.72 mg/24hr or lower are expected to have low birthweight newborns.
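Prediction and calibration are the same line read in opposite directions. A small sketch using the rounded coefficients from Example 11.3 (results therefore differ slightly from the slides' full-precision 30.65 and 5.72):

```python
# Prediction and calibration with the fitted line y_hat = a + b*x.
a, b = 21.52, 0.61  # rounded estimates from Example 11.3

def predict(x):
    """Predicted birthweight (g/100) for estriol level x (mg/24hr)."""
    return a + b * x

def calibrate(y_hat):
    """Estriol level whose predicted birthweight (g/100) equals y_hat."""
    return (y_hat - a) / b

print(predict(15))    # about 30.7 g/100, i.e. roughly 3065 g
print(calibrate(25))  # about 5.7 mg/24hr (5.72 with unrounded coefficients)
```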
23 Goodness of fit of a regression line
- How good is x in predicting Y?
Estriol (mg/24hr)   Birthweight (g/100)   Predicted Birthweight (g/100)   Residual
x1 = 7              y1 = 25               25.78                           r1 = −0.78
x2 = 9              y2 = 25               26.99                           r2 = −1.99
x3 = 9              y3 = 25               26.99                           r3 = −1.99
x4 = 12             y4 = 27               28.82                           r4 = −1.82
...                 ...                   ...                             ...
24 Goodness of fit of a regression line
- Residual sum of squares (Res SS) = Σ(yi − ŷi)²
- A summary measure of the distance between the observed and predicted values
- The smaller the Res SS, the better the regression line is in predicting Y
25 Total variation in observed Y
Total SS = Σ(yi − ȳ)²
A summary measure of the variation in Y
26 Total variation in predicted Y
Regression SS (Reg SS) = Σ(ŷi − ȳ)²
A summary measure of the variation in the predicted Y
27 Goodness of fit of a regression line
28 Goodness of fit of a regression line
- It can be shown that
- Total SS = Reg SS + Res SS
- The smaller the residual SS, the closer the total and regression sums of squares are, and the better the regression is
29 Coefficient of determination R²
R² = Reg SS / Total SS is the proportion of the total variation in Y explained by the regression on x. R² lies between 0 and 1. R² = 1 implies a perfect fit (all the points are on the line).
30 F-test
- Another way of formally assessing how good the regression of Y on x is, is through the F-test.
- The F-test compares Reg SS to Res SS.
- A larger F indicates a better regression fit.
31 F-test
- Test H0: β = 0 vs. Ha: β ≠ 0
- Test statistic: F = Reg SS / [Res SS/(n − 2)], which follows an F distribution with 1 and n − 2 d.f. under H0
- Reject H0 if F > F(1, n−2, 1−α)
32 Summary of Goodness of regression fit
- We need to compute three quantities
- Total SS = Lyy
- Reg. SS = bLxy
- Res. SS = Total SS − Reg. SS
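These three quantities, together with R² and the F statistic, can be sketched from the corrected sums found earlier for the birthweight data (Lyy = 674 is taken from Example 11.12; n = 31 is assumed as before):

```python
# Sum-of-squares decomposition, R^2, and F for the birthweight data.
n = 31
Lxy, Lxx, Lyy = 412.0, 677.419, 674.0
b = Lxy / Lxx

total_ss = Lyy
reg_ss = b * Lxy                      # Reg. SS = b * Lxy
res_ss = total_ss - reg_ss            # Res. SS = Total SS - Reg. SS
r2 = reg_ss / total_ss                # coefficient of determination
f_stat = reg_ss / (res_ss / (n - 2))  # F with 1 and n-2 d.f.
print(f"Reg SS={reg_ss:.2f}, R^2={r2:.2f}, F={f_stat:.2f}")
```

The output matches slide 33: Reg SS ≈ 250.57, R² ≈ 0.37, F ≈ 17.16.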
33 Example 11.12
- Total SS = 674
- Reg. SS = 250.57
- R² = 0.37 => 37% of the variation in birthweight is explained by the regression on estriol level
- F = 17.16
- p-value = P(F1,29 > 17.16) = 0.0003
- H0 is rejected => the slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight
34 T-test
- The same hypothesis can be tested using a t-test.
35 T-test
- Test statistic: t = b / SE(b), where SE(b) = s / √Lxx and s² = Res SS/(n − 2)
- Under H0: β = 0, t follows a t distribution with n − 2 d.f.
36 T-test
- p-value = 2 × Pr(t(n−2) > |t|)
- 100(1−α)% CI for β: b ± t(n−2, 1−α/2) × SE(b)
37 Example 11.12
- Is the regression coefficient (slope) for the estriol level significantly different from zero?
- s² = 14.6, s = 3.82
- SE(b) = 0.15, t = 4.14
- p = 0.00027123
- 95% CI for the regression coefficient: (0.31, 0.91)
- H0: β = 0 is rejected => the slope of the regression line is significantly different from zero, implying a statistically significant linear relationship between estriol level and birthweight
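The numbers in Example 11.12 can be sketched from the quantities already computed; the critical value t(29, 0.975) ≈ 2.045 below is an assumed table lookup rather than something computed here:

```python
import math

# t-test for H0: beta = 0 in Example 11.12 (n = 31, so n - 2 = 29 d.f.).
n = 31
Lxx = 677.419
b = 0.60819
res_ss = 423.43  # Total SS - Reg. SS = 674 - 250.57

s2 = res_ss / (n - 2)       # s^2 = Res SS / (n - 2), about 14.6
se_b = math.sqrt(s2 / Lxx)  # SE(b) = s / sqrt(Lxx), about 0.15
t = b / se_b                # about 4.14
t_crit = 2.045              # assumed: t(29, 0.975) from a t table
ci = (b - t_crit * se_b, b + t_crit * se_b)  # 95% CI, about (0.31, 0.91)
print(f"s^2={s2:.1f}, SE(b)={se_b:.2f}, t={t:.2f}, CI=({ci[0]:.2f}, {ci[1]:.2f})")
```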
38 Correlation
- Correlation refers to a quantitative measure of the strength of the linear relationship between two variables
- Regression, on the other hand, is used for prediction
- No distinction between dependent and independent variables is made when assessing the correlation
39 Correlation Example 11.14
40 Correlation
41 Correlation coefficient
- Population correlation coefficient: ρ = Cov(X, Y)/(σX σY) (see Section 5.4.2 in my notes)
- If X and Y could be measured on everyone in the population, we could have calculated ρ.
42 Interpretation of ρ
- ρ lies between −1 and 1,
- ρ = 0 implies no linear relationship,
- ρ = −1 implies a perfect negative linear relationship,
- ρ = 1 implies a perfect positive linear relationship.
43 Sample correlation coefficient
- Unfortunately, we cannot measure X and Y on everyone in the population.
- We estimate ρ from the sample data as r = Lxy / √(Lxx Lyy)
44 Interpretation of r
- r lies between −1 and 1,
- r = 0 implies no linear relationship,
- r = −1 implies a perfect negative linear relationship,
- r = 1 implies a perfect positive linear relationship,
- The closer |r| is to 1, the stronger the relationship.
45 Sample correlation coefficient
r = 1
46 Sample correlation coefficient
r = −1
47 Sample correlation coefficient
r = 0
48 Sample correlation coefficient
r = 0.988
49 Sample correlation coefficient
r = 0.49
50 Sample correlation coefficient
r = −0.37
51 Correlation Example 11.14
- Sum of products = 5156.2 (1)
- Sum of X = 1872 (2)
- Sum of Y = 32.3 (3)
- Sum of squared X = 294320 (4)
- Sum of squared Y = 93.11 (5)
- Corrected sum of products = (1) − (2)(3)/n: Lxy = 117.4 (6)
- Corrected sum of squares of X = (4) − (2)²/n: Lxx = 2288 (7)
- Corrected sum of squares of Y = (5) − (3)²/n: Lyy = 6.17 (8)
- Sample correlation coefficient = (6)/√((7)(8)): r = 0.988
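The calculation above can be sketched directly from the summary sums; n = 12 is assumed here, since it is the value that reproduces Lxy = 117.4, Lxx = 2288, and Lyy = 6.17 from sums (1)-(5):

```python
import math

# Sample correlation r = Lxy / sqrt(Lxx * Lyy) for Example 11.14 (FEV vs. height).
n = 12
sum_xy, sum_x, sum_y = 5156.2, 1872, 32.3
sum_x2, sum_y2 = 294320, 93.11

Lxy = sum_xy - sum_x * sum_y / n   # corrected sum of products
Lxx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares of X
Lyy = sum_y2 - sum_y ** 2 / n      # corrected sum of squares of Y
r = Lxy / math.sqrt(Lxx * Lyy)
print(f"Lxy={Lxy:.1f}, Lxx={Lxx:.0f}, Lyy={Lyy:.2f}, r={r:.3f}")
```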
52 Correlation Example 11.14
- Since r = 0.988, there is a nearly perfect positive correlation between mean FEV and height. The taller a person is, the higher the FEV level.
- Had we done a regression of one of the variables (FEV or height) on the other, the R² would have been R² = r² = 0.97698. This implies that about 98% of the variation in one variable is explained by the other.
53 Correlation Example 11.24
- The sample correlation coefficient between estriol levels and birth weights is calculated as r = 0.61, implying a moderately strong positive linear relationship. The higher the estriol level, the higher the birth weight.
- Remember, R² = 0.37 (slide 33), which is equal to r² = (0.61)².
54 Statistical Significance of Correlation
- If r is close to 1, such as 0.988, one would believe that there is a strong linear relationship between the two variables. That is, there is no reason to believe that this strong association arose just by chance (through sampling/observation).
55 Statistical Significance of Correlation
- But if r = 0.23, what conclusion would you draw about the relationship? Is it possible that in truth there is no correlation (ρ = 0), and the sample only shows some sort of correlation between the two variables by chance?
56 Significance test for correlation coefficient
- Test the hypothesis H0: ρ = 0 vs. Ha: ρ ≠ 0.
- Under the assumption that both variables are normally distributed, the test statistic is t = r√(n − 2) / √(1 − r²), which follows a t distribution with n − 2 d.f. under H0.
- Calculate the two-sided p-value from a t distribution with (n − 2) d.f.
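A minimal sketch of this test statistic, using the estriol/birthweight values r = 0.61 and n = 31 from Example 11.24 (with r rounded to 0.61, the result is about 4.15; the slides' 4.14 comes from the unrounded r):

```python
import math

# t-statistic for H0: rho = 0, with t = r * sqrt(n - 2) / sqrt(1 - r^2).
def corr_t(r, n):
    """t statistic on n - 2 d.f. for testing zero correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = corr_t(0.61, 31)
print(f"t = {t:.2f} on {31 - 2} d.f.")  # essentially the slope t-test of slide 37
```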
57 Correlation Example 11.24
- The sample correlation coefficient between estriol levels and birth weights is calculated as r = 0.61.
- Is the correlation significant? (Is the correlation coefficient significantly different from zero?)
58 Correlation Example 11.24
- Since the p-value is very small, we reject the null hypothesis.
- The correlation is statistically significant (p = 0.0003). => We have enough evidence to conclude that the correlation coefficient is significantly different from zero.
- Did you notice that the t-statistic (t = 4.14) and p-value (0.00027) for testing H0: ρ = 0 are exactly the same as those calculated for H0: β = 0 on slide 37?
59 Significance test for correlation coefficient
- Test the hypothesis H0: ρ = ρ0 vs. Ha: ρ ≠ ρ0.
- Let z = (1/2) ln[(1 + r)/(1 − r)] (Fisher's Z transformation).
60 Significance test for correlation coefficient
- Then under H0, z approximately follows a normal distribution with mean z0 = (1/2) ln[(1 + ρ0)/(1 − ρ0)] and variance 1/(n − 3).
- The p-value for the test can then be calculated from a standard normal distribution.
- We will mainly use this result to find confidence intervals for ρ.
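The transformation and its inverse can be sketched as follows, illustrated with r = 0.61 and n = 31 from Example 11.24; the approximate normality of z with variance 1/(n − 3), and the normal quantile 1.96, are the assumptions behind the interval:

```python
import math

def fisher_z(r):
    """Fisher's Z transformation: z = (1/2) ln[(1 + r)/(1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Map a z value back to the correlation scale."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

r, n = 0.61, 31
z = fisher_z(r)                     # z is approx. N(z0, 1/(n - 3))
half = 1.96 / math.sqrt(n - 3)      # 95% half-width on the z scale
lo = inverse_fisher_z(z - half)
hi = inverse_fisher_z(z + half)
print(f"z={z:.3f}, approximate 95% CI for rho: ({lo:.2f}, {hi:.2f})")
```

Note that the interval is computed on the z scale, where the normal approximation holds, and then transformed back, so it is not symmetric around r.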
61 Confidence Interval for ρ