Title: Regression: (1) Simple Linear Regression
Regression: (1) Simple Linear Regression
- Hal Whitehead
- BIOL4062 / 5062
Regression
- Purposes of regression
- Simple linear regression
- Formula
- Assumptions
- If assumptions hold, what can we do?
- Testing assumptions
- When assumptions do not hold
Regression
- One dependent variable: Y
- Independent variables: X1, X2, X3, ...
Purposes of Regression
- 1. Relationship between Y and the X's
- 2. Quantitative prediction of Y
- 3. Relationship between Y and X, controlling for C
- 4. Which of the X's are most important?
- 5. Best mathematical model
- 6. Compare regression relationships: Y1 on X, Y2 on X
- 7. Assess interactive effects of the X's
- Simple regression: one X
- Multiple regression: two or more X's
Simple linear regression
- Model: Y = β0 + β1·X + E
Assumptions of simple linear regression
- 1. Existence
- 2. Independence
- 3. Linearity
- 4. Homoscedasticity
- 5. Normality
- 6. X measured without error
Assumptions of simple linear regression
- 1. For any fixed value of X, Y is a random variable with a certain probability distribution having finite mean and variance (Existence)
(Figure: probability distribution of Y at each fixed X)
Assumptions of simple linear regression
- 2. The Y values are statistically independent of one another (Independence)
Assumptions of simple linear regression
- 3. The mean value of Y, given X, is a straight-line function of X (Linearity)
(Figure: distributions of Y centred on a straight line in X)
Assumptions of simple linear regression
- 4. The variance of Y is the same for all X (Homoscedasticity)
(Figure: distributions of Y with equal spread at every X)
Assumptions of simple linear regression
- 5. For any fixed value of X, Y has a normal distribution (Normality)
(Figure: normal distributions of Y at each fixed X)
Assumptions of simple linear regression
- 6. There are no measurement errors in X (X measured without error)
Assumptions of simple linear regression
- 1. Existence
- 2. Independence
- 3. Linearity
- 4. Homoscedasticity
- 5. Normality
- 6. X measured without error
If assumptions hold, what can we do?
- 1. Estimate β0 (intercept) and β1 (slope), together with measures of uncertainty
- 2. Describe quality of fit (variation of data around the straight line) by estimates of s² or r²
- 3. Tests of slope and intercept
- 4. Prediction and prediction bands
- 5. ANOVA table
Parameters estimated using least-squares
- Age-specific pregnancy rates of female sperm whales (from Best et al. 1984, Rep. Int. Whal. Commn. Spec. Issue)
- Find the line which minimizes the sum of squared residuals
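The least-squares line can be computed in closed form from the sums of squares; a minimal sketch with numpy, using hypothetical illustrative data (not the sperm-whale values):

```python
import numpy as np

# Hypothetical illustrative data (not the sperm-whale pregnancy rates)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares: choose b0, b1 to minimize the sum of squared residuals
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))
b1 = Sxy / Sxx              # slope estimate
b0 = ybar - b1 * xbar       # intercept estimate

# Measure of uncertainty: standard error of the slope
n = len(x)
e = y - (b0 + b1 * x)                 # residuals
s2 = np.sum(e ** 2) / (n - 2)         # residual variance
se_b1 = np.sqrt(s2 / Sxx)

print(b0, b1, se_b1)
```

A 95% confidence interval for the slope is then b1 ± t(0.975, n−2)·se_b1, as reported on the next slide for the whale data.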
1. Estimate β0 (intercept) and β1 (slope), together with measures of uncertainty
- Age-specific pregnancy rates of female sperm whales (from Best et al. 1984, Rep. Int. Whal. Commn. Spec. Issue)
1. Estimate β0 (intercept) and β1 (slope), together with measures of uncertainty
- β0 = 0.230 (SE 0.028); 95% c.i.: 0.164 to 0.296
- β1 = −0.0035 (SE 0.0009); 95% c.i.: −0.0056 to −0.0013
2. Describe quality of fit by estimates of s² or r²
- s² = 0.0195
- r² = 0.679
- r² (adjusted) = 0.633
- (Proportion of variance accounted for by the regression)
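These fit summaries follow directly from the residual and total sums of squares; a sketch with numpy on hypothetical data (not the whale values):

```python
import numpy as np

# Hypothetical data; fit by least squares first
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)                      # residuals

SSE = np.sum(e ** 2)                       # residual sum of squares
SST = np.sum((y - y.mean()) ** 2)          # total sum of squares
s2 = SSE / (n - 2)                         # residual variance estimate
r2 = 1 - SSE / SST                         # proportion of variance explained
r2_adj = 1 - (1 - r2) * (n - 1) / (n - 2)  # penalizes extra parameters

print(s2, r2, r2_adj)
```

The adjusted r² is always at most r², and the gap widens as parameters are added relative to n.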
3. Tests of slope and intercept
- a) Slope = 0 (equivalent to r = 0)
- b) Slope = predetermined constant
- c) Intercept = 0
- d) Intercept = predetermined constant
- e) Compare slopes
- f) Compare intercepts (assuming the same slope)
- (tests use the t-distribution)
3a) Slope = 0 (equivalent to r = 0)
- Does pregnancy rate change with age?
- H0: β1 = 0
- H1: β1 ≠ 0
- P = 0.006
- Does pregnancy rate decline with age?
- H0: β1 = 0
- H1: β1 < 0
- P = 0.003
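The two-sided test of H0: β1 = 0 is what `scipy.stats.linregress` reports; the one-sided P for a decline is half the two-sided P when the estimated slope is in the hypothesized direction. A sketch on hypothetical declining data (not the whale values):

```python
import numpy as np
from scipy import stats

# Hypothetical data with a clear downward trend
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
y = np.array([0.30, 0.26, 0.21, 0.18, 0.12, 0.08])

res = stats.linregress(x, y)     # slope, intercept, r, two-sided P, SE
print(res.slope, res.pvalue)

# One-sided test of decline (H1: slope < 0): halve the two-sided P
# when the estimated slope is negative
p_one_sided = res.pvalue / 2 if res.slope < 0 else 1 - res.pvalue / 2
```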
3b) Slope = predetermined constant
- β1 = 2.868 (SE 0.058); 95% c.i.: 2.752 to 2.984
- Does shape change with length (isometry: weight ∝ length³)?
- H0: β1 = 3
- H1: β1 ≠ 3
- P < 0.05
- Weights and lengths of cetacean species (Whitehead & Mann, in Cetacean Societies, 2000)
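Testing the slope against a predetermined constant c uses t = (b1 − c)/SE(b1) with n − 2 degrees of freedom; a sketch with scipy on hypothetical log-log data (not the cetacean values):

```python
import numpy as np
from scipy import stats

# Hypothetical log(weight) vs log(length) data
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.40, 2.95, 4.32, 5.86, 7.35, 8.82])

res = stats.linregress(x, y)
n = len(x)

c = 3.0                                   # predetermined constant (isometry)
t_stat = (res.slope - c) / res.stderr     # t statistic with n - 2 df
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)
print(res.slope, t_stat, p_value)
```

With c = 0 this reduces to the ordinary slope test of 3a).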
3c) Intercept = 0
- β0 = 0.436 (SE 0.080); 95% c.i.: 0.276 to 0.596
- Is birth length proportional to length?
- H0: β0 = 0
- H1: β0 ≠ 0
- P = 0.000
3d) Intercept = predetermined constant
3e) Compare slopes
- β1(m) = 2.528 (SE 0.409)
- β1(o) = 2.962 (SE 0.094)
- Does shape change differently with length for odontocetes and mysticetes?
- H0: β1(m) = β1(o)
- H1: β1(m) ≠ β1(o); P = 0.146
- Weights and lengths of cetacean species (Whitehead & Mann 2000)
3f) Compare intercepts (assuming the same slope)
- β0(m) = 2.528 (SE 0.409)
- β0(o) = 2.962 (SE 0.094)
- Are odontocetes and mysticetes equally fat?
- H0: β0(m) = β0(o)
- H1: β0(m) ≠ β0(o); P = 0.781
(Figure: Log(Weight) against Log(Length), by order: m = mysticetes, o = odontocetes)
4. Prediction and prediction bands
- 95% confidence bands for the regression line
- 95% prediction bands
- From http://www.tufts.edu/gdallal/slr.htm
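The two bands differ only by one term: the confidence band for the mean response at x0 uses s·√(1/n + (x0 − x̄)²/Sxx), while the prediction band for a new observation adds 1 under the square root, so it is always wider. A sketch with numpy/scipy on hypothetical data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x0 = 3.5                                   # point at which to predict
yhat = b0 + b1 * x0
tcrit = stats.t.ppf(0.975, n - 2)          # 95% two-sided critical value
half_conf = tcrit * s * np.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
half_pred = tcrit * s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
print(yhat, half_conf, half_pred)
```

Both bands flare away from x̄, since the (x0 − x̄)² term grows at the extremes of X.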
5. ANOVA Table
- Analysis of Variance

Source       Sum-of-Squares    df    Mean-Square    F-ratio    P
Regression   286.27             1    286.27         2475.07    0.00
Residual     5.32              46    0.12
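The table rests on the decomposition SST = SSR + SSE, with F = MSR/MSE on (1, n − 2) degrees of freedom; a sketch with numpy/scipy on hypothetical data (not the table above):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

SST = np.sum((y - y.mean()) ** 2)         # total sum of squares
SSE = np.sum((y - b0 - b1 * x) ** 2)      # residual sum of squares
SSR = SST - SSE                           # regression sum of squares

MSR = SSR / 1                             # regression df = 1
MSE = SSE / (n - 2)                       # residual df = n - 2
F = MSR / MSE
P = stats.f.sf(F, 1, n - 2)
print(F, P)
```

In simple regression the F-ratio equals the square of the t statistic for the slope, so the ANOVA P matches the two-sided slope test.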
If assumptions hold, what can we do?
- 1. Estimate β0 (intercept) and β1 (slope), together with measures of uncertainty
- 2. Describe quality of fit (variation of data around the straight line) by estimates of s² or r²
- 3. Tests of slope and intercept
- 4. Prediction and prediction bands
- 5. ANOVA table
Testing assumptions: diagnostics
- Use residuals to look at the assumptions of regression
- e(i) = Y(i) − (β0 + β1·X(i)) = observed − predicted
Residuals
- Residual: e(i) = Y(i) − (β0 + β1·X(i))
- Standardized residuals: e(i) / S
  - S is the standard deviation of the residuals, with adjusted degrees of freedom
- Studentized residuals: e(i) / (S·√(1 − h(i)))
  - h(i) is the "leverage value" of observation i:
  - h(i) = 1/n + (X(i) − X̄)² / ((n − 1)·S(X)²), where X̄ = ΣX(i)/n
- Jackknifed residuals: e(i) / (S(−i)·√(1 − h(i)))
  - The residual standard deviation S(−i) is calculated separately with each observation i deleted
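The residual variants above can be computed directly; a sketch with numpy on hypothetical data (note that (n − 1)·S(X)² in the leverage formula equals Σ(X(i) − X̄)²):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
y = np.array([2.0, 4.1, 5.9, 8.3, 9.8, 12.2])
n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar

e = y - (b0 + b1 * x)                     # raw residuals
h = 1 / n + (x - xbar) ** 2 / Sxx         # leverage values
S = np.sqrt(np.sum(e ** 2) / (n - 2))     # residual SD
standardized = e / S
studentized = e / (S * np.sqrt(1 - h))
# Jackknifed: residual SD recomputed with observation i deleted,
# via the shortcut SSE(-i) = SSE - e(i)^2 / (1 - h(i))
S_minus = np.sqrt((np.sum(e ** 2) - e ** 2 / (1 - h)) / (n - 3))
jackknifed = e / (S_minus * np.sqrt(1 - h))
print(studentized)
```

Since 0 < h(i) < 1, each studentized residual is at least as large in magnitude as the corresponding standardized one; high-leverage points are inflated most.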
Use residuals to
- a) look for outliers which we may wish to remove
- b) examine normality
- c) check for linearity
- d) check for homoscedasticity
- e) check for some kinds of non-independence
a) Using residuals to look for outliers
Should outliers be removed?
- Yes, if the outlier was probably not produced by the process being studied
  - measurement error
  - different species
  - ...
- No, if the outlier was probably produced by the process being studied
  - extreme specimen
b) Using residuals to examine normality
- Lilliefors test for normality: P = 0.62
- Lilliefors test for normality (excluding the bowhead whale): P = 0.68
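The Lilliefors test itself is available in statsmodels (`statsmodels.stats.diagnostic.lilliefors`); as a comparable normality check using only scipy, a sketch with the Shapiro-Wilk test on hypothetical residuals:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted regression
resid = np.array([0.12, -0.05, 0.08, -0.11, 0.03, -0.07, 0.01, -0.01])

stat, p = stats.shapiro(resid)   # Shapiro-Wilk test of normality
print(p)                         # large P: no evidence against normality
```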
c) Using residuals to check for linearity
d) Using residuals to check for homoscedasticity
e) Using residuals to check for some kinds of non-independence
Days spent following sperm whales
- Durbin-Watson D statistic: 1.48
- low values (< 2) indicate positive autocorrelation
- First-order autocorrelation: 0.26
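The Durbin-Watson statistic is just a ratio of sums over the time-ordered residuals; a minimal sketch:

```python
import numpy as np

def durbin_watson(e):
    """D = sum of squared successive differences / sum of squares.
    D near 2: no autocorrelation; D < 2: positive; D > 2: negative."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))   # alternating residuals
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))     # constant residuals: D = 0
```

D always lies between 0 and 4; perfectly alternating residuals push it toward 4, runs of same-signed residuals push it toward 0.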
Use residuals to
- a) look for outliers which we may wish to remove
- b) examine normality
- c) check for linearity
- d) check for homoscedasticity
- e) check for some kinds of non-independence
Assumptions of simple linear regression
- 1. Existence
- 2. Independence
- 3. Linearity
- 4. Homoscedasticity
- 5. Normality
- 6. X measured without error
When assumptions do not hold
When assumptions do not hold
- 2. Independence
- collect data differently
- reduce the size of the data set
- add additional terms to the regression model
- (e.g. autocorrelation term, species effect)
- More a problem for testing than prediction
When assumptions do not hold
- 3. Linearity
- Transform either X or Y or both variables, e.g.
  - Log(Y) = β0 + β1·Log(X) + E
- Polynomial regression:
  - Y = β0 + β1·X + β2·X² + ... + E
- Non-linear regression, e.g.
  - Y = c + exp(β0 + β1·X) + E
- Piecewise linear regression:
  - Y = β0 + β1·X·(X > XK) + E
  - where (X > XK) = 0 if X < XK, and (X > XK) = 1 if X > XK
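A log-log transformation turns a power law into a straight line, so ordinary least squares recovers the exponent; a sketch with numpy, using exact hypothetical power-law data y = 2·x³:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x ** 3                 # hypothetical exact power-law data

# Fit Log(Y) = b0 + b1*Log(X); np.polyfit returns [slope, intercept]
b1, b0 = np.polyfit(np.log(x), np.log(y), 1)
print(b1, np.exp(b0))            # recovers exponent 3 and coefficient 2
```

Polynomial regression fits the same way: `np.polyfit(x, y, 2)` for the quadratic model above.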
Transformation to improve linearity
When assumptions do not hold
- 4. Homoscedasticity
- Transformations of the Y variable
- Weighted regressions (if we know that some observations are more accurate than others)
Y-transformation to improve homoscedasticity
When assumptions do not hold
- 5. Normality
- Transformations of the Y variable
- Non-normal error structures (e.g. Poisson)
- Small departures from normality are not especially important, unless doing a test
When assumptions do not hold
- 6. X measured without error
- Major axis regression
- Reduced major axis (or geometric mean) regression
Major axis regression
- Minimize the sum of squared perpendicular distances from observations to the regression line
- Only appropriate if the variables are in the same units
- Equivalent to the first principal component of the covariance matrix
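Since the major axis is the first principal component of the 2×2 covariance matrix, its slope falls out of an eigendecomposition; a sketch with numpy:

```python
import numpy as np

def major_axis_slope(x, y):
    """Slope of the major axis: direction of the first principal
    component (largest eigenvalue) of the covariance matrix."""
    cov = np.cov(x, y)                     # 2x2 covariance matrix
    evals, evecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    v = evecs[:, -1]                       # eigenvector of largest eigenvalue
    return v[1] / v[0]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(major_axis_slope(x, 2.0 * x))        # exact line y = 2x
```

Unlike ordinary least squares, this treats X and Y symmetrically, which is why it suits the errors-in-X case.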
Reduced major axis regression
- Each of the two variables is transformed to have a mean of zero and a standard deviation of 1
- Then, minimize the sum of squared perpendicular distances from observations to the regression line
- Its slope cannot be sensibly tested against zero
- Equivalent to the first principal component using the correlation matrix
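Because both variables are standardized first, the RMA slope on the original scale reduces to sign(r)·SD(Y)/SD(X); a sketch with numpy:

```python
import numpy as np

def rma_slope(x, y):
    """Reduced major axis (geometric mean) slope:
    sign of the correlation times SD(Y)/SD(X)."""
    r = np.corrcoef(x, y)[0, 1]
    return np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(rma_slope(x, -3.0 * x))   # exact line y = -3x
```

This also shows why the slope cannot be sensibly tested against zero: SD(Y)/SD(X) is never zero, whatever the strength of the relationship.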
Regression
- Extremely useful technique!
- Check assumptions using residuals
- Can be extended in several ways
- multiple regression
- non-linear regression
- non-normal errors
- piecewise regression
- ...