Title: Correlation and Regression
1Correlation and Regression
2Spearman's rank correlation
- An alternative to the linear correlation coefficient that does not make so many assumptions
- Still measures the strength and direction of association between two variables
- Uses the ranks instead of the raw data
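To make "uses the ranks instead of the raw data" concrete, here is a minimal sketch that ranks each variable (ties get the average rank) and then applies the ordinary correlation formula to the ranks. The data and the function names `ranks`, `pearson_r` and `spearman_rs` are made up for illustration; they are not from the slides.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson_r(x, y):
    """Linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / (ss_x * ss_y) ** 0.5


def spearman_rs(x, y):
    """Spearman's rank correlation: the linear correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data with one extreme Y value.
x = [2, 5, 9, 14, 30]
y = [1.0, 3.5, 3.0, 8.0, 100.0]
rs = spearman_rs(x, y)
print(f"rs = {rs:.2f}")
```

Because only the ranks enter the calculation, the extreme value 100.0 counts simply as "the largest" and has no more influence than any other top-ranked point.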
3Example Spearman's rs
VERSIONS
1. Boy climbs up rope, climbs down again
2. Boy climbs up rope, seems to vanish, re-appears at top, climbs down again
3. Boy climbs up rope, seems to vanish at top
4. Boy climbs up rope, vanishes at top, reappears somewhere the audience was not looking
5. Boy climbs up rope, vanishes at top, reappears in a place which has been in full view
4Hypotheses
H0: The difficulty of the described trick is not correlated with the time elapsed since it was observed.
HA: The difficulty of the described trick is correlated with the time elapsed since it was observed.
5East-Indian Rope Trick
6East-Indian Rope Trick
(Table columns: Years elapsed, Impressiveness score, Rank of years, Rank of impressiveness)
7East-Indian Rope Trick
Table H: n = 21, α = 0.05, critical value = 0.435
P < 0.05, reject H0
8Spearman's Rank Correlation - large n
- For large n (> 100), you can use the ordinary correlation coefficient test on the ranks:
$t = \dfrac{r_s}{SE_{r_s}}$, where $SE_{r_s} = \sqrt{\dfrac{1 - r_s^2}{n - 2}}$
- Under H0, t has a t-distribution with n-2 d.f.
9Measurement Error and Correlation
- Measurement error decreases the apparent correlation between two variables
- You can correct for this effect (see text)
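A small simulation can illustrate the attenuation. This is only a sketch under assumed conditions: the true values are simulated, independent normal measurement error is added to both variables, and all numbers (sample size, standard deviations, seed) are arbitrary choices, not values from the slides.

```python
import random


def pearson_r(x, y):
    """Linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / (ss_x * ss_y) ** 0.5


random.seed(1)  # arbitrary seed so the run is repeatable
n = 5000

# Simulated "true" values: Y depends on X plus some biological scatter.
true_x = [random.gauss(0, 1) for _ in range(n)]
true_y = [x + random.gauss(0, 0.5) for x in true_x]

# The same individuals, measured with independent error in both variables.
measured_x = [x + random.gauss(0, 1) for x in true_x]
measured_y = [y + random.gauss(0, 1) for y in true_y]

r_true = pearson_r(true_x, true_y)
r_measured = pearson_r(measured_x, measured_y)
print(f"r from true values:     {r_true:.2f}")
print(f"r from measured values: {r_measured:.2f}")
```

The correlation computed from the error-laden measurements should come out clearly weaker than the one computed from the true values, even though the underlying relationship is unchanged.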
10Species are not independent data points
11Independent contrasts
12Independent contrasts
14Quick Reference Guide - Correlation Coefficient
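- What is it for? Measuring the strength of a linear association between two numerical variables
- What does it assume? Bivariate normality and random sampling
- Parameter: ρ
- Estimate: r
- Formulae: $r = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$, $SE_r = \sqrt{\dfrac{1 - r^2}{n - 2}}$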
15Quick Reference Guide - t-test for zero linear correlation
- What is it for? To test the null hypothesis that the population parameter, ρ, is zero
- What does it assume? Bivariate normality and random sampling
- Test statistic: t
- Null distribution: t with n-2 degrees of freedom
- Formulae: $t = \dfrac{r}{SE_r}$, where $SE_r = \sqrt{\dfrac{1 - r^2}{n - 2}}$
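As a sketch of how the test statistic is assembled, the function below computes r, its standard error $\sqrt{(1 - r^2)/(n - 2)}$, and t = r / SE_r from raw data. The five data points and the name `correlation_t_test` are invented for illustration; the resulting t would then be compared with the t-distribution with n-2 degrees of freedom.

```python
import math


def correlation_t_test(x, y):
    """Return (r, SE_r, t, df) for the test of zero linear correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # sum of products
    ss_x = sum((a - mean_x) ** 2 for a in x)                     # sum of squares for X
    ss_y = sum((b - mean_y) ** 2 for b in y)                     # sum of squares for Y
    r = sp / math.sqrt(ss_x * ss_y)
    se_r = math.sqrt((1 - r ** 2) / (n - 2))
    t = r / se_r
    return r, se_r, t, n - 2


# Hypothetical data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r, se_r, t, df = correlation_t_test(x, y)
print(f"r = {r:.2f}, SE = {se_r:.3f}, t = {t:.2f}, df = {df}")
```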
16T-test for correlation
Null hypothesis: ρ = 0
Sample
Test statistic: $t = r / SE_r$
Null distribution: t with n-2 d.f.
Compare: How unusual is this test statistic?
P < 0.05: Reject H0
P > 0.05: Fail to reject H0
17Quick Reference Guide - Spearman's Rank Correlation
- What is it for? To test zero correlation between the ranks of two variables
- What does it assume? Linear relationship between ranks and random sampling
- Test statistic: rs
- Null distribution: See Table H; if n > 100, use the t-distribution
- Formulae: Same as linear correlation but based on ranks
18Spearman's rank correlation
Null hypothesis: ρS = 0
Sample
Test statistic: rs
Null distribution: Spearman's rank, Table H
Compare: How unusual is this test statistic?
P < 0.05: Reject H0
P > 0.05: Fail to reject H0
19Quick Reference Guide - Independent Contrasts
- What is it for? To test for correlation between two variables when data points come from related species
- What does it assume? Linear relationship between variables, correct phylogeny, and that the difference between pairs of species in both X and Y has a normal distribution with zero mean and variance proportional to the time since divergence
20Regression
- The method to predict the value of one numerical variable from that of another
- Predict the value of Y from the value of X
- Example: predict the size of a dinosaur from the length of one tooth
21Linear Regression
- Draw a straight line through a scatter plot
- Use the line to predict Y from X
22Linear Regression Formula
- Y = α + βX
- Parameters:
- α = intercept
- The predicted value of Y when X is zero
- β = slope
- The rate of change in Y per unit of change in X
23Interpretations of α and β
(Figure: plots of Y against X showing lines with higher and lower α, and lines with negative β, β = 0, and positive β)
24Linear Regression Formula
- Y = a + bX
- a = estimated intercept
- The predicted value of Y when X is zero
- b = estimated slope
- The rate of change in Y per unit of change in X
25How to draw the line?
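(Figure: scatter plot of Y against X with a fitted line. Each residual is the vertical distance between an observed value $Y_i$ and the value $\hat{Y}_i$ predicted by the line, for example $(Y_1 - \hat{Y}_1)$.)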
26Least-squares Regression
- Draw the line that minimizes the sum of the squared residuals from the line
- Residual is $(Y_i - \hat{Y}_i)$
- Minimize the sum: $SS_{residuals} = \sum (Y_i - \hat{Y}_i)^2$
27Formulae for Least-Squares Regression
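- The slope and intercept that minimize the sum of squared residuals are
$b = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$ (sum of products divided by sum of squares for X)
$a = \bar{Y} - b\bar{X}$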
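The least-squares recipe (slope = sum of products / sum of squares for X, intercept = mean of Y minus slope times mean of X) can be sketched as follows. The data and the helper names `least_squares` and `ss_residuals` are hypothetical. The last lines check the "minimizes the sum of squared residuals" claim by comparing against a slightly different line.

```python
def least_squares(x, y):
    """Return (a, b): the least-squares intercept and slope of Y on X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # sum of products
    ss_x = sum((xi - mean_x) ** 2 for xi in x)                       # sum of squares for X
    b = sp / ss_x
    a = mean_y - b * mean_x
    return a, b


def ss_residuals(x, y, a, b):
    """Sum of squared residuals (Y_i - predicted Y_i)^2 for the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = least_squares(x, y)
ss_fit = ss_residuals(x, y, a, b)
ss_other = ss_residuals(x, y, a, b + 0.1)  # same intercept, slightly steeper line

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"SS residuals: fitted line {ss_fit:.3f}, perturbed line {ss_other:.3f}")
```

Any other choice of slope or intercept gives a larger sum of squared residuals than the fitted line.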
28Example How old is that lion?
X = proportion black, Y = age in years
29Example How old is that lion?
30Example How old is that lion?
X = proportion black, Y = age in years
$\bar{X} = 0.322$, $\bar{Y} = 4.309$
$\sum (X - \bar{X})^2 = 1.222$, $\sum (Y - \bar{Y})^2 = 222.087$, $\sum (X - \bar{X})(Y - \bar{Y}) = 13.012$
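Plugging these summary statistics into the least-squares formulae (slope = sum of products / sum of squares for X, intercept = mean of Y minus slope times mean of X) gives the fitted line for the lion data. This is a sketch of the arithmetic only; the variable names are illustrative.

```python
# Summary statistics quoted for the lion example.
mean_x = 0.322   # mean proportion of black on the nose
mean_y = 4.309   # mean age in years
ss_x = 1.222     # sum of squares for X
sp_xy = 13.012   # sum of products

b = sp_xy / ss_x            # estimated slope
a = mean_y - b * mean_x     # estimated intercept

print(f"slope b = {b:.3f} years per unit proportion black")
print(f"intercept a = {a:.3f} years")
```

With the rounded inputs above this works out to a slope of roughly 10.65 and an intercept of roughly 0.88; the last digit depends on how much the summary statistics were rounded.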
33A certain lion has a nose with 0.4 proportion of
black. Estimate the age of that lion.
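One way to work this out is to fit the line from the summary statistics of the lion example and evaluate it at X = 0.4. The sketch below assumes the quoted values (means 0.322 and 4.309, sum of squares for X 1.222, sum of products 13.012); the function name `predict_age` is made up.

```python
# Summary statistics quoted for the lion example.
mean_x, mean_y = 0.322, 4.309
ss_x, sp_xy = 1.222, 13.012

b = sp_xy / ss_x            # estimated slope
a = mean_y - b * mean_x     # estimated intercept


def predict_age(proportion_black):
    """Predicted age in years from the fitted line Y = a + bX."""
    return a + b * proportion_black


predicted = predict_age(0.4)
print(f"Predicted age at X = 0.4: {predicted:.1f} years")
```

The estimate comes out at a little over 5 years.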
34Standard error of the slope
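$SE_b = \sqrt{\dfrac{MS_{residual}}{\sum (X_i - \bar{X})^2}}$, where $MS_{residual} = \dfrac{\sum (Y_i - \bar{Y})^2 - b \sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 2}$
(the first sum in the numerator is the sum of squares for Y; the second is the sum of products)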
35Lion Example, continued
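A sketch of the standard-error arithmetic for the lion data, assuming the usual formulae: the residual mean square is (sum of squares for Y minus b times the sum of products) divided by n-2, and the standard error of the slope is the square root of the residual mean square over the sum of squares for X. The sample size n = 32 is the one used for the degrees of freedom in the slope test.

```python
import math

# Summary statistics quoted for the lion example.
n = 32
ss_x = 1.222      # sum of squares for X
ss_y = 222.087    # sum of squares for Y
sp_xy = 13.012    # sum of products

b = sp_xy / ss_x                              # estimated slope
ms_residual = (ss_y - b * sp_xy) / (n - 2)    # residual mean square
se_b = math.sqrt(ms_residual / ss_x)          # standard error of the slope

print(f"MS residual = {ms_residual:.3f}")
print(f"SE of slope = {se_b:.2f}")
```

The residual mean square obtained this way agrees with the 2.785 reported in the ANOVA table for the same data.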
36Confidence interval for the slope
37Lion Example, continued
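The confidence interval for the slope has the usual form b ± t × SE_b, where t is the two-tailed critical value with n-2 degrees of freedom. The sketch below assumes that form and uses 2.04, the critical value quoted for 30 d.f. at α = 0.05 in the slope test, so it yields an approximate 95% interval.

```python
import math

# Summary statistics quoted for the lion example.
n = 32
ss_x, ss_y, sp_xy = 1.222, 222.087, 13.012
t_crit = 2.04   # two-tailed critical t for 30 d.f., alpha = 0.05 (from the slope test)

b = sp_xy / ss_x
ms_residual = (ss_y - b * sp_xy) / (n - 2)
se_b = math.sqrt(ms_residual / ss_x)

lower = b - t_crit * se_b
upper = b + t_crit * se_b
print(f"95% CI for the slope: {lower:.2f} to {upper:.2f}")
```

The interval excludes zero, which is consistent with rejecting the null hypothesis of zero slope.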
38Predicting Y from X
- What is our confidence for predicting Y from X?
- Two types of predictions:
- What is the mean Y for each value of X? (Confidence bands)
- What is a particular individual Y at each value of X? (Prediction intervals)
39Predicting Y from X
- Confidence bands measure the precision of the predicted mean Y for each value of X
- Prediction intervals measure the precision of predicted single Y values for each value of X
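The difference between the two can be seen in their standard errors. The sketch below assumes the standard textbook formulae, which are not shown on the slides: for the predicted mean, $\sqrt{MS_{residual}(1/n + (X - \bar{X})^2 / \sum (X_i - \bar{X})^2)}$, and for a single individual the same expression with an extra 1 inside the parentheses. It uses the lion summary statistics and evaluates both at X = 0.4.

```python
import math

# Summary statistics quoted for the lion example.
n = 32
mean_x = 0.322
ss_x, ss_y, sp_xy = 1.222, 222.087, 13.012

b = sp_xy / ss_x
ms_residual = (ss_y - b * sp_xy) / (n - 2)


def se_mean_prediction(x):
    """Standard error of the predicted MEAN Y at x (drives the confidence bands)."""
    return math.sqrt(ms_residual * (1 / n + (x - mean_x) ** 2 / ss_x))


def se_individual_prediction(x):
    """Standard error of a predicted SINGLE Y at x (drives the prediction interval)."""
    return math.sqrt(ms_residual * (1 + 1 / n + (x - mean_x) ** 2 / ss_x))


se_mean = se_mean_prediction(0.4)
se_single = se_individual_prediction(0.4)
print(f"SE of mean age at X = 0.4:       {se_mean:.2f}")
print(f"SE of individual age at X = 0.4: {se_single:.2f}")
```

The individual-level standard error is much larger because it includes the scatter of individuals around the line, not just the uncertainty in the line itself, so prediction intervals are always wider than confidence bands.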
40Predicting Y from X
Confidence bands
Prediction interval
41Predicting Y from X
Confidence bands: How confident can we be about the regression line?
Prediction interval: How confident can we be about the predicted values?
42Testing Hypotheses about a Slope
- t-test for regression slope
- H0: There is no linear relationship between X and Y (β = 0)
- HA: There is a linear relationship between X and Y (β ≠ 0)
43Testing Hypotheses about a Slope
- Test statistic: $t = \dfrac{b - \beta_0}{SE_b}$, where $\beta_0 = 0$ is the slope under H0
- Null distribution: t with n-2 d.f.
44Lion Example, continued
df = n - 2 = 32 - 2 = 30
Critical value = 2.04
7.05 > 2.04, so we reject the null hypothesis
Conclude that β ≠ 0
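The t value of 7.05 can be reproduced from the lion summary statistics, assuming the usual formulae for the slope (sum of products over sum of squares for X) and for its standard error (square root of the residual mean square over the sum of squares for X). The summary values below are the ones quoted earlier for this example.

```python
import math

# Summary statistics quoted for the lion example.
n = 32
ss_x, ss_y, sp_xy = 1.222, 222.087, 13.012
t_crit = 2.04   # critical value quoted for 30 d.f.

b = sp_xy / ss_x
ms_residual = (ss_y - b * sp_xy) / (n - 2)
se_b = math.sqrt(ms_residual / ss_x)

t = (b - 0) / se_b          # null hypothesis: slope = 0
reject_h0 = abs(t) > t_crit

print(f"t = {t:.2f} with {n - 2} d.f.; reject H0: {reject_h0}")
```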
45Testing Hypotheses about a Slope ANOVA approach
Source of variation   Sum of squares   df     Mean squares   F   P
Regression                             1
Residual                               n-2
Total                                  n-1
46Lion Example, continued
Source of variation   Sum of squares   df    Mean squares   F      P
Regression            138.54           1     138.54         49.7   <0.001
Residual              83.55            30    2.785
Total                 222.09           31
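The entries in the table are linked by simple arithmetic: each mean square is its sum of squares divided by its degrees of freedom, and F is the regression mean square divided by the residual mean square. A short sketch using the tabulated sums of squares (the variable names are illustrative):

```python
# Sums of squares and degrees of freedom from the lion ANOVA table.
ss_regression, df_regression = 138.54, 1
ss_residual, df_residual = 83.55, 30

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

print(f"MS regression = {ms_regression:.2f}, MS residual = {ms_residual:.3f}")
print(f"F = {f_ratio:.1f} on {df_regression} and {df_residual} d.f.")
print(f"Total: SS = {ss_total:.2f}, df = {df_total}")
```

With one degree of freedom for the regression, F is the square of the t statistic for the slope (7.05 squared is about 49.7), so the two approaches test the same hypothesis.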
47Testing Hypotheses about a Slope R2
- R² measures the fit of a regression line to the data
- Gives the proportion of variation in Y that is explained by variation in X
$R^2 = \dfrac{SS_{regression}}{SS_{total}}$
48Lion Example, Continued
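Applying R² = SS regression / SS total to the sums of squares in the lion ANOVA table is a one-line calculation, sketched here:

```python
# Sums of squares from the lion ANOVA table.
ss_regression = 138.54
ss_total = 222.09

r_squared = ss_regression / ss_total
print(f"R^2 = {r_squared:.2f}")
```

So a little over 60% of the variation in age is explained by the proportion of black on the nose.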
49Assumptions of Regression
- At each value of X, there is a population of Y values whose mean lies on the true regression line
- At each value of X, the distribution of Y values is normal
- The variance of Y values is the same at all values of X
- At each value of X, the Y measurements represent a random sample from the population of Y values
50Detecting Linearity
- Make a scatter plot
- Does it look like a curved line would fit the
data better than a straight one?
51Non-linear relationship: Number of fish species vs. size of desert pool
52Taking the log of area
53Detecting non-normality and unequal variance
- These are best detected with a residual plot
- Plot the residuals $(Y_i - \hat{Y}_i)$ against X
- Look for:
- Symmetric cloud of points
- Little noticeable curvature
- Equal variance above and below the line
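The residuals for such a plot are straightforward to compute once the line is fitted. The sketch below uses made-up data and a hypothetical helper `fit_line`; it prints the residuals next to X rather than drawing them, so any plotting library could be used for the actual figure.

```python
def fit_line(x, y):
    """Return (a, b): the least-squares intercept and slope of Y on X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp / ss_x
    return mean_y - b * mean_x, b


# Hypothetical data, not from the slides.
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

a, b = fit_line(x, y)

# Residual = observed Y minus the Y predicted by the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

for xi, ri in zip(x, residuals):
    print(f"X = {xi}: residual = {ri:+.3f}")
```

Least-squares residuals always sum to zero, so the thing to inspect is their pattern against X: a run of residuals with the same sign suggests curvature, and a spread that widens or narrows with X suggests unequal variance.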
54Residual plots help assess assumptions
Original
Residual plot
55Transformed data
Logs
Residual plot
56What if the relationship is not a straight line?
- Transformations
- Non-linear regression
57Transformations
- Some (but not all) nonlinear relationships can be made linear with a suitable transformation
- Most common: log transform Y, X, or both
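As an illustration of a log transformation straightening a curve, the sketch below builds artificial data in which Y rises by a fixed amount for every tenfold increase in X, then fits a line to the raw X and to log10(X). The numbers are invented and are not the fish data; `fit_line` and `ss_residuals` are hypothetical helpers.

```python
import math


def fit_line(x, y):
    """Return (a, b): the least-squares intercept and slope of Y on X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp / ss_x
    return mean_y - b * mean_x, b


def ss_residuals(x, y, a, b):
    """Sum of squared residuals for the line a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Artificial data: Y = 2 + 3 * log10(X), a curve on the raw X scale.
x = [1, 10, 100, 1000, 10000]
y = [2 + 3 * math.log10(xi) for xi in x]
log_x = [math.log10(xi) for xi in x]

a_raw, b_raw = fit_line(x, y)
a_log, b_log = fit_line(log_x, y)

ss_raw = ss_residuals(x, y, a_raw, b_raw)
ss_log = ss_residuals(log_x, y, a_log, b_log)

print(f"Raw X:     SS residuals = {ss_raw:.3f}")
print(f"log10(X):  SS residuals = {ss_log:.3f}, a = {a_log:.2f}, b = {b_log:.2f}")
```

On the log scale the straight line fits these artificial data essentially perfectly, whereas on the raw scale a sizeable residual sum of squares remains. Real data will not be this clean, so a residual plot is still the check to make after transforming.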