Title: Correlation and Simple Linear Regression
1. Correlation and Simple Linear Regression
2. Correlation: Research questions
- Is there an association between age and blood pressure?
- To assess whether two variables are associated, i.e. whether the values of one variable tend to be higher (or lower) for higher values of the other variable
- Associations between two continuous variables
3. Correlation
- Measures the strength of linear association between two continuous variables
- Can be positive or negative
- Can vary between -1 and 1
- Does not imply causation (there may be some other factor that can explain the association).
4. Correlation and causation
r = 0.61
5. A note on correlation
- A correlation does not mean that one variable causes the other
- Coffee consumption and road traffic accidents are strongly associated, but that does not indicate that drinking coffee causes road traffic accidents
6. Pearson correlation coefficient
Subject   Body weight   Plasma volume
1         58.0          2.75
2         70.0          2.86
3         74.0          3.37
4         63.5          2.76
5         62.0          2.62
6         70.5          3.49
7         71.0          3.05
8         66.0          3.12
7. Correlation coefficient
- The correlation coefficient is calculated as
$$ r = \frac{\text{Covariance between } X \text{ and } Y}{\sqrt{\text{Variance of } X \times \text{Variance of } Y}} $$
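As a minimal sketch, the formula can be applied to the eight body weight and plasma volume pairs tabulated on the previous slide. The function name `pearson_r` and the variable names are illustrative choices, not part of the original material; only the Python standard library is used.

```python
import math

# Body weight and plasma volume for the eight subjects tabulated earlier.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

def pearson_r(x, y):
    """Covariance of x and y divided by the square root of the product of their variances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The (n - 1) divisors of the covariance and the two variances cancel,
    # so sums of cross-products and sums of squares are enough.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(weight, plasma)
print(round(r, 2))  # about 0.76
```

For these eight subjects the coefficient comes out at roughly 0.76, a fairly strong positive linear association.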
8. Pearson correlation coefficient
- r = -1: Perfect (strongest possible) negative linear relationship
- As the value of X increases, the value of Y decreases
- r = 0: No linear relationship between X and Y
- r = 1: Perfect (strongest possible) positive linear relationship
- As the value of X increases, the value of Y increases
9. Pearson correlation coefficient
r approaching -1
r approaching 1
10. Hypothesis test for correlation coefficient
- It is possible to test whether a correlation coefficient differs significantly from zero
- The test statistic for the correlation coefficient follows a t-distribution when the null hypothesis is true
$H_0: \rho = 0$ vs. $H_1: \rho \neq 0$
11. Hypothesis test for correlation coefficient
- The significance of the correlation coefficient will depend on the size of the correlation coefficient and the number of observations in the sample
- The validity of this test requires that the variables are observed on a random sample of individuals and that at least one of the variables follows a normal distribution
12. Correlation matrix
Correlation: 0.814; P-value: < 0.001; Number: 252
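The slides do not spell out the test statistic. A sketch, assuming the usual form t = r × sqrt((n - 2) / (1 - r²)) with n - 2 degrees of freedom under the null hypothesis, applied to the reported correlation of 0.814 from 252 observations (the function name is illustrative):

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, assumed to be r * sqrt((n - 2) / (1 - r^2))."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.814, 252          # values reported in the correlation matrix
t_stat = correlation_t_statistic(r, n)
df = n - 2                 # degrees of freedom of the reference t-distribution
print(round(t_stat, 1), df)
```

A t value of about 22 on 250 degrees of freedom lies far in the tail of the t-distribution, which is consistent with the reported P-value of less than 0.001. The same r with a much smaller n would give a much smaller t, which is why significance depends on both the size of the coefficient and the number of observations.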
13. Assumptions of correlation
- Assumptions of distribution
- Hypothesis test: at least one variable normally distributed
- Confidence interval: both variables must be normally distributed
14. Non-parametric correlation
- When the data are ordinal, or are not Normally distributed, a rank correlation method can be applied (Spearman's rank correlation)
15. Example: Spearman's rank correlation
- A study was conducted to investigate the relationship between the anxiety score for a child evaluated by the child him/herself and by that child's mother.
- Children's anxiety scores were measured on a continuous scale, mothers' anxiety scores on an ordinal scale from 1 to 7.
- The null hypothesis is no relationship between children's and mothers' evaluations of children's anxiety.
16. Example: Spearman's rank correlation
- The correlation coefficient is calculated in the same way as for Pearson's correlation coefficient, except that it is calculated on the ranks and not the actual values.
- It ranges from -1 to 1 and has the same interpretation.
- No requirement for the data to follow a Normal distribution.
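A minimal sketch of the calculation: rank each variable, giving tied values the average of their ranks, then apply the Pearson formula to the ranks. The scores below are made up for illustration and are NOT the study's data; the function names are also illustrative.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks rather than the raw values."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical scores: child on a continuous scale, mother on an ordinal 1-7 scale.
child = [12.0, 30.5, 18.2, 25.1, 9.4, 21.7]
mother = [2, 6, 3, 6, 1, 4]
rho = spearman_rho(child, mother)
print(round(rho, 3))
```

Because only the ordering of the values matters, any monotonic transformation of either variable leaves the coefficient unchanged.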
17. Example: Spearman's rank correlation
Correlation is significant at 5% (P < 0.05), so the null hypothesis is rejected, meaning that there is a relationship between children's and mothers' evaluations of children's anxiety
Correlation: 0.638; P-value: 0.035; Number: 11
18. Use and misuse of correlation
- All observations should be independent
- Only one observation of each variable should come from each individual in the study
- Data dredging
- 10 variables, 45 possible correlations; 20 variables, 190 possible correlations!
- Assessing agreement
- Relationships between a part and a whole
- Total cholesterol and LDL cholesterol (total cholesterol is the sum of 3 types of cholesterol)
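The pair counts quoted for data dredging follow from the number of ways to choose 2 variables out of k, which is k(k - 1)/2. A short sketch (the function name is illustrative):

```python
from math import comb

def n_pairwise_correlations(k):
    """Number of distinct pairs among k variables: k * (k - 1) / 2."""
    return comb(k, 2)

for k in (10, 20):
    pairs = n_pairwise_correlations(k)
    # If every null hypothesis were true, about 5% of tests at the 5% level
    # would still be expected to come out significant by chance alone.
    expected_by_chance = round(0.05 * pairs, 2)
    print(k, pairs, expected_by_chance)  # pairs: 10 -> 45, 20 -> 190
```

With 190 tests, several "significant" correlations would be expected by chance alone, which is why testing every possible pair is treated as a misuse.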
19. When not to use correlation
- Spurious correlations involving time
- E.g. a positive correlation between a stork population and human birth rates in an area of the Netherlands
- Both variables were increasing with time and so appear to be highly correlated
- Should look at many areas rather than one area over time
20. Simple linear regression
21. Research questions
- How does systolic blood pressure change as age increases?
- Can systolic blood pressure be predicted from a subject's age?
- Can body fat be predicted from abdomen circumference measurements?
22. Simple Linear Regression
- Simple linear regression describes the relationship between two continuous variables
- Simple linear regression gives the equation of the straight line that best describes the association between two continuous variables.
- It enables the prediction of one variable using information from another variable.
23. Types of Variable in Linear Regression
- The dependent variable is the variable to be predicted (i.e. the particular outcome of interest).
- The independent variable or explanatory variable is the variable used for predicting the particular outcome.
24. Equation of a straight line
- The equation of a straight line is $y = a + bx$
- y is the predicted value (of the dependent variable)
- a is the intercept
- b is the slope (or gradient) of the line
- x is the independent (explanatory) variable
25. Least squares
- The values of a and b are calculated to minimise the sum of the squared vertical distances between the observed values of the dependent variable and the regression line. This is called the least squares fit.
- Each vertical distance is the difference between the actual value of the dependent variable and the predicted value from the regression line for that value of the independent variable
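A minimal sketch of a least squares fit, using the standard closed-form solution (slope = sum of cross-products / sum of squares of x, intercept = mean of y minus slope times mean of x). It reuses the body weight and plasma volume data tabulated earlier; treating plasma volume as the dependent variable is an assumption made for illustration, and the function names are hypothetical.

```python
# Body weight (x) and plasma volume (y) for the eight subjects tabulated earlier.
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

def least_squares_fit(x, y):
    """Return intercept a and slope b minimising the sum of squared vertical distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_distances(x, y, a, b):
    """Sum of squared differences between observed y and the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares_fit(weight, plasma)
print(round(a, 4), round(b, 4))
best = sum_squared_distances(weight, plasma, a, b)
# Moving the line away from the fitted a and b can only increase the sum.
print(best < sum_squared_distances(weight, plasma, a + 0.05, b))
```

Any other choice of intercept or slope gives a larger sum of squared distances, which is what "best fit" means here.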
26. Regression coefficient (b)
- The slope, b, is often called the regression coefficient
- It has the same sign as the correlation coefficient
- When there is no correlation between x and y, then the regression coefficient, b, will equal 0
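The link between the slope and the correlation coefficient can be made explicit. Writing $s_x$ and $s_y$ for the sample standard deviations of x and y (notation introduced here, not in the slides), the least squares slope satisfies:

```latex
b \;=\; \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
  \;=\; r \,\frac{s_y}{s_x}
```

Since $s_x$ and $s_y$ are both positive, b always has the same sign as r, and b = 0 exactly when r = 0.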
27. Residuals
- $y = a + bx + \varepsilon$
- $\varepsilon$ is termed the residual.
- The residual is the difference between the observed value $y$ and the predicted value $\hat{y}$ (as calculated from the regression equation). So residual = $(y - \hat{y})$
- A residual is calculated for each observation.
- The method of least squares minimises the sum of squared residuals.
- Mathematical techniques are used to find the values of a and b which satisfy the least squares fit.
28. Predicted value ($\hat{y}$)
- The predicted value, $\hat{y}$, is subject to sampling variation
- Its precision (prediction error) can be estimated by the standard error of the estimate
- The greater the standard error, the greater the dispersion of observed y values around the regression line and hence the larger the prediction error
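The slides do not give a formula for the standard error of the estimate. A sketch, assuming the usual definition sqrt(sum of squared residuals / (n - 2)), and reusing the body weight and plasma volume data tabulated earlier with plasma volume as the dependent variable (an illustrative choice):

```python
import math

weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b

def residuals(x, y):
    """Observed y minus predicted y for every observation."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def standard_error_of_estimate(x, y):
    """Assumed definition: sqrt(sum of squared residuals / (n - 2))."""
    res = residuals(x, y)
    return math.sqrt(sum(e ** 2 for e in res) / (len(x) - 2))

res = residuals(weight, plasma)
s = standard_error_of_estimate(weight, plasma)
print(round(s, 3))
```

The divisor n - 2 reflects the two quantities (a and b) estimated from the data. A larger value means observations sit further from the line, so predictions are less precise.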
29. Example
- A fitness gym wishes to assess their clients' body fat. An accurate method of measuring body fat is an underwater weighing technique. This is not a practical method for the fitness instructors to carry out on the premises. They would like to be able to predict their clients' body fat from other measurements that are more easily obtained from the client.
- 252 men had their body fat and abdomen circumference measured
30. Testing hypothesis
- H0: There is no linear relationship between body fat and abdomen circumference in the population
- H1: There is a linear relationship between body fat and abdomen circumference in the population
- Or this can be rephrased as
- H0: Abdomen circumference does not account for any variability in body fat in the population
- H1: Abdomen circumference does account for some of the variability in body fat in the population
31. Assess whether linear relationship exists
- The scatterplot of body fat versus abdomen circumference indicates that there is a strong positive relationship between the two variables
- Recall that the correlation coefficient was 0.814
32. Simple linear regression in SPSS
- Analyze
- Regression
- Linear
- The dependent variable is body fat
- The independent variable is abdomen circumference
33. SPSS linear regression
- R is the correlation between the two variables: 0.814
- R square is the proportion of variability in body fat measurements that can be explained by differences in abdomen circumference: 0.662 or 66.2%
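In simple linear regression, R square is the square of the correlation coefficient, and equivalently one minus the ratio of residual to total variability. In the notation below (introduced here for illustration), $\hat{y}_i$ is the predicted value and $\bar{y}$ the mean of the observed values:

```latex
R^2 \;=\; r^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad 0.814^2 \approx 0.663
```

The small difference from the reported 0.662 is presumably because the software squares the unrounded correlation rather than the three-decimal value.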
34. SPSS linear regression
- A statistically significant proportion of the
variability in body fat measurements can be
attributed to the regression model
35. SPSS: regression equation
- Predicted body fat = constant + (B × abdomen circumference)
- Predicted body fat = -35.197 + (0.585 × abdomen circumference)
36. Prediction
- How do you use linear regression for prediction?
- The regression equation allows you to predict the value of the dependent variable (Y) for a particular value of the independent variable (X)
- Predicted body fat = -35.197 + (0.585 × abdomen circumference)
- What is the predicted body fat content for a man with an abdomen circumference of 100 cm?
- Predicted body fat = -35.197 + (0.585 × 100)
- = -35.197 + 58.5
- = 23.3
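The same arithmetic as a small sketch. The coefficients are the ones reported in the SPSS output; the function and constant names are illustrative.

```python
INTERCEPT = -35.197  # constant from the regression output
SLOPE = 0.585        # regression coefficient B for abdomen circumference (cm)

def predict_body_fat(abdomen_cm):
    """Predicted body fat from the fitted line: constant + B * abdomen circumference."""
    return INTERCEPT + SLOPE * abdomen_cm

prediction = predict_body_fat(100)
print(round(prediction, 1))  # 23.3
```

Predictions are only trustworthy for abdomen circumferences within the range observed in the 252 men used to fit the line.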
37. Assumptions of linear regression
- There should be a linear relationship between the dependent variable and the independent variable
- For any value of the independent variable the dependent variable values should follow a Normal distribution (i.e. normally distributed residuals)
- The variance of the dependent variable values should be the same for all independent variable values.
38. Checking the assumptions
- After the regression model has been fitted to the data it is essential to check that the assumptions of linear regression have not been violated
- If any of the assumptions have been violated then the regression model is likely to be invalid
39. Assumptions: Linearity (1)
- Plot the dependent variable against the independent variable
- A linear pattern (sausage shape) should be seen if the linearity assumption is to hold
- Assumption satisfied
40. Assumptions: Linearity (2)
- Plot the residuals against the predicted values
- No curvature in the plot should be seen for the linearity assumption to hold
- Assumption satisfied
41. Assumptions: Normal residuals (1)
- Whether the residuals are Normally distributed can be checked by looking at a histogram of the residuals
- Assumption satisfied
42. Assumptions: Normal residuals (2)
- Whether the residuals are Normally distributed can be checked by looking at a normal probability plot
- Assumption satisfied
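A sketch of the points behind a normal probability plot: the sorted residuals are paired with the standard Normal quantiles expected at the same positions, and roughly Normal residuals fall close to a straight line. The body fat data are not reproduced in the slides, so the example reuses the body weight and plasma volume data tabulated earlier. The plotting positions (i - 0.5)/n are one common convention, assumed here, and the function names are illustrative.

```python
from statistics import NormalDist

weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

def residuals_from_fit(x, y):
    """Residuals (observed minus predicted) from a least squares line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def normal_plot_points(residuals):
    """Pair each sorted residual with the standard Normal quantile at (i - 0.5)/n."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))

points = normal_plot_points(residuals_from_fit(weight, plasma))
for quantile, residual in points:
    print(f"{quantile:6.2f}  {residual:7.3f}")
```

Plotting the second column against the first gives the normal probability plot. With only eight points the plot is a rough guide at best.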
43. Assumption: Constant variance
- Constant variance of the residuals can be assessed by plotting the residuals against the predicted values
- There should be an even spread of residuals around zero
- Assumption satisfied
44. Assumption: Constant variance
- This assumption would not be satisfied if the spread of the residuals increased or decreased as the predicted values increase in size
- The plot should show a random scatter with no pattern
45. Summary: correlation
- Measures the strength of a linear association between two variables (usually continuous or discrete).
- High positive or negative correlations suggest that two variables are related (but not that one causes the other).
- Looking at scatterplots of the variables is always a good idea.
- Check for common influences such as time or age which may affect both of the variables.
46. Summary: simple linear regression
- Simple linear regression gives the equation of the straight line that best describes the association between two variables.
- A linear relationship between the dependent variable and the independent variable is required.
- For any value of the independent variable the dependent variable values should follow a Normal distribution.
- The variance of the dependent variable values should be the same for all independent variable values.