Title: Multiple linear regression
1. Multiple linear regression
2. Recap
- Correlation indicates the strength of the linear relationship between two variables
- Simple linear regression describes the linear relationship between two variables
- For linear regression to be valid, the assumptions of linearity, normality of residuals and constant variance must all hold
3. Linear regression equation
- y = a + bx
- y = intercept + (slope × x)
4. No-intercept example
- y = bx
5. Good fit? Interpreting R²
[Figure: identical regression lines, with R² high on the left and low on the right]
6. Statistical inference in regression
- The regression coefficients calculated from a sample of observations are estimates of the population regression coefficients
- Hypothesis tests and confidence intervals can be constructed using the sample estimates to make inferences about the population regression coefficients
- For the valid use of these inferential approaches, it is necessary to check the underlying distribution of the data (linearity, normality, constant variance)
7. Multiple linear regression
- Situations frequently occur when we are interested in the dependency of a variable on several explanatory (independent) variables
- The joint influence of the variables, taking into account possible correlations among them, may be investigated using multiple regression
- Multiple regression can be extended to any number of variables, although it is recommended that the number be kept reasonably small
8. Partitioning of variation in the dependent variable
[Diagram: variation in systolic blood pressure partitioned between gender and age]
9. Situations where multiple linear regression is appropriate
- To explore the dependency of one outcome variable on two or more explanatory variables simultaneously, e.g. development of a prognostic index
- To study the relationship between two variables after removing the possible effects of other nuisance variables
- To adjust for differences in confounding factors between groups
10. Research questions
- Data on cystic fibrosis patients were collected. The researchers were interested in which factors are related to patients' malnutrition (as measured by PEmax). Data were available on age, sex, height, weight, BMI, lung capacity, FEV1 and other lung function variables.
- Researchers would like to predict a person's percentage body fat using measurements of bicep circumference, abdomen circumference, height, weight and age of the subject.
11. Research questions
- To investigate the effect of parental birth weight on infant birth weight. A strong relationship was found. Other explanatory variables known to be associated with infant birth weight, such as maternal height, number of previous children, maternal smoking and weight gain during pregnancy, were also collected.
- Multiple regression analysis was conducted to assess whether the observed association between parental birth weight and infant birth weight could be explained by inter-relationships between parental birth weight and the additional variables. It might be, for example, that mothers with low birth weights were more likely to smoke.
12. Research question
- Two (non-randomised) groups of patients are receiving two different drug treatments for hypertension.
- The effectiveness of the drugs is to be assessed by measuring each patient's blood pressure six months following treatment.
- A comparison of the characteristics of the patients in the two groups indicates that patients on drug A are older than those on drug B.
- There is a known relationship between age and blood pressure.
- Multiple linear regression is used to adjust for (remove) the effect of age on blood pressure before carrying out a comparison of the two treatments.
13. Model
- y = a + b1x1 + b2x2 + b3x3 + ... + bkxk
- y: dependent variable
- ŷ: predicted value of the dependent variable
- a: intercept (constant)
- b1: regression coefficient for x1
- x1: explanatory (independent) variable
- b2: regression coefficient for x2
- x2: explanatory (independent) variable
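As an illustration, this model can be fitted directly in software. Below is a minimal sketch using Python's statsmodels; the file name and column names (bodyfat, abdomen, height, thigh) are hypothetical stand-ins for the body-fat example used later in the lecture.

```python
# Minimal sketch: fitting y = a + b1*x1 + b2*x2 + b3*x3 by least squares.
# The file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bodyfat.csv")  # assumed data file

model = smf.ols("bodyfat ~ abdomen + height + thigh", data=df).fit()

print(model.params)     # a (Intercept) and the coefficients b1, b2, b3
print(model.summary())  # also reports R-squared and the ANOVA F-test
```

The summary output plays the same role as the SPSS tables shown later: coefficient estimates with t-tests, the ANOVA F-test and R².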
14. Multiple correlation
- R: coefficient of multiple correlation
- It is the correlation between Y and the combined predictors (x1, x2, ..., xk)
- R²: coefficient of multiple determination
- It is the proportion of variance in Y that can be accounted for by the combined predictors (x1, x2, ..., xk)
15. Collinearity
- Occurs when the explanatory variables are correlated with one another
- Extreme multicollinearity occurs when one explanatory variable is a linear function of some of the other explanatory variables
16. Collinearity: cystic fibrosis example
- Data on cystic fibrosis patients: what factors are related to patients' malnutrition (measured by PEmax)?
- A regression model included height and weight (r = 0.92) as explanatory variables and PEmax (an index of malnutrition) as the dependent variable
- Both height and weight were highly correlated with PEmax
- The model with these two variables accounted for 40% of the variation in PEmax
- In the model, neither the coefficient for weight nor the coefficient for height was significant
- Including both of these highly correlated variables obscured their relationship with PEmax
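In practice this problem can be screened for by inspecting pairwise correlations and variance inflation factors (VIFs) before settling on a model. A minimal sketch in Python, assuming a DataFrame df with hypothetical height and weight columns:

```python
# Sketch: screening explanatory variables for collinearity.
# `df` and its column names are hypothetical.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[["height", "weight"]]
print(X.corr())  # r near 1 (0.92 in this example) warns of collinearity

X_const = sm.add_constant(X)  # VIFs are computed with an intercept included
for i, name in enumerate(X_const.columns):
    if name != "const":
        print(name, variance_inflation_factor(X_const.values, i))
# A common rule of thumb treats VIFs above about 10 as cause for concern.
```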
17. Criteria for inclusion in the model
- The variable should account for a significant proportion of the variation in the dependent variable
- This can be assessed by either of the following two comparable tests:
  - F-test from the ANOVA table
  - t-test of the regression coefficient (B)
18. Criteria for a variable to be entered: F-test
- H0: The independent (explanatory) variable does not account for any of the variability in body fat in the population (F ratio ≈ 1)
- H1: Abdomen circumference does account for some of the variability in body fat in the population (F ratio > 1)
19. Criteria for a variable to be entered: t-test
- H0: The regression coefficient for the explanatory variable is equal to zero (b1 = 0)
- H1: The regression coefficient for the explanatory variable is not equal to zero (b1 ≠ 0)
20. Selection of explanatory variables
- Methods by which explanatory variables are selected for inclusion in the regression model:
  - Enter
  - Forward selection
  - Backward selection
  - Stepwise selection
- The use of different selection methods on the same dataset may result in different regression models
21. Enter
- The explanatory (independent) variables are forced into the model
- Examination of the output from the regression model will indicate whether each of the explanatory variables explains a significant proportion of the variation in the dependent variable
- We can test whether the coefficient for each explanatory variable differs significantly from 0
22. Automatic selection procedures
- Be cautious in the use of these procedures
- They should be used in combination with the data analyst's knowledge and common sense
- Models selected using these automatic methods alone are based on mathematical relationships and may not make biological/clinical sense
23. Forward selection
- Simple linear regressions are carried out for each explanatory variable
- The one variable which accounts for the most variability is selected
- Linear regressions with all pairs of explanatory variables (including the first selected) are carried out
- The regression which accounts for the most variability in the dependent variable is selected
- and so on ...
24. Backward selection
- Multiple regression is performed using all the explanatory variables
- Each explanatory variable is dropped in turn, and the one that makes the least contribution is dropped
- Each remaining explanatory variable is then dropped in turn from this reduced model
- The next one which contributes least to the model is removed
- and so on ...
25. Stepwise selection
- This approach combines both the forward and backward selection procedures
- A variable may be added to the model, but at each step all variables in the model are considered for exclusion
- Can be forward or backward stepwise selection
- SPSS adopts a forward stepwise procedure
26. Stepwise selection
- At each stage in a stepwise selection procedure, explanatory variables already entered in the model are assessed to see whether they still account for a significant proportion of the variation in the dependent variable
- At each stage in a stepwise procedure:
  - all explanatory variables not in the model are assessed for inclusion
  - all explanatory variables in the model are assessed for removal
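For comparison, a rough analogue of forward selection can be run in Python. This is only a sketch: scikit-learn's selector adds variables by cross-validated fit rather than by the F/t significance tests SPSS uses, and the column names are hypothetical stand-ins for the body-fat example.

```python
# Sketch: forward selection with scikit-learn (score-based, not the
# significance-based procedure SPSS uses). Names are hypothetical.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X = df[["age", "weight", "height", "hip", "biceps",
        "neck", "knee", "forearm", "abdomen"]]
y = df["bodyfat"]

selector = SequentialFeatureSelector(
    LinearRegression(), direction="forward", n_features_to_select=3
)
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))  # the variables retained
```

Setting direction="backward" gives the backward analogue.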
27. Example
- Recall the fitness gym example
- Dependent variable: percentage body fat
- Explanatory variables:
  - age
  - weight
  - height
  - hip, biceps, neck, knee, forearm and abdomen circumference measurements
28. Example
- The aim is to produce an equation which would allow us to predict percentage body fat based on alternative measurements
- Selection procedure: stepwise
- In the SPSS dialogue box, enter all the explanatory (independent) variables you wish to be considered for inclusion and then select stepwise as the method
29. SPSS output: multiple regression
30. SPSS output: multiple linear regression
- Each model produced is reported
- The R² for each model indicates the proportion of variability in the dependent variable accounted for by that model
- Note that the standard error of the estimate is reduced with each additional variable entered
31. SPSS output: multiple regression
32. SPSS output: multiple regression
33. Variables not in the model
34. SPSS output: multiple regression
35. Prediction
- The predicted percentage body fat for a man with an abdomen circumference of 100 cm, a height of 168 cm and a thigh circumference of 57 cm
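The prediction is obtained by substituting these values into the fitted equation (the coefficients themselves appear in the SPSS output above). A sketch with statsmodels, reusing the hypothetical `model` object fitted earlier:

```python
# Sketch: predicting percentage body fat for one new subject.
# `model` is the hypothetical statsmodels fit from earlier.
import pandas as pd

new_man = pd.DataFrame({"abdomen": [100], "height": [168], "thigh": [57]})
print(model.predict(new_man))  # predicted percentage body fat
```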
36. Checking assumptions
- After a model has been fitted to the data, it is essential to check that the assumptions of multiple linear regression have not been violated
37. Checking assumptions
- There should be a linear relationship between the dependent variable and ALL continuous/discrete explanatory variables
- For any value of x, the values of the dependent variable should be normally distributed about the regression line (normally distributed residuals)
- The variability about the regression line is the same for all values of x (constant variance)
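The plots used on the following slides to check these assumptions can be produced as follows; a minimal sketch, assuming the fitted statsmodels results object `model` from earlier.

```python
# Sketch: the three standard residual diagnostics.
import matplotlib.pyplot as plt
import scipy.stats as stats

resid = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Linearity and constant variance: residuals vs predicted values
axes[0].scatter(fitted, resid)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="Predicted values", ylabel="Residuals")

# Normality: histogram of the residuals
axes[1].hist(resid, bins=20)
axes[1].set(xlabel="Residuals")

# Normality: normal probability (Q-Q) plot
stats.probplot(resid, plot=axes[2])

plt.tight_layout()
plt.show()
```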
38. Assumptions: linearity (1a)
- Plot the dependent variable against each of the explanatory (independent) variables
- Abdomen circumference: r = 0.8
39. Assumptions: linearity (1b)
- Plot the dependent variable against each of the explanatory (independent) variables
- Height: r = 0.6
40. Assumptions: linearity (1c)
- Plot the dependent variable against each of the explanatory (independent) variables
- Thigh circumference: r = 0.56
41. Assumptions: linearity (2)
- Plot the residuals against the predicted values
- No curvature should be seen in the plot for the linearity assumption to hold
- Assumption satisfied
42. Assumptions: normal residuals (1)
- Normally distributed residuals can be tested by looking at a histogram of the residuals
- Assumption satisfied
43. Assumptions: normal residuals (2)
- Normally distributed residuals can be tested by looking at a normal probability plot
- Assumption satisfied
44. Assumption: constant variance
- Constant variance of the residuals can be assessed by plotting the residuals against the predicted values
- There should be an even spread of residuals around zero
- Assumption satisfied
45. General issues in multiple regression
- Types of explanatory variables
- Exploratory and confirmatory analysis
- Number of explanatory variables
- Number of observations
- Interaction terms
46. Explanatory variables in multiple linear regression
- Explanatory variables can be continuous or categorical
- If a dichotomous variable (coded 0, 1 or 1, 2) is included in the regression equation, the regression coefficient for this variable indicates the average difference in the dependent variable between the two groups defined by the dichotomous variable
- This is adjusted for any differences between the groups with respect to the other variables in the model
- Dummy variables are required for nominal variables
47. One binary and one continuous explanatory variable
- Two independent variables: one binary, one continuous
- y = a + b1 × gender + b2 × age
- If gender is coded 1 = male, 2 = female, then the intercepts differ for males and females: the constant for males is a + b1 × 1 = a + b1, and for females it is a + b1 × 2 = a + b1 + b1
- The slope of the outcome with age is the same for males and females
48. Dummy variables (1)
- Adopted when you have more than two categories and the variable is not ordinal
- e.g. marital status:
  - married/living with partner
  - single
  - divorced/widowed
- As there are three categories, two dummy variables need to be defined
49. Dummy variables (2)

                                d1   d2
  Married/living with partner    0    0
  Single                         1    0
  Divorced/widowed               0    1

- Reference category: married/living with partner
- Both dummy variables must be entered into the regression model to assess whether marital status can explain a significant proportion of the variation in the dependent variable
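A sketch of building these two dummy variables with pandas; listing the categories explicitly makes married/living with partner the first level, which drop_first then removes as the reference.

```python
# Sketch: dummy coding a three-level marital status variable.
import pandas as pd

status = pd.Series(
    pd.Categorical(
        ["married/partner", "single", "divorced/widowed", "single"],
        categories=["married/partner", "single", "divorced/widowed"],
    ),
    name="marital",
)

# drop_first removes the first category, so married/partner is the reference
dummies = pd.get_dummies(status, prefix="marital", drop_first=True)
print(dummies)  # marital_single is d1, marital_divorced/widowed is d2
```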
50. Exploratory vs confirmatory
- Multiple regression is relatively straightforward when we know which variables we wish to have in the model
- Difficulties can occur when we wish to identify, from a large number of variables, those which are related to the dependent variable, and to assess how well the model obtained fits the data
- Exploratory and confirmatory analyses on the same data can be a problem
51. Some further comments
- Number of potential explanatory variables
  - beware of initial screening of variables
  - multiple testing
- Number of observations and number of explanatory variables
  - (rule of thumb: 10 observations per explanatory variable)
- Use common sense when automatic procedures for model selection are used
- Automatic selection procedures are advantageous when explanatory variables are highly correlated
52. And more ...
- There is a risk that the model may be over-optimistic, so the predictive capability of a model should be assessed using an independent data set
- One option is to split the data into two samples
- One sample (half your data) is used to develop the linear model; the model is then tested on the other sample (the remainder of the data)
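A minimal sketch of this split-sample check with scikit-learn, with X and y as in the hypothetical body-fat example:

```python
# Sketch: develop the model on half the data, test it on the other half.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0  # half to develop, half to test
)

model = LinearRegression().fit(X_dev, y_dev)
print("R^2 on development half:", model.score(X_dev, y_dev))
print("R^2 on test half:", model.score(X_test, y_test))
# A much lower R^2 on the test half suggests the model was over-optimistic.
```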
53. Interaction in linear regression
- Interaction terms
- The relationship between an explanatory variable and the dependent variable may differ between the groups defined by another variable
- e.g. the relationship between age and blood pressure may be different for males and females
- An additional explanatory variable would be added to the model to test whether there is a statistically significant difference in the relationship between males and females
54. Interaction
- y = a + b1 × gender + b2 × age + b3 × gender × age
- b3 is the additional slope with age for females relative to the slope with age for males
- With gender coded 1 = male, 2 = female, the slope with age is b2 + b3 for males and b2 + 2b3 for females
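With a formula interface the interaction model can be written directly; a sketch with statsmodels, where the data frame and column names are hypothetical. In the formula, gender * age expands to gender + age + gender:age, matching the equation above.

```python
# Sketch: fitting y = a + b1*gender + b2*age + b3*gender*age.
# `df` and its column names are hypothetical.
import statsmodels.formula.api as smf

model = smf.ols("bp ~ gender * age", data=df).fit()
print(model.params)  # Intercept (a), gender (b1), age (b2), gender:age (b3)

# A significant gender:age coefficient (b3) indicates that the slope
# with age differs between males and females.
print(model.pvalues["gender:age"])
```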
55. Multiple linear regression and ANOVA
- There is a large overlap between linear regression and ANOVA
- Multiple regression where all explanatory variables are categorical is in fact the same as an ANOVA with several factors
- The two approaches give identical results
56. Comparison of statistical techniques (1)
- There are similarities between the t-test, ANOVA, ANCOVA and linear regression
- A simple example to illustrate this is to examine the effect of gender on weight:
  - Option 1: t-test
  - Option 2: ANOVA
  - Option 3: regression
57. t-test
- Difference in mean weight between males and females = 28.2 lbs
- t-test: t = 8.878, df = 215, P < 0.001
58. ANOVA
- The variability in weight that can be explained by differences in gender is significant when compared to the amount of variability remaining unexplained
- Note: variability is partitioned into between-groups and within-groups
- F = 78.2, P < 0.001
59. Linear regression
- The ANOVA table is exactly the same as the earlier one
- Note: variation in weight is partitioned into regression and residual
- F-test: F = 78.2, P < 0.001
60. Linear regression
- Coding: gender (1 = male, 2 = female)
- Using the regression equation above, estimate the average weight for males and the average weight for females
61. Linear regression
- Recall from the t-test: difference in means = 28.2 lbs
- Weight (lbs) = 195.9 - 28.2 × gender
- Mean weight for males (gender = 1) = 167.75 lbs
- Mean weight for females (gender = 2) = 139.58 lbs
- t-test: t = 8.878, df = 215, P < 0.001
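A quick check of the arithmetic; the small discrepancies against the slide's means come from the rounding of the reported coefficients.

```python
# Sketch: recovering the group means from weight = 195.9 - 28.2 * gender.
for gender, label in [(1, "males"), (2, "females")]:
    print(label, 195.9 - 28.2 * gender)
# males 167.7   (slide reports 167.75, from unrounded coefficients)
# females 139.5 (slide reports 139.58)
```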
62. Comparison of statistical techniques (2)
- Is there a difference in weight between males and females after accounting for any difference due to age?
  - Option 1: ANCOVA
  - Option 2: multiple linear regression
- Both will provide the same answer: after adjusting weight for age, there is still a significant gender effect
63. Assignment
- Due in on Monday at 12 noon
- Hand in before, in the break of, or immediately after the Monday 9-12 lecture
- Remember to follow the instructions:
  - Describe the data, choosing the important summary information relevant for each variable
  - Use tables or graphs if they make the message clearer
  - When making comparisons (performing tests), identify and present only the important information
  - Give actual p-values
  - Give the direction of differences, and the size if available