1
Multiple linear regression
  • Gordon Prescott

2
Recap
  • Correlation - indicates the strength of linear
    relationship between two variables
  • Simple linear regression will describe the linear
    relationship between two variables
  • For linear regression to be valid, the
    assumptions of linearity, normality of
    residuals and constant variance must all hold

3
Linear regression equation
  • y = a + bx
  • y = intercept + (slope × x)
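
A minimal sketch (not part of the original slides) of
fitting this equation by ordinary least squares in
Python, assuming the statsmodels library and synthetic
data:

```python
# Fit y = a + b*x by OLS on made-up data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)   # true intercept 2.0, slope 0.5

X = sm.add_constant(x)        # column of 1s for the intercept term
fit = sm.OLS(y, X).fit()
print(fit.params)             # estimated intercept (a) and slope (b)
```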

4
No intercept example
y = bx
5
Good fit? Interpreting R²
Identical regression lines: R² is high on the left
and low on the right
6
Statistical inference in regression
  • The regression coefficients calculated from a
    sample of observations are estimates of the
    population regression coefficients
  • Hypothesis tests and confidence intervals can be
    constructed using the sample estimates to make
    inferences about the population regression
    coefficients
  • For the valid use of these inferential
    approaches, it is necessary to check the
    underlying distribution of the data (linearity,
    normality, constant variance)
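
A sketch of these inferential quantities, again on
synthetic data (statsmodels assumed, not from the
slides):

```python
# Tests and confidence intervals for the sample estimates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.pvalues)               # t-tests of H0: population coefficient = 0
print(fit.conf_int(alpha=0.05))  # 95% CIs for the population coefficients
```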

7
Multiple linear regression
  • Situations frequently occur when we are
    interested in the dependency of a variable on
    several explanatory (independent) variables.
  • The joint influence of the variables, taking into
    account possible correlations among them, may be
    investigated using multiple regression
  • Multiple regression can be extended to any number
    of variables, although it is recommended that the
    number be kept reasonably small

8
Partitioning of variation in dependent variable
(Diagram: variation in systolic blood pressure
partitioned between gender and age)
9
Situations where multiple linear regression is
appropriate
  • To explore the dependency of one outcome variable
    on two or more explanatory variables
    simultaneously
  • development of a prognostic index
  • To study the relationship between two variables
    after removing the possible effects of other
    nuisance variables
  • To adjust for differences in confounding factors
    between groups

10
Research questions
  • Data on cystic fibrosis patients was collected.
    The researchers were interested in looking at
    what factors are related to patients' malnutrition
    (as measured by PEmax). Data was available on
    age, sex, height, weight, BMI, lung capacity,
    FEV1, and other lung function variables.
  • Researchers would like to predict a person's
    percentage body fat using measurements of biceps
    circumference, abdomen circumference, height,
    weight and age of the subject.

11
Research questions
  • To investigate the effect of parental birth
    weight on infant birth weight. Strong
    relationship found. Other explanatory variables
    such as maternal height, number of previous
    children, maternal smoking, weight gain during
    pregnancy (all of which are known to be
    associated with infant birth weight) were
    collected.
  • Multiple regression analysis was conducted to
    assess whether the observed association between
    parental birth weight and infant birth weight
    could be explained by inter-relationships between
    parental birth weight and the additional
    variables. It might be that mothers with low
    birth weights were more likely to smoke.

12
Research question
  • Two groups (non-randomised) of patients are
    receiving two different drug treatments for
    hypertension.
  • The effectiveness of the drugs is to be assessed
    by measuring each patient's blood pressure six
    months following treatment.
  • A comparison of the characteristics of the
    patients in the two groups indicates that
    patients on drug A are older than those on drug
    B.
  • There is a known relationship between age and
    blood pressure.
  • Multiple linear regression is used to adjust for
    (remove) the effect of age on blood pressure
    before carrying out a comparison of the two
    treatments.

13
Model
  • ŷ = a + b1x1 + b2x2 + b3x3 + ... + bkxk
  • y - dependent variable
  • ŷ - predicted value of dependent variable
  • a - intercept (constant)
  • b1 - regression coefficient for x1
  • x1 - explanatory (independent) variable
  • b2 - regression coefficient for x2
  • x2 - explanatory (independent) variable
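
A sketch of this model with the statsmodels formula
interface; the data frame and names x1, x2 are
invented here (the printed R² anticipates the next
slide):

```python
# Fit y-hat = a + b1*x1 + b2*x2 on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)     # a (Intercept), b1, b2
print(fit.rsquared)   # R^2, the coefficient of multiple determination
```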

14
Multiple correlation
  • R - coefficient of multiple correlation
  • It is the correlation between Y and the combined
    predictors (x1, x2, ..., xk)
  • R2 - coefficient of multiple determination
  • It is the proportion of variance in Y that can be
    accounted for by the combined predictors (x1, x2,
    ..., xk)

15
Collinearity
  • Occurs when the explanatory variables are
    correlated to one another
  • Extreme multicollinearity occurs when one
    explanatory variable is a linear function of some
    of the other explanatory variables
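
A sketch of one common diagnostic, the variance
inflation factor (VIF). The height/weight pair below
mimics the r = 0.92 correlation of the next slide; the
"VIF > 10 is problematic" rule of thumb is a
convention, not something stated in these slides:

```python
# Quantify collinearity among explanatory variables with VIFs.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
height = rng.normal(150, 15, 100)
weight = 0.8 * height + rng.normal(0, 5, 100)   # strongly correlated pair

X = sm.add_constant(pd.DataFrame({"height": height, "weight": weight}))
for i, name in enumerate(X.columns[1:], start=1):   # skip the constant
    print(name, variance_inflation_factor(X.values, i))
```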

16
Collinearity Cystic Fibrosis example
  • Data on cystic fibrosis patients. What factors
    are related to patients malnutrition (measured by
    PEmax)?
  • A regression model included height and weight
    (r0.92) as explanatory variables and PEmax
    (index of malnutrition) as the dependent variable
  • Both height and weight were highly correlated
    with PEmax
  • The model with these two variables accounted for
    40 of the variation in PEmax
  • In the model, neither of the coefficients for
    weight or height were significant
  • Including both these highly correlated variables
    obscured their relationship with PEmax

17
Criteria for inclusion in the model
  • The variable should account for a significant
    proportion of the variation in the dependent
    variable
  • This can be assessed by either of the following
    two comparable tests
  • F test from the ANOVA table
  • t-test of the regression coefficient (B)

18
Criteria for variable to be entered
  • F-test
  • H0: The independent (explanatory) variable does
    not account for any of the variability in body
    fat in the population
  • F ratio ≈ 1
  • H1: Abdomen circumference does account for some of
    the variability in body fat in the population
  • F ratio > 1

19
Criteria for variable to be entered
  • t-test
  • H0: The regression coefficient for the
    explanatory variable is equal to zero
  • (b1 = 0)
  • H1: The regression coefficient for the
    explanatory variable is not equal to zero
  • (b1 ≠ 0)
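
A sketch of both tests on a single-variable model with
made-up abdomen/body fat data; with one explanatory
variable the two tests agree exactly (F = t²):

```python
# F-test of the model and t-test of the coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
abdomen = rng.normal(95, 10, 100)
body_fat = 0.6 * abdomen - 35 + rng.normal(0, 4, 100)

fit = sm.OLS(body_fat, sm.add_constant(abdomen)).fit()
print(fit.fvalue, fit.f_pvalue)         # F-test from the ANOVA table
print(fit.tvalues[1], fit.pvalues[1])   # t-test of H0: b1 = 0
```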

20
Selection of explanatory variables
  • Methods by which explanatory variables are
    selected for inclusion in the regression model
  • Enter
  • Forward selection
  • Backward selection
  • Stepwise selection
  • The use of different selection methods on the
    same dataset may result in different regression
    models

21
Enter
  • The explanatory (independent) variables are
    forced into the model
  • Examination of the output from the regression
    model will indicate whether each of the
    explanatory variables are explaining a
    significant proportion of the variation in the
    dependent variable
  • We can test whether the coefficient for each
    explanatory variable differs significantly from 0

22
Automatic selection procedures
  • Should be cautious in the use of these procedures
  • These procedures should be used in combination
    with the data analyst's knowledge and common sense
  • Models selected using these automatic methods
    alone are based on mathematical relationships and
    may not make biological/clinical sense

23
Forward selection
  • Simple linear regressions carried out for each
    explanatory variable
  • The one variable which accounts for the most
    variability is selected.
  • Linear regressions with all pairs of explanatory
    variables (including the first variable selected)
    are carried out
  • The regression which accounts for the most
    variability in the dependent variable is selected
  • and so on ...
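
A sketch of this procedure in Python (statsmodels
assumed; the helper name forward_select and the 0.05
entry threshold are my own choices):

```python
# Forward selection: at each step add the candidate that most
# increases R^2, stopping when the best candidate is not significant.
import statsmodels.api as sm

def forward_select(df, response, candidates, alpha=0.05):
    candidates = list(candidates)
    selected = []
    while candidates:
        trials = []
        for var in candidates:
            fit = sm.OLS(df[response],
                         sm.add_constant(df[selected + [var]])).fit()
            trials.append((fit.rsquared, fit.pvalues[var], var))
        rsq, pval, best = max(trials)   # highest R^2 wins
        if pval >= alpha:               # not significant: stop adding
            break
        selected.append(best)
        candidates.remove(best)
    return selected
```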

24
Backward selection
  • Multiple regression is performed using all the
    explanatory variables
  • Each explanatory variable is dropped in turn, and
    the one that contributes least to the model is
    removed
  • Each remaining explanatory variable is then
    dropped in turn from this reduced model
  • The next one which contributes least to the model
    is removed
  • and so on ...
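
A companion sketch for backward elimination (same
assumptions; backward_select and the 0.05 removal
threshold are illustrative):

```python
# Backward elimination: start with every variable and repeatedly
# drop the one with the largest non-significant p-value.
import statsmodels.api as sm

def backward_select(df, response, variables, alpha=0.05):
    variables = list(variables)
    while variables:
        fit = sm.OLS(df[response],
                     sm.add_constant(df[variables])).fit()
        pvals = fit.pvalues.drop("const")   # coefficient p-values only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:           # everything still significant
            break
        variables.remove(worst)             # drop the weakest contributor
    return variables
```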

25
Stepwise selection
  • This approach combines both forward and backward
    selection procedures
  • A variable may be added to the model, but at each
    step all variables in the model are considered
    for exclusion
  • Can be forward or backward stepwise selection
  • SPSS adopts a forward stepwise procedure

26
Stepwise selection
  • At each stage in a stepwise selection procedure,
    explanatory variables already entered in the
    model are assessed to see whether they still
    account for a significant proportion of the
    variation in the dependent variable
  • At each stage in a stepwise procedure
  • all explanatory variables not in the model are
    assessed for inclusion
  • all explanatory variables in the model are
    assessed for removal

27
Example
  • Recall the fitness gym example
  • Dependent variable - percentage body fat
  • Explanatory variables
  • age
  • weight
  • height
  • hip, biceps, neck, knee, forearm, abdomen
    circumference measurements

28
Example
  • The aim is to produce an equation which would
    allow us to predict percentage body fat based on
    alternative measurements
  • Selection procedure
  • Stepwise
  • In the SPSS dialogue box, enter all the
    explanatory (independent) variables you wish to
    be considered for inclusion and then select
    stepwise as the method

29
SPSS output: multiple regression
30
SPSS output: multiple linear regression
  • Each model produced is reported
  • The R square for each model indicates the
    proportion of variability in the dependent
    variable accounted for by that model
  • Note the standard error of the estimate is
    reduced with each additional variable entered

31
SPSS output: multiple regression
32
SPSS output: multiple regression
33
Variables not in the model
34
SPSS output: multiple regression
35
Prediction
  • The predicted percentage body fat for a man with
    an abdomen circumference of 100 cm, height of 168
    cm and a thigh circumference of 57 cm
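
A sketch of the arithmetic behind this prediction. The
coefficients below are placeholders, NOT the fitted
values from the SPSS output (which is not reproduced
in this transcript); substitute your own model's
intercept and slopes:

```python
# Plug the measurements from the slide into the fitted equation.
a, b_abd, b_ht, b_thigh = -30.0, 0.9, -0.2, 0.3   # hypothetical estimates

abdomen, height, thigh = 100, 168, 57             # cm, from the slide
body_fat = a + b_abd * abdomen + b_ht * height + b_thigh * thigh
print(f"predicted % body fat: {body_fat:.1f}")
```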

36
Checking assumptions
  • After a model has been fitted to the data, it is
    essential to check that the assumptions of
    multiple linear regression have not been violated

37
Checking assumptions
  • There should be a linear relationship between the
    dependent variable and ALL continuous/discrete
    explanatory variables.
  • For any value of x, the observed values of y
    should be normally distributed around the
    predicted value (normally distributed residuals)
  • The variability of y around the predicted values
    is the same for all values of x (constant
    variance)

38
Assumptions Linearity (1a)
  • Plot the dependent variable against each of the
    explanatory (independent) variables
  • Abdomen circumference
  • r = 0.8

39
Assumptions Linearity (1b)
  • Plot the dependent variable against each of the
    explanatory (independent) variables
  • Height
  • r = 0.6

40
Assumptions Linearity (1c)
  • Plot the dependent variable against each of the
    explanatory (independent) variables
  • Thigh circumference
  • r = 0.56

41
Assumptions Linearity (2)
  • Plot the residuals against the predicted values
  • No curvature in the plot should be seen for the
    linearity assumption to hold
  • Assumption satisfied

42
Assumptions Normal residuals (1)
  • Normally distributed residuals can be tested by
    looking at a histogram of the residuals
  • Assumption satisfied

43
Assumptions Normal residuals (2)
  • Normally distributed residuals can be tested by
    looking at a normal probability plot
  • Assumption satisfied

44
Assumption Constant variance
  • Constant variance of the residuals can be
    assessed by plotting the residuals against the
    predicted values
  • There should be an even spread of residuals
    around zero
  • Assumption satisfied
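
A sketch gathering the three checks above (residuals
vs predicted values, histogram of residuals, normal
probability plot), assuming matplotlib and statsmodels
on synthetic data:

```python
# Residual diagnostics for a fitted OLS model.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(60, 120, 100)
y = 0.6 * x - 35 + rng.normal(0, 4, 100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

resid, fitted = fit.resid, fit.fittedvalues

plt.scatter(fitted, resid)            # linearity + constant variance:
plt.axhline(0, color="grey")          # no curvature, even spread around 0
plt.xlabel("predicted values"); plt.ylabel("residuals")
plt.show()

plt.hist(resid, bins=20)              # normality: roughly bell-shaped
plt.show()

sm.qqplot(resid, line="s")            # normality: points near the line
plt.show()
```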

45
General issues multiple regression
  • Types of explanatory variables
  • Exploratory and confirmatory analysis
  • Number of explanatory variables
  • Number of observations
  • Interaction terms

46
Explanatory variables in multiple linear
regression
  • Explanatory variables can be continuous or
    categorical
  • If a dichotomous variable (coded 0, 1 or 1, 2) is
    included in the regression equation, the
    regression coefficient for this variable
    indicates the average difference in the dependent
    variable between the two groups defined by the
    dichotomous variable
  • This is adjusted for any differences between the
    groups with respect to the other variables in the
    model
  • Dummy variables are required for nominal variables

47
One binary explanatory variable and one
continuous explanatory variable
  • Two independent variables: one binary, one
    continuous

y = a + b1 × gender + b2 × age
If gender is coded (1 = male, 2 = female), then the
intercepts differ for males and females. The
constant for males is a + b1 × 1 = a + b1, and for
females it is a + b1 × 2 = a + b1 + b1.
The slope of the outcome with age is the same for
males and females.
48
Dummy variables (1)
  • Adopted when you have more than two categories
    and the variable is not ordinal
  • e.g. marital status
  • married/living with partner
  • single
  • divorced/widowed
  • As there are three categories, two dummy
    variables need to be defined

49
Dummy variables (2)
                                   d1    d2
    Married/living with partner     0     0
    Single                          1     0
    Divorced/widowed                0     1

  • Reference category: married/living with partner
  • Both dummy variables must be entered into the
    regression model to assess whether marital status
    can explain a significant proportion of the
    variation in the dependent variable
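
A sketch of building these dummy variables with pandas
(category labels shortened here):

```python
# Two dummy columns for three marital-status categories.
import pandas as pd

status = pd.Series(["married", "single", "divorced", "married", "single"])
dummies = pd.get_dummies(status, prefix="d").astype(int)

# Keep d_single and d_divorced; d_married is the reference category
X = dummies[["d_single", "d_divorced"]]
print(X)
```

Both dummy columns then enter the regression together,
as the slide requires.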

50
Exploratory vs confirmatory
  • Multiple regression is relatively
    straightforward when we know which variables we
    wish to have in the model
  • Difficulties can occur when we wish to identify
    from a large number of variables those which are
    related to the dependent variable and assess how
    well the model obtained fits the data
  • Exploratory and confirmatory analyses on the same
    data can be a problem

51
Some further comments
  • Number of potential explanatory variables
  • beware of initial screening of variables
  • Multiple testing
  • Number of observations and number of explanatory
    variables
  • (Rule of thumb - 10 observations per explanatory
    variable)
  • Use common sense when automatic procedures for
    model selection are used
  • Automatic selection procedures are advantageous
    when explanatory variables are highly correlated

52
And more ...
  • There is a risk that the model may be
    over-optimistic, so the predictive capability of a
    model should be assessed using an independent
    data set
  • One option is to split the data into two samples
  • One sample (half your data) is used to develop
    the linear model, then the model is tested on the
    other sample (remainder of data)
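
A sketch of this split-sample check (scikit-learn's
splitter assumed; data and names are synthetic):

```python
# Develop the model on one half, assess prediction on the other half.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 3.0 * df["x"] + rng.normal(size=200)

train, test = train_test_split(df, test_size=0.5, random_state=0)
fit = smf.ols("y ~ x", data=train).fit()

pred = fit.predict(test)
ss_res = ((test["y"] - pred) ** 2).sum()
ss_tot = ((test["y"] - test["y"].mean()) ** 2).sum()
print("out-of-sample R^2:", 1 - ss_res / ss_tot)
```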

53
Interaction in linear regression
  • Interaction terms
  • The relationship between an explanatory and the
    dependent variable may differ between groups
    defined by another variable
  • e.g. the relationship between age and blood
    pressure may be different for males and females
  • An additional explanatory variable would be added
    to the model to test whether there is a
    statistically significant difference in the
    relationship between males and females

54
Interaction
y = a + b1 × gender + b2 × age + b3 × gender × age
b2 is the slope with age for males (coded 1); b2 +
b3 is the slope with age for females (coded 2); b3
is the additional slope with age for females
relative to the slope with age for males
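
A sketch of fitting this interaction model. In a
statsmodels formula, "gender * age" expands to
gender + age + gender:age; the gender:age coefficient
is b3, the male-female difference in slope with age
(data are synthetic):

```python
# Interaction between a grouping variable and a continuous one.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "gender": rng.integers(1, 3, 200),       # 1 = male, 2 = female
    "age": rng.uniform(20, 70, 200),
})
df["sbp"] = (100 + 0.4 * df["age"]
             + 0.3 * df["gender"] * df["age"]
             + rng.normal(0, 5, 200))

fit = smf.ols("sbp ~ gender * age", data=df).fit()
print(fit.params)   # Intercept, gender (b1), age (b2), gender:age (b3)
```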
55
Multiple linear regression and ANOVA
  • Large overlap between linear regression and ANOVA
  • Multiple regression where all explanatory
    variables are categorical is in fact the same as
    an ANOVA with several factors
  • The two approaches give identical results

56
Comparison of statistical techniques (1)
  • There are similarities between t-test, ANOVA,
    ANCOVA and linear regression
  • A simple example to illustrate this would be to
    examine the effect of gender on weight
  • Option 1 - t-test
  • Option 2 - ANOVA
  • Option 3 - regression

57
T-test
  • Difference in mean weight between males and
    females = 28.2 lbs
  • t-test
  • t = 8.878, df = 215, P < 0.001

58
ANOVA
  • The variability in weight that can be explained
    by differences in gender is significant when
    compared to the amount of variability remaining
    unexplained
  • Note variability is partitioned into between and
    within groups
  • F = 78.2, P < 0.001

59
Linear regression
  • The ANOVA table is exactly the same as the
    earlier one
  • Note variation in weight is partitioned into
    regression and residual
  • F-test: F = 78.2, P < 0.001

60
Linear regression
  • Coding
  • Gender (1 = male, 2 = female)
  • Using the regression equation above, estimate the
    average weight for males and the average weight
    for females.

61
Linear regression
  • Recall from the t-test: difference in means =
    28.2 lbs
  • Weight (lbs) = 195.9 - 28.2 × gender
  • Mean weight for males (gender = 1) = 167.75 lbs
  • Mean weight for females (gender = 2) = 139.58 lbs
  • t-test: t = 8.878, df = 215, P < 0.001
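
A quick arithmetic check of this equation (the values
come from the slide; the minus sign is implied by the
group means):

```python
# weight = 195.9 - 28.2 * gender, gender coded 1 = male, 2 = female
for gender, label in [(1, "males"), (2, "females")]:
    print(label, 195.9 - 28.2 * gender)
# males 167.7, females 139.5 -- matching the slide's group means
# up to rounding of the published coefficients
```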

62
Comparison of statistical techniques (2)
  • Is there a difference in weight between males and
    females after accounting for any difference due
    to age?
  • Option 1 - ANCOVA
  • Option 2 - Multiple linear regression
  • Both will provide the same answer, that after
    adjusting weight for age, there is still a
    significant gender effect

63
Assignment
  • Due in on Monday at 12 noon
  • Before, in the break of, or immediately after the
    Monday 9-12 lecture
  • Remember to follow the instructions
  • Describe the data, choosing the important summary
    information relevant for each variable
  • Use tables or graphs if the message is clearer
  • When making comparisons (performing tests)
    identify and present only the important
    information
  • Give actual p-values
  • Give direction of differences and size if
    available