Title: Revision of last lecture
1. Revision of last lecture
2. Relative Risk
- For cohort studies the relative risk (RR) is used.
- The relative risk is the risk or rate of disease in the exposed group relative to the risk or rate of disease in the unexposed group.
- The prevalence of disease can be found using a cohort study.
- A RR of 1 means that the risk of the event is equal in the groups being compared.
- A RR > 1 suggests the event (disease) is more common in the exposed group than the non-exposed.
- A RR < 1 suggests the event is less common in the exposed group than the non-exposed.
3. Odds ratio
- For case-control studies the odds ratio is most often used to show relationships.
- The odds ratio is the ratio of the odds of exposure in the diseased group compared to the odds of exposure in the non-diseased group.
- An odds ratio of 1 means that the odds of the event are equal in the groups being compared.
- An odds ratio > 1 suggests the event (exposure) is more common in the diseased group than the non-diseased.
- An odds ratio < 1 suggests the event is less common in the diseased group than the non-diseased.
4. General comments
- The odds, risks and rates are the probabilities or chances of an individual having a particular event such as death or an illness.
- The odds ratio and the relative risk are very similar when a disease or exposure is rare.
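To make the rare-event point concrete, here is a small Python sketch with made-up cohort counts (illustrative numbers, not from the lecture) showing that the RR and OR nearly coincide when the disease is rare:

```python
# Hypothetical 2x2 cohort table for a rare disease (illustrative numbers):
#                diseased   healthy
# exposed             10      4990
# unexposed            5      4995
a, b = 10, 4990   # exposed: diseased, healthy
c, d = 5, 4995    # unexposed: diseased, healthy

# Relative risk: risk of disease in exposed / risk in unexposed
rr = (a / (a + b)) / (c / (c + d))

# Odds ratio: odds of disease in exposed / odds in unexposed
orr = (a / b) / (c / d)

print(rr, orr)  # both close to 2 because the disease is rare
```

When the disease is common the two measures diverge, which is why the OR is reserved for case-control designs where the RR cannot be estimated directly.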
5. Transformations and logistic regression
6. Two separate topics
- Two entirely different methods for entirely different situations.
- Transformations:
  - if continuous health outcome data are not Normally distributed
  - if the relationship between two continuous/discrete variables is not linear, e.g. a curve
- Logistic regression:
  - if the health outcome is binary
7. Why transform?
- To normalise the distribution of the data
- To eliminate heterogeneity of variance (non-constant variance)
- To linearise a relationship between two continuous/discrete variables
8. Types of transformations
- Logarithmic (can use the natural log, also known as loge, or log to the base 10, log10)
- Square root (√x)
- Power (x², x³)
- Inverse or reciprocal (1/x)
- Logit
  - that is loge(p/(1−p)), which is the log of the odds
- There are some other transformations but these are the most common.
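The common transformations above can be sketched with Python's standard library; the value 100 and the probability 0.5 are just illustrations:

```python
import math

def logit(p):
    # log odds: log_e(p / (1 - p)), defined for 0 < p < 1
    return math.log(p / (1 - p))

x = 100.0
transforms = {
    "log10":      math.log10(x),   # 2.0
    "loge":       math.log(x),     # ~4.605
    "sqrt":       math.sqrt(x),    # 10.0
    "square":     x ** 2,          # 10000.0
    "reciprocal": 1 / x,           # 0.01
}
print(transforms)
print(logit(0.5))  # odds of 1 give log odds of 0
```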
9. Advantages and disadvantages of transformations
- Advantages
  - May allow the use of parametric methods
  - More informative output
  - Higher power
- Disadvantage
  - Careful interpretation of the units of the transformed variable is required
10. Transformations
- If the data are positively skewed we can try a logarithmic, reciprocal or square root transformation, depending on the extent of the problem, to remedy the situation.
  - Stretches small values, squeezes large values
- If the data are negatively skewed we can try a power transformation (x² or x³), depending on the extent of the problem.
  - Squeezes small values, stretches large values
11. Logarithmic transformation
- Definition:
  - U = log10(X) means X = 10^U, e.g. 100 = 10^2
  - U = loge(X) = ln(X) means X = e^U = exp(U) = invlog(U), e.g. 100 = e^4.61 = exp(4.61) = invlog(4.61)
12. Logarithmic transformation - Normality
[Figure: histogram of waiting time]
13. Log (waiting time)
[Figure: histogram of log-transformed waiting time]
14. Unequal variability
The standard deviation for group 1 is much larger than the SD in group 2. A t-test assuming equal variances would be inappropriate.
15. Following log transformation
- Waiting time (days): group 1 mean 108.2 (SD 63.45); group 2 mean 57.4 (SD 37.65)
- Log-transformed waiting time (log days): group 1 mean 4.46 (SD 0.74); group 2 mean 3.86 (SD 0.61)
16. Transformation and interpretation
- It is important to note that if there are several problems with the data, we would only need to use one transformation to attempt to remedy the situation.
- The mathematical properties of the log transformation allow sensible interpretation of the parameter and its 95% CI.
- The other transformations (x², inverse, etc.) do not have such an interpretation.
17. Interpretation of logarithmic transformation
- Waiting time example:
- After log transforming the waiting time data, we can compute the mean of the log waiting time.
- We do not want to report the mean waiting time in log days.
- Therefore we can back-transform the data into the original units.
- Following a logarithmic transformation, we antilog (or exponentiate) the mean of the logged data.
- This mean is known as the geometric mean.
18. Geometric mean
- Waiting time (days): mean = 81.0 days, median = 53.9 days
- Log-transformed waiting time: mean = 4.17 log days
- Geometric mean = exp(4.17) = 64.7 days
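The geometric mean computation can be sketched in Python with hypothetical waiting times (the lecture's raw data are not shown):

```python
import math
import statistics as st

# Hypothetical waiting times in days (illustrative only)
waits = [12, 20, 35, 54, 60, 90, 150, 220]

log_waits = [math.log(w) for w in waits]
mean_log = st.mean(log_waits)     # mean on the log scale
geo_mean = math.exp(mean_log)     # back-transform: the geometric mean

print(st.mean(waits), geo_mean)   # geometric mean sits below the arithmetic mean
```

For positively skewed data the geometric mean is always below the arithmetic mean, which is why it is the more representative summary here.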
19. Comparison of two groups
- Two independent groups of patients are to be compared with respect to waiting time.
- Waiting time is positively skewed.
- A log transformation is applied to the data.
- An independent-groups t-test is then applied to the log-transformed data.
20. Raw data and log-transformed data
21. Output from SPSS
- The mean log waiting time is 4.17 in group 1 compared with 3.99 in group 2.
- The geometric mean waiting time is exp(4.17) = 64.7 days in group 1 compared with exp(3.99) = 54.1 days in group 2.
22. Independent t-test
- H0: mean(log waiting time) in group 1 = mean(log waiting time) in group 2
- Is there a statistically significant difference in waiting time?
24. Independent t-test (continued)
- Mean difference in log waiting time = 0.1768 log days, with 95% confidence interval for the mean difference (0.08 to 0.27) log days.
25. Interpretation of t-test
- However, it would be more informative to report in natural units, i.e. days.
- We therefore need to exponentiate (inverse log) the results obtained.
- First we need to consider some mathematics.
- The transformation has altered every value x into log x.
- The estimate of the difference from the independent t-test relates to:
  - log x1 − log x2 = log(x1 / x2), a mathematical identity
  - where x1 and x2 are the means in groups 1 and 2 respectively
- The inverse log (exponential) of the mean difference (on the log scale) from the independent t-test is therefore x1 / x2.
- This denotes the ratio of waiting time in group 1 relative to the waiting time in group 2.
26. Interpretation of t-test (continued)
- Mean difference = 0.1768 log days
- Ratio of means = exp(0.1768) ≈ 1.19
- 95% confidence interval for the mean difference: (0.08 to 0.27) log days
- 95% confidence interval for the ratio of means: (1.08 to 1.31)
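Exponentiating the mean difference and its CI limits is a one-liner in Python (rounding at intermediate steps explains any small differences from the slide's figures):

```python
import math

# Results reported for the t-test on log waiting times
diff = 0.1768                  # mean difference in log days
ci_low, ci_high = 0.08, 0.27   # 95% CI limits in log days

ratio = math.exp(diff)                       # ratio of geometric means
ci = (math.exp(ci_low), math.exp(ci_high))   # 95% CI for the ratio

print(round(ratio, 2), round(ci[0], 2), round(ci[1], 2))
```

Because the CI for the ratio excludes 1, the difference between groups is statistically significant at the 5% level.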
27. Transformations for non-linearity
- It is often found that the relationship between the dependent and explanatory (or independent) variables is non-linear.
- There are two approaches to modelling the relationship:
  - Transform the variable(s) to produce a linear relationship
  - Curve-fitting techniques
    - Polynomial regression
28. Transformations for non-linearity
- When you have continuous variables, any of the transformations described previously can be applied (e.g. logarithmic, square, inverse, etc.).
- When the variable is binary, a logit or probit transformation can be applied to the data.
29. Log transformation for non-linearity
30. Non-linear relationships
- Curve-fitting techniques
- Polynomial regression
  - This approach incorporates the fact that the researcher is aware of the nature of the relationship between the two variables.
  - The researcher uses this knowledge to decide on the most appropriate relationship.
31. Curve-fitting techniques
- If the researcher did not know the exact nature of the relationship between the variables, but knew that the relationship was non-linear, then curve-fitting techniques could be adopted.
- Many statistical packages have these facilities, and there are packages whose sole function is to perform curve-fitting procedures.
32. Polynomial regression
- This type of regression is used if we have some idea about the nature of the relationship between the two variables.
- It would not be sensible to try to fit a straight line to data that are obviously non-linear.
- A polynomial regression model is of the form:
  - y = α + β1X + β2X² + β3X³ + ...
33. Example
- The rate of photosynthesis of an Antarctic species of grass was determined for a series of environmental temperatures.
- The aim of the researchers was to fit a line of best fit to these data so that they could use it to predict the temperature at which photosynthesis was maximal.
34. Net photosynthesis rate versus temperature
[Figure: scatter plot of photosynthesis rate (0 to 100) against temperature (-5 to 35 degrees Celsius)]
35. Example: quadratic regression
36. Example: quadratic regression
The coefficient for temperature squared (−0.249) is statistically significant, indicating that a quadratic relationship appears to fit the data.
37. A note of caution
- The regression equation we find for these data is:
  - y = 46.37 + 6.77x − 0.249x²
- However, when fitting these data in SPSS we will find that x and x² are highly correlated.
- This can lead to collinearity problems, e.g. strange p-values.
- What we need to do is centre the x-variable first, by subtracting the mean value from the individual x's, to reduce the high correlation between x and x².
38. Collinearity and centering
39. A typical non-linear relationship
- For example, population growth can be represented by the equation:
  - N(t) = N(0)e^(rt)
- The population at time t, N(t), is related to the size of the population at time 0, N(0), multiplied by the exponential of the growth rate, r, times the time period involved, t.
- The relationship is obviously non-linear; we can tackle this problem by taking a logarithmic transformation.
40. Tackling non-linear relationships
- This can be linearised by taking natural logs of both sides of the equation.
- Taking the natural log of our population growth equation gives:
  - loge N(t) = loge N(0) + r × t
41. Logit transformation
- To examine the relationship between the presence (or absence) of disease and a number of risk factors.
- To compare two treatment groups with respect to re-treatment (yes/no) after correcting for characteristics of the patients (which may be unequally distributed across the two groups).
- To generate a prediction model for recurrence of hernia based on characteristics of patients, treatment, etc.
42. Binary health outcomes
- Logit transformation
- Logistic regression
  - regression for a binary outcome
43. Probability of an event
- pi is the probability of a particular outcome (e.g. success).
- 0 ≤ pi ≤ 1
- We require a function that will transform pi onto the range (−∞ to ∞).
44. Logit transformation
- −∞ < logit(p) < ∞
- That is, the logit transformation will result in a variable that can take any value in a range (i.e. it is continuous).
- logit(p) = loge(p/(1−p))
- Recall, odds = p/(1−p).
- Therefore this is the log of the odds.
- It is known as the log odds.
45. Log odds form of the logistic model
- A very useful property of the logistic model is that by applying a transformation to both sides of the equation, it becomes linear:
  - pi = e^(a + b1x1) / (1 + e^(a + b1x1))
  - logit(pi) = log odds = a + b1x1
- To interpret the coefficients we need to transform them back to the original units (i.e. antilog/exponentiate).
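A minimal Python sketch of the logit and its inverse, with made-up coefficients a and b1 (not fitted to any data):

```python
import math

def logit(p):
    # log odds: log_e(p / (1 - p))
    return math.log(p / (1 - p))

def inv_logit(z):
    # back-transform: p = e^z / (1 + e^z)
    return math.exp(z) / (1 + math.exp(z))

# Made-up coefficients and covariate value, for illustration only
a, b1, x1 = -2.0, 0.8, 3.0

z = a + b1 * x1       # linear predictor on the log odds scale
p = inv_logit(z)      # a valid probability, strictly between 0 and 1
print(round(p, 3))
```

Whatever value the linear predictor takes, the inverse logit maps it back into (0, 1), which is exactly why the model is suitable for probabilities.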
46. Logistic regression
- This type of regression is used if the dependent variable is a binary (dichotomous) variable, e.g. presence or absence of disease.
- Under these circumstances classic multiple linear regression is unsuitable.
- This method can be used to compare the characteristics of subjects with or without a particular disease.
- Very commonly used in epidemiological studies.
47. Example of logistic regression
- Information was available on 111 consecutive patients admitted to ICU. The researcher wished to investigate the relationship between the patients' vital status (lived/died) and the patients' characteristics on entry to ICU.
- Strategy for analysis:
  - Crosstabulations
  - Calculation of the odds of an event occurring
  - Logistic regression
48. Descriptive statistics
49. Crosstabulation
50. Interpretation of crosstabulation
- There is a borderline statistically significant association between type of admission and whether a patient leaves the ICU alive.
- A greater proportion of emergency admission patients died (40%) compared to elective admissions (12%).
- You can also express the association between type of admission and vital status using an odds ratio.
51. Calculation of the odds of dying
- The odds of an emergency admission patient dying whilst in ICU = 38/56 = 0.679
- The odds of an elective admission patient dying whilst in ICU = 2/15 = 0.133
- Therefore, the odds ratio for type of admission = 0.679/0.133 = 5.11
- The odds of a patient dying whilst in ICU are 5 times greater if the patient was admitted as an emergency rather than as an elective patient.
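These hand calculations are easy to reproduce in Python; note that dividing the exact odds gives about 5.09, while dividing the rounded odds (0.679/0.133) gives the 5.11 shown above:

```python
import math

# 2x2 table from the ICU example: died vs survived by admission type
died_emergency, survived_emergency = 38, 56
died_elective, survived_elective = 2, 15

odds_emergency = died_emergency / survived_emergency   # ~0.679
odds_elective = died_elective / survived_elective      # ~0.133
odds_ratio = odds_emergency / odds_elective            # ~5.09

# The logistic regression coefficient for a single binary covariate is the
# log of this odds ratio (close to the 1.625 reported by SPSS, which is
# computed from the fitted model rather than the rounded odds):
b = math.log(odds_ratio)
print(round(odds_ratio, 2), round(b, 2))
```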
52. Logistic regression
- Type of surgery (TYPE): 0 = elective, 1 = emergency
- The coefficient for TYPE is 1.625. This indicates that the difference in log odds between an elective and an emergency admission is 1.625.
- The increase in the odds for an emergency admission is by a factor of exp(1.625).
- That is, the odds ratio (emergency/elective) is exp(1.625) = 5.079.
- This is indicated in the final column of the table above.
- Recall, this is close to what was computed by hand for the 2×2 table.
53. How good is the model at predicting outcome?
- The classification table indicates how good the logistic regression model is at predicting outcome.
- The model is better at predicting those who died, and overall 65% of cases are correctly predicted using type of admission alone.
54. Multiple logistic regression
- Logistic regression can be extended to consider a number of explanatory variables simultaneously.
- Explanatory variables can be continuous, ordinal or categorical.
- SPSS will define dummy variables for you.
- The relative contribution of each explanatory variable can be assessed.
- Increasing the number of explanatory variables may improve the prediction capability of the model.
- In our ICU example, the age of the patient is added to the model.
- The coefficient for type of admission has now been adjusted for the effect of age.
- This results in an adjusted odds ratio.
55. Multiple logistic model
- Including age in the model has resulted in an adjusted OR for type of 6.5. Thus, after adjusting for age, the odds of dying for patients admitted as emergency cases are 6.5 times those of a patient admitted as an elective patient.
- After adjusting for age, patients admitted as emergencies are approximately 6.5 times more likely to die in ICU compared with those admitted electively.
56. Why transform?
- To normalise the distribution of the data
- To eliminate heterogeneity of variance (non-constant variance)
- To linearise a relationship between two variables
- The log transformation, for positively skewed data, is the most useful and the easiest to interpret.
57. What is logistic regression?
- Very similar to ordinary regression, but
  - the dependent variable is binary (instead of continuous)
58. Uses of logistic regression
- Use the regression equation to predict the probability of an outcome given values of the explanatory variables.
- Determine which explanatory variables influence an outcome.
- Adjust analyses for confounding variables (e.g. age, gender, etc.).