Title: Logistic Regression
1 Logistic Regression
2 Outline
- Final Exam initial questions
- Research Projects
- Action Plan
- Format (what are they expected to look like?)
- When and Why do we Use Logistic Regression?
- Transforming the dependent variable
- Interpreting the coefficients
3 Research Project Action Plan
- Read literature on your topic
- Find dataset
- Find variables
  - Are the variables appropriate measures of the concepts in your hypotheses?
- Propose hypotheses
- Decide the level of measurement of your variables
- Recode variables
- Conduct statistical tests
  - Decide what statistical tests you will do for each hypothesis / combination of variables
  - Examine each variable you intend to use (descriptive statistics)
- Interpret statistical tests
  - Have you proved or disproved each hypothesis?
- Conduct additional tests
  - Are there things you want to follow up or investigate further?
- Write up project
4 Logistic Regression: When and Why
- To predict an outcome variable that is a categorical dichotomy from one or more categorical or continuous predictor variables.
- Used because having a categorical dichotomy as an outcome variable violates the assumption of linearity in normal regression.
5 The Problem with Dichotomous Dependent Variables
- Dichotomous variables take two values, usually coded 1 and 0 - i.e. being unemployed or not, being pregnant or not, smoking or not.
- Linear regression will not work because there is no range of scores: everyone will be either 1 or 0.
- Therefore, instead of regressing the values themselves, we regress the probability of taking one of the two values (of being in the 1 category) - i.e. the probability of being unemployed, of being pregnant, of smoking.
- This gives us a range of values, as probabilities range from 0 to 1.
- However, this still presents a problem for linear regression, as we are likely sometimes to get an impossible probability (below 0 or above 1).
6 Going from a Probability to a Logit 1
- Probabilities (that Y = 1) range from 0 to 1.
- For example, suppose the probability of smoking is 0.48.
- The probability of an event not happening is 1 minus the probability of the event: 1 - p(Y = 1).
- The probability of not smoking would be 1 - 0.48 = 0.52.
- The odds of an event happening are the probability of it happening divided by the probability of it not happening.
- So the odds of smoking are 0.48 / 0.52 = 0.92.
- Odds range from 0 to infinity (∞).
- Probabilities greater than .5 produce odds between 1 and ∞.
- Probabilities less than .5 produce odds between 0 and 1.
- This means that we cannot get a number larger than is reasonable, but can still get one that is smaller than is reasonable (i.e. less than 0).
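The probability-to-odds arithmetic above can be checked directly. A minimal sketch (the 0.48 smoking probability comes from the example; the `odds` helper is just an illustrative name):

```python
def odds(p):
    """Odds of an event: the probability of it happening divided by
    the probability of it not happening."""
    return p / (1 - p)

p_smoke = 0.48                    # example probability of smoking
print(round(1 - p_smoke, 2))      # probability of not smoking: 0.52
print(round(odds(p_smoke), 2))    # odds of smoking: 0.92

# Probabilities above .5 give odds greater than 1;
# probabilities below .5 give odds between 0 and 1.
print(odds(0.75))                 # 3.0
print(round(odds(0.25), 2))       # 0.33
```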
7 Going from a Probability to a Logit 2
- In order to deal with this problem we take the natural logarithm of the odds that Y = 1. This is referred to as logit(Y). If we use ln to stand for natural log, the equation for logit(Y) is:
- logit(Y) = ln[ p(Y = 1) / (1 - p(Y = 1)) ]
- Note the natural logarithm expresses numbers to base e ≈ 2.72 (to an infinite number of decimal places). This means that the natural log of 2.72 is 1 and the natural log of 2.72² (about 7.39) is 2, etc.
- This transformation stretches the lower values of the odds that Y = 1 so that the linear equation does not predict impossibly low values. (As the odds decrease from 1 to 0 the logit value becomes negative and increasingly large in magnitude, going to -∞.)
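These properties of the natural log and the logit can be confirmed numerically (a minimal sketch using Python's `math` module; the `logit` function name is mine):

```python
import math

# Natural logs are to base e (about 2.72), so ln(e) = 1, ln(e^2) = 2, etc.
print(round(math.e, 2))                  # 2.72
print(round(math.log(math.e), 10))       # 1.0
print(round(math.log(math.e ** 2), 10))  # 2.0

def logit(p):
    """logit(Y) = ln( p(Y=1) / (1 - p(Y=1)) ): natural log of the odds."""
    return math.log(p / (1 - p))

# The transformation stretches small odds toward minus infinity:
print(round(logit(0.5), 4))    # 0.0  (odds of exactly 1)
print(round(logit(0.48), 2))   # -0.08
print(round(logit(0.01), 2))   # -4.6
```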
8 Relationship between Probability and Logit
There is a non-linear relationship between p and its logit. In the mid range of p there is a roughly linear relationship, but as p approaches the extremes of 0 or 1 the relationship becomes non-linear, with increasingly larger changes in logit for the same change in p. This means that instead of having a dependent variable that has a minimum of 0 and a maximum of 1, we have a dependent variable with a minimum of -∞ and a maximum of ∞. This means that it will not be possible to have a result that is beyond the range of possible values.
9 The Logistic Regression Equation
- The logistic regression equation can be arranged in a linear form (like a regression equation):
- ln[ Prob(event) / Prob(no event) ] = a + b1x1 + b2x2 + ... + bpxp
10 Converting Back to Probabilities
- Since the coefficients in logistic regression are not easily interpretable (unlike linear regression), we convert values of logit(Y) back to the more meaningful values of odds and probabilities.
- To obtain the odds that Y = 1 we unlog logit(Y). This is done by taking the anti-log (or exponent, written as e). The equation is:
- Odds(Y = 1) = e^(a + bX)
- To get back to the probability that Y = 1 we can reverse the calculation that turned the probability into odds:
- Probability that Y = 1 = e^(a + bX) / (1 + e^(a + bX))
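These two back-conversions can be sketched in a few lines (function names are illustrative; the round trip uses the 0.48 smoking probability from the earlier slides):

```python
import math

def odds_from_logit(logit_y):
    """Odds(Y=1) = e^(a + bX): take the anti-log (exponent) of logit(Y)."""
    return math.exp(logit_y)

def prob_from_logit(logit_y):
    """Probability that Y = 1: e^(a+bX) / (1 + e^(a+bX))."""
    return math.exp(logit_y) / (1 + math.exp(logit_y))

# Round trip: probability -> logit -> odds -> probability.
p = 0.48
logit_y = math.log(p / (1 - p))
print(round(odds_from_logit(logit_y), 2))  # 0.92, the odds of smoking
print(round(prob_from_logit(logit_y), 2))  # 0.48, the original probability
```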
11 Example
- If we look at the effects of stress on smoking, we have an equation for which we get the results:
- Logit(Y) = a + bx
- Logit(Y) = -0.8987 + 0.1638x
- If x (stress) is low it would be scored as 1. Therefore:
- Logit(Y) = -0.8987 + (1 × 0.1638) = -0.735
- Therefore the odds of smoking will be:
- Odds(smoker = 1) = e^(-0.735) = 0.48
- This can be interpreted as saying that respondents reporting very low stress are about half as likely to smoke as not to smoke.
- The probability that they smoke will be:
- Probability of smoking = odds of smoking ÷ (1 + odds of smoking) = 0.32, or 32%
12 Example (contd.)
- On the other hand, if the respondent reported a very high level of stress (i.e. 10), his or her estimated probability of smoking will be:
- Logit(smoker) = -0.8987 + (10 × 0.1638) = 0.739
- Odds(smoker = 1) = e^(0.739) = 2.09
- This indicates that the odds of being a smoker are just over twice as high as those of not being a smoker.
- And the probability that a highly stressed person will smoke is:
- Probability of smoking = odds of smoking ÷ (1 + odds of smoking) = 0.68, or 68%
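The whole worked example can be reproduced in a short script (a sketch; the intercept -0.8987 and coefficient 0.1638 are the values from the slides, and `predict_smoking` is an illustrative name):

```python
import math

a, b = -0.8987, 0.1638   # intercept and stress coefficient from the example

def predict_smoking(stress):
    """Return (logit, odds, probability) of smoking for a stress score."""
    logit_y = a + b * stress          # the linear equation
    odds = math.exp(logit_y)          # anti-log gives the odds
    prob = odds / (1 + odds)          # odds back to a probability
    return logit_y, odds, prob

for stress in (1, 10):               # very low and very high stress
    logit_y, odds, prob = predict_smoking(stress)
    print(f"stress={stress}: logit={logit_y:.3f}, odds={odds:.2f}, p={prob:.2f}")
# stress=1:  logit=-0.735, odds=0.48, p=0.32
# stress=10: logit=0.739,  odds=2.09, p=0.68
```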