Title: Logistic Regression and Discriminant Function Analysis
1Logistic Regression and Discriminant Function Analysis
2Logistic Regression vs. Discriminant Function Analysis
- Similarities
- Both predict group membership for each observation (classification)
- Dichotomous DV
- Require an estimation and validation sample to assess predictive accuracy
- If the split between groups is not more extreme than 80/20, the two methods yield similar results in practice
3Logistic Reg vs. Discrim Differences
- Discriminant Analysis
- Assumes MV normality
- Assumes equality of VCV matrices
- A large number of predictors violates MV normality → can't be accommodated
- Predictors must be continuous (interval level)
- More powerful when assumptions are met
- Many assumptions, rarely met in practice
- Categorical IVs create problems
- Logistic Regression
- No assumption of MV normality
- No assumption of equality of VCV matrices
- Can accommodate large numbers of predictors more easily
- Categorical predictors OK (e.g., dummy codes)
- Less powerful when assumptions are met
- Few assumptions, typically met in practice
- Categorical IVs can be dummy coded
4Logistic Regression
- Outline
- Categorical Outcomes: Why not OLS Regression?
- General Logistic Regression Model
- Maximum Likelihood Estimation
- Model Fit
- Simple Logistic Regression
5Categorical Outcomes: Why not OLS Regression?
- Dichotomous outcomes
- Passed / Failed
- CHD / No CHD
- Selected / Not Selected
- Quit / Did Not Quit
- Graduated / Did Not Graduate
6Categorical Outcomes: Why not OLS Regression?
- Example: Relationship b/w performance and turnover
- Line of best fit?!
- Errors (Y - Ŷ) across values of performance (X)?
7Problems with Dichotomous Outcomes/DVs
- The regression surface is intrinsically non-linear
- Errors assume one of two possible values, violating the assumption of normally distributed errors
- Violates the assumption of homoscedasticity
- Predicted values of Y greater than 1 and smaller than 0 can be obtained
- The true magnitude of the effects of IVs may be greatly underestimated
- Solution: Model the data using Logistic Regression, NOT OLS Regression (see the sketch below)
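A minimal Python sketch (not from the original slides) of the problems listed above: an OLS line fit to made-up dichotomous turnover data. All variable names and values are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
performance = np.linspace(1, 10, 50)                  # hypothetical predictor (X)
p_quit = 1 / (1 + np.exp(-(4 - 0.8 * performance)))   # true probability of quitting
quit_ = rng.binomial(1, p_quit)                       # dichotomous DV (1 = quit, 0 = stayed)

# OLS line of best fit: quit = a + b * performance
b, a = np.polyfit(performance, quit_, 1)
predicted = a + b * performance

# OLS predictions are unbounded, so they can fall below 0 or above 1,
# which is impossible for a probability; residuals at any given X can
# take only two values, violating normality and homoscedasticity.
print("Range of OLS predictions:", predicted.min(), predicted.max())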
8Logistic Regression vs. Regression
- Logistic regression predicts the probability that an event will occur
- Range of possible responses is between 0 and 1
- Must use an s-shaped curve to fit the data
- OLS regression assumes linear relationships and can't fit an s-shaped curve
- Violates the normality assumption
- Creates heteroscedasticity
9Example: Relationship b/w Age and CHD (1 = Has CHD)
10General Logistic Regression Model
- Y (outcome variable) is the probability of having one outcome or another, based on a nonlinear function of the best linear combination of predictors
- Where:
- Y = probability of an event
- The linear portion of the equation (a + b1x1) is used to predict the probability of the event (0, 1); it is not an end in itself
11The logistic (logit) transformation
- DV is dichotomous → purpose is to estimate the probability of occurrence (0, 1)
- Thus, the DV is transformed into a likelihood
- The logit/logistic transformation accomplishes this (the linear regression equation predicts the log of the odds)
12Probability Calculation
- logit(P) = ln[P / (1 - P)] = a + bX
- P = e^(a + bX) / (1 + e^(a + bX))
- Where:
- The relation b/w logit(P) and X is intrinsically linear
- b = expected change in logit(P) given a one-unit change in X
- a = intercept
- e = exponential (base of the natural logarithm)
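A minimal Python sketch (not part of the original slides) of the two formulas above; the intercept and slope values are made up for illustration.

import math

def logit(p):
    # Log odds: ln(P / (1 - P)); linear in X under the logistic model
    return math.log(p / (1 - p))

def probability(a, b, x):
    # P = e^(a + bX) / (1 + e^(a + bX)); always falls between 0 and 1
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))

a, b = -2.0, 0.5   # hypothetical intercept and slope
for x in (0, 4, 8):
    p = probability(a, b, x)
    print(f"x = {x}: P = {p:.3f}, logit(P) = {logit(p):.3f}")  # logit(P) equals a + b*x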
13Ordinary Least Squares (OLS) Estimation
- The purpose is to obtain the estimates that best minimize the sum of squared errors, Σ(y - ŷ)²
- The estimates chosen best describe the relationships among the observed variables (IVs and DV)
- The estimates chosen maximize the probability of obtaining the observed data (i.e., these are the population values most likely to produce the data at hand)
14Maximum Likelihood (ML) estimation
- OLS can't be used in logistic regression because of the non-linear nature of the relationships
- In ML, the purpose is to obtain the parameter estimates most likely to have produced the data
- ML estimates are those with the greatest joint likelihood of reproducing the data
- In logistic regression, each model yields an ML joint probability (likelihood) value
- Because this value tends to be very small (e.g., .00000015), its natural log is taken and multiplied by -2
- The -2 log transformation also yields a statistic with a known distribution (the chi-square distribution)
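A minimal Python sketch (hypothetical outcomes and predicted probabilities, not the training-program data) showing what the joint likelihood and the -2 log likelihood are.

import numpy as np

y = np.array([1, 0, 1, 1, 0])             # observed outcomes
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2])   # model-implied probabilities that y = 1

likelihood = np.prod(np.where(y == 1, p, 1 - p))  # joint probability of the observed data
minus_2LL = -2 * np.log(likelihood)               # same information on the -2 log scale

print(likelihood)   # becomes a very small number in real samples
print(minus_2LL)    # smaller -2LL = better fit; 0 = perfect fit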
15Model Fit
- In Logistic Regression, R and R² don't make sense
- Evaluate model fit using the -2 log likelihood (-2LL) value obtained for each model (through ML estimation)
- The -2LL value reflects the fit of a model and is used to compare the fit of nested models
- The -2LL measures lack of fit: the extent to which the model fits the data poorly
- When the model fits the data perfectly, -2LL = 0
- Ideally, the -2LL value for the null model (i.e., the model with no predictors, or intercept-only model) will be larger than that for the model with predictors
16Comparing Model Fit
- The fit of the null model can be tested against the fit of the model with predictors using a chi-square test
- χ² = (-2LLM0) - (-2LLM1)
- Where:
- χ² = chi-square for the improvement in model fit (with df = the difference in number of predictors between the two models)
- -2LLM0 = -2 log likelihood value for the null model (intercept-only model)
- -2LLM1 = -2 log likelihood value for the hypothesized model
- The same test can be used to compare a nested model with k predictor(s) to a model with k + 1 predictors, etc.
- Same logic as OLS regression, but the models are compared using a different fit index (-2LL)
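A minimal Python sketch of the nested-model chi-square test described above; the -2LL values and df are hypothetical, and scipy is used only to look up the p value.

from scipy.stats import chi2

minus_2LL_null = 69.2    # intercept-only model (hypothetical)
minus_2LL_model = 55.4   # hypothesized model (hypothetical)
df = 2                   # number of predictors added

chi_square = minus_2LL_null - minus_2LL_model   # improvement in model fit
p_value = chi2.sf(chi_square, df)
print(chi_square, p_value)  # a significant chi-square means the predictors improve fit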
17Pseudo R2
- Assessment of overall model fit
- Calculation (see the sketch below)
- Two primary pseudo R² statistics:
- Nagelkerke: less conservative; preferred by some because its maximum is 1
- Cox & Snell: more conservative
- Interpret like R² in OLS regression
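A minimal Python sketch (hypothetical -2LL values and sample size) of how the two pseudo R² statistics are computed from the -2LL values discussed earlier.

import math

def cox_snell(minus_2LL_null, minus_2LL_model, n):
    # 1 - exp(-(improvement in -2LL) / n); cannot reach 1, hence "more conservative"
    return 1 - math.exp(-(minus_2LL_null - minus_2LL_model) / n)

def nagelkerke(minus_2LL_null, minus_2LL_model, n):
    # Cox & Snell rescaled by its maximum possible value, so it can reach 1
    max_cs = 1 - math.exp(-minus_2LL_null / n)
    return cox_snell(minus_2LL_null, minus_2LL_model, n) / max_cs

print(cox_snell(69.2, 55.4, 50))    # hypothetical values
print(nagelkerke(69.2, 55.4, 50))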
18Unique Prediction
- In OLS regression, the significance tests for the beta weights indicate whether an IV is a unique predictor
- In logistic regression, the Wald test is used for the same purpose
19Similarities to Regression
- You can use all of the following procedures from OLS regression in logistic regression:
- Dummy coding for categorical IVs
- Hierarchical entry of variables (compare changes in classification, significance of the Wald test)
- Stepwise entry (but don't use it; it's atheoretical)
- Moderation tests
20Simple Logistic Regression Example
- Data collected from 50 employees
- Y = success in training program (1 = pass, 0 = fail)
- X1 = Job aptitude score (5 = very high, 1 = very low)
- X2 = Work-related experience (months)
21Syntax in SPSS
LOGISTIC REGRESSION PASS
  /METHOD = ENTER APT EXPER
  /SAVE = PRED PGROUP
  /CLASSPLOT
  /PRINT = GOODFIT
  /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
(PASS is the DV; APT and EXPER are the IVs)
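For readers working outside SPSS, a minimal Python sketch of the same kind of analysis using statsmodels on made-up data shaped like the example (50 employees, pass/fail, aptitude, experience); the coefficients and generated values are hypothetical, not the slide's data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
apt = rng.integers(1, 6, size=50)        # aptitude score, 1-5
exper = rng.integers(0, 61, size=50)     # experience in months
p = 1 / (1 + np.exp(-(-3 + 0.6 * apt + 0.03 * exper)))
passed = rng.binomial(1, p)              # 1 = pass, 0 = fail

X = sm.add_constant(np.column_stack([apt, exper]))
result = sm.Logit(passed, X).fit()
print(result.summary())                  # coefficients, Wald-type z tests, log likelihood
print("-2LL =", -2 * result.llf)         # the fit index discussed above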
22Results
- Block 0: the Null Model results
- Can't do any worse than this
- Block 1: Method = Enter
- Tests of the model of interest
- Interpret data from here
Tests whether the model is significantly better than the null model; a significant chi-square means yes!
Step, Block, and Model yield the same results because all IVs were entered in the same block.
23Results Continued
-2 Log Likelihood: an index of fit; a smaller number means better fit (perfect fit = 0). Pseudo R²: interpret like R² in regression. Nagelkerke is preferred by some because its maximum is 1; Cox & Snell is a uniformly more conservative estimate.
24Classification Null Model vs. Model Tested
Null Model: 52% correct classification
Model Tested: 72% correct classification
25Variables in Equation
B → the effect of a one-unit change in the IV on the log odds (hard to interpret). Odds Ratio (OR) → Exp(B) in SPSS; more interpretable: a one-unit change in aptitude multiplies the odds of passing by about 1.7. Wald → like a t test, but referred to the chi-square distribution. Significance → used to determine whether the Wald test is significant.
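A minimal Python sketch of how B, Exp(B), and the Wald test relate; the coefficient and standard error below are hypothetical stand-ins, not the SPSS output.

import math
from scipy.stats import chi2

B = 0.53    # hypothetical logistic coefficient for aptitude
SE = 0.21   # hypothetical standard error

odds_ratio = math.exp(B)       # Exp(B): the odds are multiplied by this per one-unit change
wald = (B / SE) ** 2           # Wald statistic, referred to a chi-square with df = 1
p_value = chi2.sf(wald, 1)

print(odds_ratio, wald, p_value)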
26Histogram of Predicted Probabilities
27To Flag Misclassified Cases
- SPSS syntax
- COMPUTE PRED_ERR = 0.
- IF LOW NE PGR_1 PRED_ERR = 1.
- You can use this for additional analyses to
explore causes of misclassification
28Results Continued
An index of model fit. The chi-square compares the data (the observed events) with the model (the predicted events). The non-significant result means that the observed and expected values are similar → this is good!
29Hierarchical Logistic Regression
- Question: Which of the following variables predict whether a woman is hired to be a Hooters girl?
- Age
- IQ
- Weight
30Simultaneous v. Hierarchical
Hierarchical entry:
Block 1. IQ: Cox & Snell = .002, Nagelkerke = .003
Block 2. Age: Cox & Snell = .264, Nagelkerke = .353
Block 3. Weight: Cox & Snell = .296, Nagelkerke = .395
Simultaneous entry:
Block 1. IQ, Age, Weight
31Simultaneous v. Hierarchical
Hierarchical entry: Block 1. IQ; Block 2. Age; Block 3. Weight
Simultaneous entry: Block 1. IQ, Age, Weight
32Simultaneous v. Hierarchical
Hierarchical entry: Block 1. IQ; Block 2. Age; Block 3. Weight
Simultaneous entry: Block 1. IQ, Age, Weight
33Multinomial Logistic Regression
- A form of logistic regression that allows prediction of the probability of membership in more than 2 groups
- Based on a multinomial distribution
- Sometimes called polytomous logistic regression
- Conducts an omnibus test first for each predictor across the groups (like ANOVA)
- Then conducts pairwise comparisons (like post hoc tests in ANOVA)
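A minimal Python sketch of multinomial prediction using scikit-learn on made-up data with three outcome groups; with more than two classes and the default solver, LogisticRegression fits a multinomial model. Variable names and data are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))        # two hypothetical predictors
y = rng.integers(0, 3, size=90)     # three groups (0, 1, 2)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:3]))   # predicted probability of membership in each of the 3 groups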
34Objectives of Discriminant Analysis
- Determining whether significant differences exist between the average scores on a set of variables for 2 (or more) a priori defined groups
- Determining which IVs account for most of the differences in the average score profiles of the groups
- Establishing procedures for classifying objects into groups based on scores on a set of IVs
- Establishing the number and composition of the dimensions of discrimination between groups formed from the set of IVs
35Discriminant Analysis
- Discriminant analysis develops a linear
combination that can best separate groups.
- Opposite of MANOVA
- In MANOVA, groups are usually constructed by the researcher and have a clear structure (e.g., a 2 x 2 factorial design). Groups = IVs
- In discriminant analysis, the groups usually have no particular structure and their formation is not under experimental control. Groups = DVs
36How Discrim Works
- Linear combinations (discriminant functions) are formed that maximize the ratio of between-groups variance to within-groups variance for a linear combination of predictors.
- Total discriminant functions = (number of groups - 1) OR number of predictors, whichever is smaller (see the sketch below)
- If more than one discriminant function is formed, subsequent discriminant functions are independent of prior combinations and account for as much remaining group variation as possible.
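A minimal Python sketch using scikit-learn's LinearDiscriminantAnalysis on made-up data: with 3 groups and 3 predictors, min(groups - 1, number of predictors) = 2 discriminant functions are available. Data and variable names are hypothetical.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))       # three hypothetical predictors
y = rng.integers(0, 3, size=150)    # three a priori groups

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
scores = lda.transform(X)           # discriminant function scores, one column per function
print(scores.shape)                 # (150, 2): two functions, as expected
print(lda.scalings_[:, :2])         # raw discriminant weights for the two functions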
37Assumptions in Discrim
- Multivariate normality of IVs
- Violation is more problematic if there is overlap between groups
- Homogeneity of VCV matrices
- Linear relationships
- IVs continuous (interval scale)
- Can accommodate nominal IVs, but this violates MV normality
- Single categorical DV
- Results influenced by:
- Outliers (classification may be wrong)
- Multicollinearity (interpretation of coefficients is difficult)
38Sample Size Considerations
- Observations : Predictors
- Suggested: 20 observations per predictor
- Minimum required: 5 observations per predictor
- Observations : Groups (in the DV)
- Minimum: the smallest group size must exceed the number of IVs
- Practical guide: each group should have at least 20 observations
- Wide variation in group size impacts results (i.e., classification may be incorrect)
39Example
In this hypothetical example, data from 500
graduate students seeking jobs were examined.
Available for each student were three predictors:
GRE (VQ), Years to Finish the Degree, and Number
of Publications. The outcome measure was
categorical: Got a Job versus Did Not Get a Job.
Half of the sample was used to determine
the best linear combination for discriminating
the job categories. The second half of the sample
was used for cross-validation.
40DISCRIMINANT
  /GROUPS = job(1 2)
  /VARIABLES = gre pubs years
  /SELECT = sample(1)
  /ANALYSIS ALL
  /SAVE = CLASS SCORES PROBS
  /PRIORS SIZE
  /STATISTICS = MEAN STDDEV UNIVF BOXM COEFF RAW CORR COV GCOV TCOV TABLE CROSSVALID
  /PLOT = COMBINED SEPARATE MAP
  /PLOT = CASES
  /CLASSIFY = NONMISSING POOLED .
41(No Transcript)
42(No Transcript)
43(No Transcript)
44(No Transcript)
45(No Transcript)
46(No Transcript)
47Interpreting Output
- Box's M
- Eigenvalues
- Wilks' Lambda
- Discriminant Weights
- Discriminant Loadings
48(No Transcript)
49Violates Assumption of Homogeneity of VCV
matrices. But this test is sensitive in general
and sensitive to violations of multivariate
normality too. Tests of significance in
discriminant analysis are robust to moderate
violations of the homogeneity assumption.
50(No Transcript)
51Discriminant Weights
Data from both these outputs indicate that one of
the predictors best discriminates who did/did not
get a job. Which one is it?
Discriminant Loadings
52This is the raw canonical discriminant function.
The means for the groups on the raw canonical
discriminant function can be used to establish
cut-off points for classification.
53Classification can be based on distance from the
group centroids and take into account information
about prior probability of group membership.
54(No Transcript)
55Two modes?
56(No Transcript)
57Violation of the homogeneity assumption can
affect the classification. To check, the analysis
can be conducted using separate group covariance
matrices.
58No noticeable change in the accuracy of
classification.
59Discriminant Analysis Three Groups
The group that did not get a job was actually
composed of two subgroups: those that got
interviews but did not land a job and those that
were never interviewed. This accounts for the
bimodality in the discriminant function scores.
The discriminant analysis of the three groups
allows for the derivation of one more
discriminant function, perhaps indicating the
characteristics that separate those who get
interviews from those who don't, or those who
have successful interviews from those whose
interviews do not produce a job offer.
60Remember this?
Two modes?
61(No Transcript)
62(No Transcript)
63DISCRIMINANT
  /GROUPS = group(1 3)
  /VARIABLES = gre pubs years
  /SELECT = sample(1)
  /ANALYSIS ALL
  /SAVE = CLASS SCORES PROBS
  /PRIORS SIZE
  /STATISTICS = MEAN STDDEV UNIVF BOXM COEFF RAW CORR COV GCOV TCOV TABLE CROSSVALID
  /PLOT = COMBINED SEPARATE MAP
  /PLOT = CASES
  /CLASSIFY = NONMISSING POOLED .
64(No Transcript)
65Separating the three groups produces better
homogeneity of VCV matrices. Still significant,
but just barely. Not enough to worry about.
66Two significant linear combinations can be
derived, but they are not of equal importance.
67Weights
What do the linear combinations mean now?
Loadings
68(No Transcript)
69(No Transcript)
70Loadings
Weights
71This figure shows that discriminant function 1,
which is made up of number of publications and
years to finish, reliably differentiates between
those who got jobs, had interviews only, and had
no job or interview. Specifically, a high value on
DF1 was associated with not getting a job,
suggesting that having few publications (loading =
-.466) and taking a long time to finish (loading =
.401) was associated with not getting a job.
72(No Transcript)
73(No Transcript)
74 Territorial Map (figure): Canonical Discriminant Function 1 (horizontal, -6.0 to 6.0) plotted against Canonical Discriminant Function 2 (vertical, -6.0 to 6.0), showing the classification regions for the three groups. Symbols used in the territorial map: 1 = Unemployed, 2 = Got a Job, 3 = Interview Only; asterisks indicate group centroids.
75(No Transcript)
76Classification
A classification function is derived for each
group. The original data are used to estimate a
classification score for each person, for each
group. The person is then assigned to the group
that produces the largest classification score.
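A minimal Python sketch of the assignment rule just described: compute a classification score per group and assign the person to the group with the largest score. The coefficient values below are hypothetical stand-ins for the classification functions SPSS derives.

import numpy as np

# Rows = groups (Unemployed, Got a Job, Interview Only)
# Columns = constant, GRE, publications, years to finish (hypothetical coefficients)
coeffs = np.array([
    [-12.0, 0.010, 0.5, 1.2],
    [-18.0, 0.012, 1.4, 0.8],
    [-15.0, 0.011, 0.9, 1.0],
])

person = np.array([1.0, 1200.0, 3.0, 5.0])   # constant, GRE, pubs, years for one person
scores = coeffs @ person                      # one classification score per group
print(scores, "-> assigned to group", scores.argmax() + 1)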
77(No Transcript)
78Is the classification better than would be
expected by chance? Observed values
79Expected classification by chance: E = (Row total x Column total) / Total N
80Correct classification that would occur by chance
81The difference between chance-expected and actual
classification can be tested with a chi-square as
well.
Cell contributions to the chi-square:
145.13   13.82   23.47
 14.48   59.25    8.77
 25.50   11.28   29.34
Chi-square = 331.04
where df = (number of groups - 1)² = 4
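A minimal Python sketch of the chance-expected classification and the chi-square comparison described above; the confusion matrix below is made up, not the slide's observed counts.

import numpy as np

observed = np.array([     # rows = actual group, columns = predicted group (hypothetical)
    [60, 10, 12],
    [8,  70, 15],
    [14, 9,  52],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

expected = row_totals @ col_totals / n           # E = (row total x column total) / total N
contributions = (observed - expected) ** 2 / expected
chi_square = contributions.sum()
df = (observed.shape[0] - 1) ** 2                # (number of groups - 1)^2

print(np.round(contributions, 2))                # per-cell contributions
print(chi_square, "df =", df)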