Title: Clustering and Grouping Analysis
1. Clustering and Grouping Analysis
Victor M. H. Borden, Ph.D.
Associate Vice Chancellor, Information Management and Institutional Research
Associate Professor of Psychology
Indiana University Purdue University Indianapolis
vborden@iupui.edu
- Methods for Identifying Groups and Determining How Groups Differ
2. Purpose
- To provide a working knowledge of:
- Logistic Regression
- Discriminant Analysis
- Cluster Analysis
- To demonstrate their application to common institutional research issues:
- Student market segmentation
- Student retention
- Faculty workload
3. Learning Objectives
- Understand the fundamental concepts of logistic regression, cluster analysis, and discriminant analysis
- Determine when to use appropriate variations of each technique
- Understand the data requirements for performing these analyses
- Use SPSS software to perform basic logistic, cluster, and discriminant analyses
4. Learning Objectives
- Know how to interpret the tabular and graphical outputs of these procedures to evaluate the validity and reliability of various solutions
- Prepare reports on the results of these analyses for professional or lay audiences
- Understand the relationship between these and related statistical methods
5. Workshop Prerequisites
- Basic Statistics
- General Linear Models
- Statistical Software
- Institutional Research
6. Workshop Method
- Introduction to basics using an IR example
- On-your-own exercises using IR datasets
- Discussion of methods and issues as you experience them
7. Workshop Schedule: Day 1
- Introduction and overview (15 min)
- Logistic regression
- Basic concepts with example (45 min)
- On-your-own examples (30 min)
- Break (30 min)
- Discriminant analysis
- Basic concepts with example (30 min)
- On-your-own examples (30 min)
- Logistic regression vs. discriminant analysis
8. Workshop Schedule: Day 2
- Cluster Analysis
- Basic concepts with example (45 min)
- Example: peer institution identification (30 min)
- Break (30 min)
- Decision Tree Techniques
- Basic concepts with example (30 min)
- Free play (45 min)
9. Overview
- Analyzing differences among existing groups
- Extending the regression model to look at a (dichotomous) group outcome variable
- Logistic regression
- Discriminant analysis
- Identifying groups out of whole cloth
- Cluster analysis
- Focus on the proximity aspect
- Decision trees as a hybrid model
- CHAID
10. Workshop Datasets
11. Existing Group Differences
- The outcome of interest is membership in a group
- Retained vs. non-returning students
- Admits who matriculate vs. those who don't
- Alums who donate vs. those who don't
- Faculty who get grants vs. those who don't
- Institutions that get sanctioned for assessment on their accreditation visits vs. those that don't
- Class sections that meet during the day vs. evening
12. Three Basic Questions
- Which, if any, of the variables are useful for predicting group membership?
- What is the best combination of variables to optimize predictions among the original sample?
- How useful is that combination for classifying new cases?
13. One or More Groups
- Group outcomes can be dichotomous or polychotomous
- Logistic regression and discriminant analysis can handle both
- We will focus on the dichotomous case, with only lip service to the polychotomous situation
14. Examining Group Differences
- Why not a t-test or ANOVA?
- The group factor is the dependent variable (outcome), not the independent variable (predictor, causal agent, etc.)
- No random assignment to group
- But we always violate that assumption
- Requires a normal distribution of the outcome
15. The Linear Regression Problem
- Group membership as the outcome (dependent) variable violates an important assumption that has serious consequences
- Under certain conditions the problems are not completely debilitating
- Group membership evenly distributed
- Predictors are all solid continuous/normal variables
16. Two Regression-Based Solutions
- Logistic regression
- Transforms the outcome into a continuous odds ratio
- Readily accommodates continuous and categorical predictors
- Interpretations differ substantially from the OLS linear form
- Includes a classification matrix
- Discriminant analysis
- Uses standard OLS procedures
- Requires continuous/normal predictors
- Interpretations similar to OLS linear regression
- Includes a classification matrix
17. Remember the OLS Linear Form
- Finding the linear equation that best fits the pattern
18. OLS Linear Form
- Overall fit of model
- Significance of model
- Predictor (b) coefficients
19. The Group Outcome Problem
- Y equals either 0 or 1
- Predictions can be > 1 or < 0
- Coefficients may be biased
- Heteroscedasticity is present
- The error term is not normally distributed, so hypothesis tests are invalid
20. The Logistic Regression Solution
- Use a different method for estimating parameters
- Maximum Likelihood Estimation (MLE) instead of OLS
- Maximizes the probability that predicted Y equals observed Y
- Transforms the outcome variable into a form that is continuous/normal
- The natural log of the odds ratio, or logit
- Ln(P/(1-P)) = a + b1x1 + b2x2
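A minimal sketch of this model in code. The workshop uses SPSS; the Python/statsmodels version below is only an illustration, and the data and variable names (hsgpa, fulltime, retained) are made up.

```python
# Hypothetical retention data: the logit model Ln(P/(1-P)) = a + b1x1 + b2x2,
# fit by maximum likelihood rather than OLS.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "hsgpa": rng.normal(3.0, 0.5, n),    # continuous predictor
    "fulltime": rng.integers(0, 2, n),   # dummy (categorical) predictor
})
true_logit = -2.0 + 0.8 * df["hsgpa"] + 0.7 * df["fulltime"]
df["retained"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(df[["hsgpa", "fulltime"]])
fit = sm.Logit(df["retained"], X).fit()  # MLE estimation
print(fit.summary())                     # B, S.E., z, p for each predictor
print(np.exp(fit.params))                # Exp(B): odds multipliers
```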
21. The Odds Ratio
- The probability of being in one group relative to the probability of being in the other group
- If P(group 1) = .5, the odds ratio is 1 (.5/.5)
- If the retention rate is 80%, the odds ratio is 0.8/0.2 = 4 (odds of 4 to 1 of being retained)
- If the yield rate is 67%, the odds ratio is 0.67/0.33 = 2, or 2 to 1
- If 12.5% of alums donate, the odds ratio is 0.125/0.875 = .143, or 1 to 7
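A quick check of this arithmetic in plain Python (no workshop data involved):

```python
def odds(p: float) -> float:
    """Odds for an event with probability p: P / (1 - P)."""
    return p / (1 - p)

print(odds(0.5))    # 1.0    -> even odds
print(odds(0.8))    # 4.0    -> 4 to 1 (80% retention)
print(odds(0.67))   # ~2.03  -> about 2 to 1 (67% yield)
print(odds(0.125))  # ~0.143 -> about 1 to 7 (12.5% donors)
```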
22. Predictors and Coefficients
- Predictors can be continuous or categorical (dummy) variables
- A coefficient shows the change in Ln(P/(1-P)) for a unit change in the predictor
- Can be converted into marginal effects: the effect on the probability that Y = 1 for a unit change in X
- Not easy to explain, but:
- Can talk in general terms (positive, negative, zero)
- Have classification statistics that are more intuitive
23. Logistic Regression in SPSS
24. Retention Example Output
- Omnibus tests
- Model summary
- Classification table
25. Retention Example Output
26. Interpreting Logistic Regression Output
- Omnibus tests
- Overall significance of the model
- Relative performance of one model vs. another
- Model summary
- Goodness-of-fit R2 statistics
- Several versions, none of which are true R2 values
- Classification table
- Ability to successfully classify from the prediction
- Remember: the prediction is a probability that then has to be categorized
27. Interpreting Coefficients
- B value is the change in ln(odds ratio) for a unit change in the predictor
- S.E. is the error in the predictor estimate
- Relates to significance and estimation
- The Wald statistic is like the t-value in OLS linear regression
- Has a corresponding significance level
- Can be incorrect for large coefficients
- Exp(B) is the odds multiplier
- The factor by which the odds that Y = 1 change for a unit change in X
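A small worked example of reading Exp(B), with a made-up coefficient, showing that it multiplies the odds rather than the probability:

```python
import math

B = 0.693                   # hypothetical logistic coefficient
exp_b = math.exp(B)         # Exp(B) ~ 2.0: a unit change doubles the odds

# The probability change depends on the baseline; e.g., starting from p = 0.5:
base_odds = 0.5 / (1 - 0.5)        # 1.0
new_odds = base_odds * exp_b       # ~2.0
new_p = new_odds / (1 + new_odds)  # ~0.667 (not 2 * 0.5 = 1.0)
print(round(exp_b, 2), round(new_p, 3))
```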
28. Interpreting Coefficients
- A unit change in SATACT increases the odds that the student will be retained by a factor of 1, that is, not at all
- A full-time student is twice as likely to be retained as a part-time student
- A unit (full letter grade) change in GPA increases the odds that a student will be retained by more than a factor of 2
29. On Your Own
- Try different variables or entry methods on the retention data
- Predict admissions yield status with the application data set
- Predict full- vs. part-time faculty status with the faculty data set
- Distinguish between two Carnegie categories on the institutional data set
- Don't forget to select only two groups
30. Questions and Answers
31. Discriminant Analysis
- Closer to linear regression in form and interpretation
- Predictors must be continuous or dichotomous
- Logistic regression, by contrast, can handle polychotomous categorical variables
- Can be used for a multi-group outcome
- Generates k-1 orthogonal solutions for k groups
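A minimal sketch of a two-group discriminant analysis. The workshop uses the SPSS DISCRIMINANT procedure; scikit-learn below is only a stand-in, and the data are simulated.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Simulated groups (0 = not retained, 1 = retained); columns: GPA, SAT-like score
X = np.vstack([rng.normal([2.6, 450], [0.5, 50], (200, 2)),
               rng.normal([3.1, 520], [0.5, 50], (300, 2))])
y = np.array([0] * 200 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)           # discriminant coefficients (sign = direction of effect)
print(lda.score(X, y))     # overall classification accuracy
print(lda.predict(X[:5]))  # predicted group membership for the first five cases
```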
32. Requirements for Discriminant Analysis
- Two or more groups
- At least two cases per group
- Any number of discriminating variables, provided that it is less than the total number of cases minus 2
- Discriminating variables are on an interval or ratio scale
33. Requirements for Discriminant Analysis
- No discriminating variable can be a linear combination of any other discriminating variable
- The covariance matrices for each group must be approximately equal, unless special formulas are used
- Each group must have been drawn from a population with a multivariate normal distribution on the discriminating variables
34. Interpreting Group Differences
- Which variables best predict differences?
- What is the best combination of predictors? The canonical discriminant function
35. The Discriminant Function
- Derivation
- As in regression, maximize the between-groups sum of squares relative to the within-groups sum of squares for the value D
- Interpretation
- Overall function statistics
- Predictor variable statistics
36. Retention Example
- Overall model statistics
- Eigenvalue / canonical correlation
- Wilks' Lambda
- Wilks' Lambda = 1 - (canonical correlation)^2
37. Predictor Coefficients
- Standardized discriminant coefficients
- Variables with the largest (absolute) coefficients contribute most to the prediction of group membership
- The sign is the direction of the effect
38. Retention Coefficients
- Structure coefficients
- Correlation between each predictor and the overall discriminant function
39. Classification in Discriminant
- Prior probabilities
- Can be .5 or set by the size of the group
40. Classification in Discriminant
- Accuracy within each group as well as overall
41. The Classification Matrix
- Comparing actual group membership against predicted group membership, using the classification function
- Can have an "unknown" region
- Split samples can (should?) be used to further test the accuracy of classification
42. The Classification Matrix
- In the 2x2 table, let a = true positives, b = false positives, c = false negatives, d = true negatives
- Measures of interest include
- Overall prediction accuracy
- (a+d)/N
- Sensitivity: accuracy among positive cases
- a/(a+c)
- Specificity: accuracy among negative cases
- d/(b+d)
- False positive rate
- b/(b+d)
- False negative rate
- c/(a+c)
43. The Confusion Matrix
- But that's not all
- Prevalence: (a+c)/N
- Overall diagnostic power: (b+d)/N
- Positive predictive power: a/(a+b)
- Negative predictive power: d/(c+d)
- Misclassification rate: (b+c)/N
- Odds ratio: (ad)/(bc)
- Kappa: ((a+d) - (((a+c)(a+b) + (b+d)(c+d))/N)) / (N - (((a+c)(a+b) + (b+d)(c+d))/N))
- NMI: 1 - (-a·ln(a) - b·ln(b) - c·ln(c) - d·ln(d) + (a+b)·ln(a+b) + (c+d)·ln(c+d)) / (N·ln(N) - ((a+c)·ln(a+c) + (b+d)·ln(b+d)))
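A sketch of these measures gathered into one helper function (plain Python; the counts in the example call are invented, and the NMI line assumes all four cells are non-zero):

```python
import math

def classification_measures(a: int, b: int, c: int, d: int) -> dict:
    """2x2 measures; a = TP, b = FP, c = FN, d = TN."""
    n = a + b + c + d
    chance = ((a + c) * (a + b) + (b + d) * (c + d)) / n  # expected agreement
    h = lambda *xs: sum(x * math.log(x) for x in xs)      # sum of x * ln(x) terms
    return {
        "accuracy": (a + d) / n,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "prevalence": (a + c) / n,
        "overall_diagnostic_power": (b + d) / n,
        "positive_predictive_power": a / (a + b),
        "negative_predictive_power": d / (c + d),
        "misclassification_rate": (b + c) / n,
        "odds_ratio": (a * d) / (b * c),
        "kappa": ((a + d) - chance) / (n - chance),
        "nmi": 1 - (-h(a, b, c, d) + h(a + b, c + d))
                   / (n * math.log(n) - h(a + c, b + d)),
    }

print(classification_measures(a=120, b=30, c=20, d=80))
```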
44. Adjusting for Prior Probabilities or the "Costs" of Misclassification
- The methods so far have considered each group equally
- One can take into account known differences in group composition
- This usually takes one of two forms
- Prior information regarding the likely distribution of group sizes
- Known higher "costs" of misclassifying objects into one group compared to the other
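A brief sketch of how priors shift classification, again with scikit-learn standing in for SPSS and the same style of simulated data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.6, 450], [0.5, 50], (200, 2)),
               rng.normal([3.1, 520], [0.5, 50], (300, 2))])
y = np.array([0] * 200 + [1] * 300)

lda_size = LinearDiscriminantAnalysis().fit(X, y)  # priors from group sizes (default)
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)  # equal priors

# Borderline cases can flip groups when the priors change.
print((lda_size.predict(X) != lda_equal.predict(X)).sum(), "cases reclassified")
```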
45. .5 vs. Group Size Cutoffs: Discriminant
46. .5 vs. Group Size Cutoffs: Logistic
47. Logistic vs. Discriminant Classification
- Classification measure calculator for 2x2 tables
- http://members.aol.com/johnp71/ctab2x2.html
48. On Your Own
- Rerun your logistic regressions as discriminant analyses
- Play with different cutoff conditions
- .5 vs. predicted from group size for discriminant
- Set your own value for logistic regression
49. Questions and Answers
50. Logistic vs. Discriminant
- Logistic
- Accommodates categorical predictors with > 2 groups
- Fewer assumptions
- More robust
- Easier to use?
- Discriminant
- Easier to interpret?
- More classification features
- Can accommodate costs of misclassification
51. Reporting Results
- Logistic regression coefficients (and their anti-logs) are difficult to convey graphically
- Positive impacts (values above 1) range considerably
- Negative impacts (values below 1) have a limited range
- Delta-P is an alternative
- The change in the probability of the outcome given a unit change in the predictor
52. Reporting Results
- The classification table and some of the related measures are usually the most effective way to convey the usefulness of results
- As with all higher-level analyses, the most important point is to interpret in the context of real decisions
- E.g., the impact of changing a selection index cutoff in terms of entering class size and predicted change in retention rate
53. Some Reasonable Examples
- Smith and Nielsen, Longwood College
- http://www.longwood.edu/assessment/Retention_Analysis.htm
- DePauw University
- http://www.depauw.edu/admin/i_research/research/year1_02supp.pdf
54. Good Night!
- Read Chapter 5 of the RIR Stats Volume
- To reinforce lessons for today and tomorrow
- If you are having trouble falling asleep
55. Cluster Analysis
- Any of a wide variety of numerical procedures that can be used to create a classification scheme
- Conceptually easy to understand and well suited to segmentation studies
- It is a heuristic algorithm, not supported by extensive statistical reasoning
- It is entirely data driven
- Sometimes yields inconsistent results
56. Cluster Analysis
- Creating groups out of whole cloth
- Drawing circles around points scattered in n-dimensional space
57. What Is a Cluster?
- A set of objects, or points, that are relatively close to each other and relatively far from points in other clusters
- This view tends to favor spherical clusters over ones of other shapes
58. Steps to Cluster Analysis
- Selecting variables
- Selecting a similarity or distance measure
- Choosing a clustering algorithm
59. Selecting Variables
- The most popular forms are based on measures of "similarity" according to some combination of attributes
- The choice of variables is one of the most critical steps
- Should be guided by an explicit theory, or at least solid reasoning
- Higher education researchers typically have ready access to certain types of student characteristics
60. Choosing a Similarity Measure
- Distance measures: spatial relationship
- Association measures: similarities or dissimilarities, using measures of association (e.g., correlation, contingency tables)
- The type of variable constrains the choice
- Nominal variables require either association coefficients or a decision-tree technique
- Continuous variables lend themselves to distance-type measures
61. Distance-Type Measures
- Several are cases of what is called the Minkowski metric
- Euclidean distance (r = 2; the straight-line distance between two points in n-dimensional space)
- City-block metric (r = 1; the sum of absolute differences along each measure)
62. Distance-Type Measures
- Another common distance measure is Mahalanobis D^2, which takes into account correlations among the predictors
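A small sketch of these distances using SciPy, with made-up two-variable points:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, mahalanobis

x = np.array([3.2, 550.0])
y = np.array([2.8, 480.0])

print(euclidean(x, y))   # Minkowski metric with r = 2
print(cityblock(x, y))   # Minkowski metric with r = 1

# Mahalanobis distance needs the inverse covariance matrix of the parent data.
data = np.random.default_rng(2).normal([3.0, 500.0], [0.4, 60.0], (100, 2))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(x, y, VI))
```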
63. Standardized vs. Unstandardized Measures
- One must be careful about the implications of using standardized vs. unstandardized measures in computing these distances
- Variables on larger scales (e.g., SAT scores vs. GPA) will dominate an unstandardized distance
64. Matching-Type Measures
- Association coefficients
- The only game in town when the predictors are nominally scaled
- The predictor variables are usually converted to binary indicators
- Similarity coefficients are a form of matching-type measure based on a series of binary variables that represent the presence or absence of a trait
65. Contingency-Table-Based Similarity Coefficients
- Possible coefficients differ according to
- How negative matches (0,0) are incorporated
- Whether matched pairs are equally weighted or doubled
- Whether unmatched pairs carry twice the weight of matched pairs
- Whether negative matches are excluded altogether
66. Contingency-Table Measures
- For two binary profiles, let a = 1-1 matches, b and c = mismatches, d = 0-0 matches
- (a+d)/(a+b+c+d): matching coefficient
- a/(a+b+c+d): Russell/Rao index
- a/(a+b+c): Jaccard coefficient
- 2a/(2a+b+c): Dice's coefficient
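These four coefficients in a short helper function (plain Python; the two example profiles are invented):

```python
def binary_similarities(u: list[int], v: list[int]) -> dict:
    """Similarity coefficients for two equal-length binary profiles."""
    a = sum(x == 1 and y == 1 for x, y in zip(u, v))  # 1-1 matches
    b = sum(x == 1 and y == 0 for x, y in zip(u, v))  # mismatches
    c = sum(x == 0 and y == 1 for x, y in zip(u, v))  # mismatches
    d = sum(x == 0 and y == 0 for x, y in zip(u, v))  # 0-0 matches
    n = a + b + c + d
    return {
        "matching": (a + d) / n,
        "russell_rao": a / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
    }

print(binary_similarities([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```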
67. Correlation Coefficients
- Pearson r, Spearman r, etc.
- Correlation is computed across variables and between each pair of objects
(Figure: across-variable correlation vs. standard across-person correlation)
68. The Distance Matrix
- Regardless of method, the first step in cluster analysis is to produce a distance matrix
- A row and column for each object
- Cells represent the distance or similarity measure between each pair
- Symmetric, with a diagonal of 0's for distance matrices or 1's for similarity measures
- This is what makes cluster analyses like these so computationally intensive
69. Choosing a Clustering Algorithm
- Hierarchical algorithms
- Agglomerative methods start with each object in its own cluster and then merge points and clusters until some criterion is reached
- Single linkage (nearest neighbor)
- Complete linkage (furthest neighbor)
- Average linkage
- Ward's error sum of squares
70. Choosing a Clustering Algorithm
- Hierarchical algorithms (continued)
- Divisive methods start with the whole as one group and partition objects into smaller clusters until some criterion is reached
- Splinter-average distance
- Decision tree methods
- Partitioning algorithms
- K-means clustering
- Trace-based methods
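A sketch of both algorithm families on made-up standardized data (SciPy and scikit-learn standing in for the SPSS procedures):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(50, 4))  # 50 objects, 4 z-scored variables

# Agglomerative: method can be "single", "complete", "average", or "ward",
# matching the linkage options listed above.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree at 4 clusters

# Partitioning: k-means with k = 4.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(hier_labels[:10], km_labels[:10])
```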
71. Peer Institution Example
- Variables derived from IPEDS
72. Create a Proximity Matrix
- Screen institutions to a manageable number (< 300)
- Select Classify > Hierarchical Cluster
- Place predictors in the Variables box
- Under Method, choose Z scores in the standardize box (by variable)
- Paste the syntax
- Erase the Cluster procedure
- Change the proximity matrix file name so you can find it
- Run it
73. Using the Proximity Matrix
- Find the target institution (sort by name)
- Identify its varname and find the target column
- Get rid of excess columns
- Sort (ascending) by the varname column
- VOILA! Institutions are now sorted by similarity to the target
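A rough Python equivalent of this SPSS workflow, with hypothetical IPEDS-style columns and institution names:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# One row per institution; the columns and values are invented for illustration.
df = pd.DataFrame(
    {"enrollment": [30000, 8000, 22000, 29000],
     "pct_parttime": [35.0, 10.0, 28.0, 33.0],
     "research_exp_musd": [250.0, 20.0, 180.0, 240.0]},
    index=["IUPUI", "Small College", "State U", "Urban U"])

z = (df - df.mean()) / df.std()            # standardize by variable (z scores)
dist = pd.DataFrame(squareform(pdist(z)),  # Euclidean proximity matrix
                    index=df.index, columns=df.index)

# Sort all institutions by distance to the target: smallest = most similar.
print(dist["IUPUI"].sort_values())
```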
74. Graphical Clustering Methods
- Glyphs, metroglyphs, Fourier series, and Chernoff faces
75. Decision Trees
- A hybrid between clustering and discriminant analysis
- The criterion variable does not define the groups
- But the groups are defined so as to maximize differences according to the criterion
- The purpose is to identify key variables for distinguishing among groups and to formulate group membership prediction rules
76. Functions of Decision Trees
- Derive decision rules from data
- Develop a classification system to predict future observations
- Illustrate these through a decision tree
- Discretize continuous variables
77. SPSS AnswerTree
- Three decision tree algorithms
- All use "brute force" methods
78. Common Features
- Merging categories of the predictor variables so that non-significantly different values are pooled together
- Splitting the variables at points that maximize differences
- Stopping the branching when further splits do not contribute significantly
- Pruning branches from an existing tree
- Validation and error estimation
79. CHAID
- Not just binary splits
- Handles nominal, ordinal, and continuous variables
- Useful for discretizing continuous variables
- Demo and sample output included within the session support materials
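CHAID itself is not in the common Python libraries, so as a rough stand-in the sketch below grows a CART tree (binary splits, where CHAID can split multiway) on invented retention-style data and prints its decision rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(2.9, 0.5, 400),   # semester GPA (hypothetical)
                     rng.integers(0, 2, 400)])    # full-time flag (hypothetical)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 3.3).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=["gpa", "fulltime"]))  # readable split rules
```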
81. Playing with CHAID, etc.
- Work with the retention dataset
- Use retention status as the criterion
- Use semester GPA as the criterion
- Try the newer institutional dataset
- Use graduation rate as the criterion
- Try a nearest neighbor analysis with the newer data
82. Final Thoughts
- The flexibility of logistic regression models makes them the coin of the realm
- E.g., multinomial logistic and HLM regression
- Cluster analysis is so data driven as to make its use fairly limited
- The threshold approach to peer identification is much more popular, but it's always good to run things multiple ways (see the IR Primer peer chapter)
- CHAID is fun to play with and informative