Title: Clustering and Grouping Analysis
1. Clustering and Grouping Analysis
Victor M. H. Borden, Ph.D.
Associate Vice Chancellor, Information Management and Institutional Research
Associate Professor of Psychology
Indiana University Purdue University Indianapolis
vborden@iupui.edu
- Methods for Identifying Groups and Determining How Groups Differ
2. Purpose
- To provide a working knowledge of:
- Logistic Regression
- Discriminant Analysis
- Cluster Analysis
- To demonstrate their application to common institutional research issues:
- Student market segmentation
- Student retention
- Faculty workload
3. Learning Objectives
- Understand the fundamental concepts of logistic regression, cluster analysis, and discriminant analysis
- Determine when to use appropriate variations of each technique
- Understand the data requirements for performing these analyses
- Use SPSS software to perform basic logistic, cluster, and discriminant analyses
4. Learning Objectives
- Know how to interpret the tabular and graphical outputs of these procedures to evaluate the validity and reliability of various solutions
- Prepare reports on the results of these analyses for professional or lay audiences
- Understand the relationship between these and related statistical methods
5. Workshop Prerequisites
- Basic Statistics
- General Linear Models
- Statistical Software
- Institutional Research
6. Workshop Method
- Introduction to basics using an IR example
- On-your-own exercises using IR datasets
- Discussion of methods and issues as you experience them
7. Workshop Schedule: Day 1
- Introduction and overview (15 min)
- Logistic regression
- Basic concepts with example (45 min)
- On-your-own examples (30 min)
- Break (30 min)
- Discriminant analysis
- Basic concepts with example (30 min)
- On-your-own examples (30 min)
- Logistic regression vs. discriminant analysis
8. Workshop Schedule: Day 2
- Cluster Analysis
- Basic concepts with example (45 min)
- Example: peer institution identification (30 min)
- Break (30 min)
- Decision Tree Techniques
- Basic concepts with example (30 min)
- Free play (45 min)
9. Overview
- Analyzing differences among existing groups
- Extending the regression model to look at a (dichotomous) group outcome variable
- Logistic regression
- Discriminant analysis
- Identifying groups out of whole cloth
- Cluster analysis
- Focus on the proximity aspect
- Decision trees as a hybrid model
- CHAID
10. Workshop Datasets
11. Existing Group Differences
- The outcome of interest is membership in a group
- Retained vs. non-returning students
- Admits who matriculate vs. those who don't
- Alums who donate vs. those who don't
- Faculty who get grants vs. those who don't
- Institutions that get sanctioned for assessment on their accreditation visits vs. those that don't
- Class sections that meet during the day vs. evening
12. Three Basic Questions
- Which, if any, of the variables are useful for predicting group membership?
- What is the best combination of variables to optimize predictions among the original sample?
- How useful is that combination for classifying new cases?
13. One or More Groups
- Group outcomes can be dichotomous or polychotomous
- Logistic regression and discriminant analysis can handle both
- We will focus on the dichotomous case, with only lip service to the polychotomous situation
14. Examining Group Differences
- Why not a t-test or ANOVA?
- The group factor is the dependent variable (outcome), not the independent variable (predictor, causal agent, etc.)
- No random assignment to group
- But we always violate that assumption
- Requires a normal distribution of the outcome
15. The Linear Regression Problem
- Group membership as the outcome (dependent) variable violates an important assumption that has serious consequences
- Under certain conditions the problems are not completely debilitating
- Group membership evenly distributed
- Predictors are all solid continuous/normal variables
16. Two Regression-Based Solutions
- Logistic regression
- Transforms the outcome into a continuous odds ratio
- Readily accommodates continuous and categorical predictors
- Interpretations differ substantially from the OLS linear form
- Includes a classification matrix
- Discriminant analysis
- Uses standard OLS procedures
- Requires continuous/normal predictors
- Interpretations similar to OLS linear regression
- Includes a classification matrix
17. Remember the OLS Linear Form
- Finding the linear equation that best fits the pattern
18. OLS Linear Form
- Overall fit of model
- Significance of model
- Predictor (b) coefficients
19. The Group Outcome Problem
- Y equals either 0 or 1
- Predictions can be > 1 or < 0
- Coefficients may be biased
- Heteroscedasticity is present
- The error term is not normally distributed, so hypothesis tests are invalid
20. The Logistic Regression Solution
- Use a different method for estimating parameters
- Maximum Likelihood Estimation (MLE) instead of OLS
- Maximizes the probability that predicted Y equals observed Y
- Transforms the outcome variable into a form that is continuous/normal
- The natural log of the odds ratio, or logit
- Ln(P/(1-P)) = a + b1x1 + b2x2
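A minimal sketch of this model in code. The workshop uses SPSS; the Python/statsmodels version below is only an illustration, and the data and variable names (hsgpa, fulltime, retained) are made up.

```python
# Hypothetical retention data: the logit model Ln(P/(1-P)) = a + b1x1 + b2x2,
# fit by maximum likelihood rather than OLS.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "hsgpa": rng.normal(3.0, 0.5, n),    # continuous predictor
    "fulltime": rng.integers(0, 2, n),   # dummy (categorical) predictor
})
true_logit = -2.0 + 0.8 * df["hsgpa"] + 0.7 * df["fulltime"]
df["retained"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(df[["hsgpa", "fulltime"]])
fit = sm.Logit(df["retained"], X).fit()  # MLE estimation
print(fit.summary())                     # B, S.E., z, p for each predictor
print(np.exp(fit.params))                # Exp(B): odds multipliers
```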
21. The Odds Ratio
- The probability of being in one group relative to the probability of being in the other group
- If P(group 1) = .5, the odds ratio is 1 (.5/.5)
- If the retention rate is 80%, the odds ratio is 0.8/0.2 = 4 (odds of 4 to 1 of being retained)
- If the yield rate is 67%, the odds ratio is 0.67/0.33 = 2, or 2 to 1
- If 12.5% of alums donate, the odds ratio is 0.125/0.875 = .143, or 1 to 7
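A quick check of this arithmetic in plain Python (no workshop data involved):

```python
def odds(p: float) -> float:
    """Odds for an event with probability p: P / (1 - P)."""
    return p / (1 - p)

print(odds(0.5))    # 1.0    -> even odds
print(odds(0.8))    # 4.0    -> 4 to 1 (80% retention)
print(odds(0.67))   # ~2.03  -> about 2 to 1 (67% yield)
print(odds(0.125))  # ~0.143 -> about 1 to 7 (12.5% donors)
```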
22. Predictors and Coefficients
- Predictors can be continuous or categorical (dummy) variables
- A coefficient shows the change in Ln(P/(1-P)) for a unit change in the predictor
- Can be converted into marginal effects: the effect on the probability that Y = 1 for a unit change in X
- Not easy to explain, but:
- Can talk in general terms (positive, negative, zero)
- Have classification statistics that are more intuitive
23. Logistic Regression in SPSS
24. Retention Example Output
- Omnibus tests
- Model summary
- Classification table
25. Retention Example Output
26. Interpreting Logistic Regression Output
- Omnibus tests
- Overall significance of the model
- Relative performance of one model vs. another
- Model summary
- Goodness-of-fit R2 statistics
- Several versions, none of which are true R2 values
- Classification table
- Ability to successfully classify from the prediction
- Remember: the prediction is a probability that then has to be categorized
27. Interpreting Coefficients
- B value is the change in ln(odds ratio) for a unit change in the predictor
- S.E. is the error in the predictor estimate
- Relates to significance and estimation
- The Wald statistic is like the t-value in OLS linear regression
- Has a corresponding significance level
- Can be incorrect for large coefficients
- Exp(B) is the odds multiplier
- The factor by which the odds that Y = 1 change for a unit change in X
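A small worked example of reading Exp(B), with a made-up coefficient, showing that it multiplies the odds rather than the probability:

```python
import math

B = 0.693                   # hypothetical logistic coefficient
exp_b = math.exp(B)         # Exp(B) ~ 2.0: a unit change doubles the odds

# The probability change depends on the baseline; e.g., starting from p = 0.5:
base_odds = 0.5 / (1 - 0.5)        # 1.0
new_odds = base_odds * exp_b       # ~2.0
new_p = new_odds / (1 + new_odds)  # ~0.667 (not 2 * 0.5 = 1.0)
print(round(exp_b, 2), round(new_p, 3))
```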
28. Interpreting Coefficients
- A unit change in SATACT increases the odds that the student will be retained by a factor of 1, that is, not at all
- A full-time student is twice as likely to be retained as a part-time student
- A unit (full letter grade) change in GPA increases the odds that a student will be retained by more than a factor of 2
29. On Your Own
- Try different variables or entry methods on the retention data
- Predict admissions yield status with the application data set
- Predict full- vs. part-time faculty status with the faculty data set
- Distinguish between two Carnegie categories on the institutional data set
- Don't forget to select only two groups
30. Questions and Answers
31. Discriminant Analysis
- Closer to linear regression in form and interpretation
- Predictors must be continuous or dichotomous
- Logistic regression, by contrast, can handle polychotomous categorical variables
- Can be used for a multi-group outcome
- Generates k-1 orthogonal solutions for k groups
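A minimal sketch of a two-group discriminant analysis. The workshop uses the SPSS DISCRIMINANT procedure; scikit-learn below is only a stand-in, and the data are simulated.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Simulated groups (0 = not retained, 1 = retained); columns: GPA, SAT-like score
X = np.vstack([rng.normal([2.6, 450], [0.5, 50], (200, 2)),
               rng.normal([3.1, 520], [0.5, 50], (300, 2))])
y = np.array([0] * 200 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)           # discriminant coefficients (sign = direction of effect)
print(lda.score(X, y))     # overall classification accuracy
print(lda.predict(X[:5]))  # predicted group membership for the first five cases
```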
32. Requirements for Discriminant Analysis
- Two or more groups
- At least two cases per group
- Any number of discriminating variables, provided that it is less than the total number of cases minus 2
- Discriminating variables are on an interval or ratio scale
33. Requirements for Discriminant Analysis
- No discriminating variable can be a linear combination of any other discriminating variable
- The covariance matrices for each group must be approximately equal, unless special formulas are used
- Each group must have been drawn from a population with a multivariate normal distribution on the discriminating variables
34. Interpreting Group Differences
- Which variables best predict differences?
- What is the best combination of predictors? The canonical discriminant function
35. The Discriminant Function
- Derivation
- As in regression, maximize the between-groups sum of squares relative to the within-groups sum of squares for the value D
- Interpretation
- Overall function statistics
- Predictor variable statistics
36. Retention Example
- Overall model statistics
- Eigenvalue / canonical correlation
- Wilks' Lambda
- Wilks' Lambda = 1 - (canonical correlation)^2
37. Predictor Coefficients
- Standardized discriminant coefficients
- Variables with the largest (absolute) coefficients contribute most to the prediction of group membership
- The sign is the direction of the effect
38. Retention Coefficients
- Structure coefficients
- Correlation between each predictor and the overall discriminant function
39. Classification in Discriminant
- Prior probabilities
- Can be .5 or set by the size of the group
40. Classification in Discriminant
- Accuracy within each group as well as overall
41. The Classification Matrix
- Comparing actual group membership against predicted group membership, using the classification function
- Can have an "unknown" region
- Split samples can (should?) be used to further test the accuracy of classification
42. The Classification Matrix
- In the 2x2 table, let a = true positives, b = false positives, c = false negatives, d = true negatives
- Measures of interest include
- Overall prediction accuracy
- (a+d)/N
- Sensitivity: accuracy among positive cases
- a/(a+c)
- Specificity: accuracy among negative cases
- d/(b+d)
- False positive rate
- b/(b+d)
- False negative rate
- c/(a+c)
43. The Confusion Matrix
- But that's not all
- Prevalence: (a+c)/N
- Overall diagnostic power: (b+d)/N
- Positive predictive power: a/(a+b)
- Negative predictive power: d/(c+d)
- Misclassification rate: (b+c)/N
- Odds ratio: (ad)/(bc)
- Kappa: ((a+d) - (((a+c)(a+b) + (b+d)(c+d))/N)) / (N - (((a+c)(a+b) + (b+d)(c+d))/N))
- NMI: 1 - (-a·ln(a) - b·ln(b) - c·ln(c) - d·ln(d) + (a+b)·ln(a+b) + (c+d)·ln(c+d)) / (N·ln(N) - ((a+c)·ln(a+c) + (b+d)·ln(b+d)))
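A sketch of these measures gathered into one helper function (plain Python; the counts in the example call are invented, and the NMI line assumes all four cells are non-zero):

```python
import math

def classification_measures(a: int, b: int, c: int, d: int) -> dict:
    """2x2 measures; a = TP, b = FP, c = FN, d = TN."""
    n = a + b + c + d
    chance = ((a + c) * (a + b) + (b + d) * (c + d)) / n  # expected agreement
    h = lambda *xs: sum(x * math.log(x) for x in xs)      # sum of x * ln(x) terms
    return {
        "accuracy": (a + d) / n,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "prevalence": (a + c) / n,
        "overall_diagnostic_power": (b + d) / n,
        "positive_predictive_power": a / (a + b),
        "negative_predictive_power": d / (c + d),
        "misclassification_rate": (b + c) / n,
        "odds_ratio": (a * d) / (b * c),
        "kappa": ((a + d) - chance) / (n - chance),
        "nmi": 1 - (-h(a, b, c, d) + h(a + b, c + d))
                   / (n * math.log(n) - h(a + c, b + d)),
    }

print(classification_measures(a=120, b=30, c=20, d=80))
```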
44. Adjusting for Prior Probabilities or the "Costs" of Misclassification
- The methods so far have considered each group equally
- One can take into account known differences in group composition
- This usually takes one of two forms
- Prior information regarding the likely distribution of group sizes
- Known higher "costs" of misclassifying objects into one group compared to the other
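A brief sketch of how priors shift classification, again with scikit-learn standing in for SPSS and the same style of simulated data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.6, 450], [0.5, 50], (200, 2)),
               rng.normal([3.1, 520], [0.5, 50], (300, 2))])
y = np.array([0] * 200 + [1] * 300)

lda_size = LinearDiscriminantAnalysis().fit(X, y)  # priors from group sizes (default)
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)  # equal priors

# Borderline cases can flip groups when the priors change.
print((lda_size.predict(X) != lda_equal.predict(X)).sum(), "cases reclassified")
```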
45. .5 vs. Group Size Cutoffs: Discriminant
46. .5 vs. Group Size Cutoffs: Logistic
47. Logistic vs. Discriminant Classification
- Classification measure calculator for 2x2 tables
- http://members.aol.com/johnp71/ctab2x2.html
48. On Your Own
- Rerun your logistic regressions as discriminant analyses
- Play with different cutoff conditions
- .5 vs. predicted from group size for discriminant
- Set your own value for logistic regression
49. Questions and Answers
50. Logistic vs. Discriminant
- Logistic
- Accommodates categorical predictors with > 2 groups
- Fewer assumptions
- More robust
- Easier to use?
- Discriminant
- Easier to interpret?
- More classification features
- Can accommodate costs of misclassification
51. Reporting Results
- Logistic regression coefficients (and their anti-logs) are difficult to convey graphically
- Positive impacts (values above 1) range considerably
- Negative impacts (values below 1) have a limited range
- Delta-P is an alternative
- The change in the probability of the outcome given a unit change in the predictor
52. Reporting Results
- The classification table and some of the related measures are usually the most effective way to convey the usefulness of results
- As with all higher-level analyses, the most important point is to interpret in the context of real decisions
- E.g., the impact of changing a selection index cutoff in terms of entering class size and predicted change in retention rate
53. Some Reasonable Examples
- Smith and Nielsen, Longwood College
- http://www.longwood.edu/assessment/Retention_Analysis.htm
- DePauw University
- http://www.depauw.edu/admin/i_research/research/year1_02supp.pdf
54. Good Night!
- Read Chapter 5 of the RIR Stats Volume
- To reinforce lessons for today and tomorrow
- If you are having trouble falling asleep
55. Cluster Analysis
- Any of a wide variety of numerical procedures that can be used to create a classification scheme
- Conceptually easy to understand and well suited to segmentation studies
- It is a heuristic algorithm, not supported by extensive statistical reasoning
- It is entirely data driven
- Sometimes yields inconsistent results
56. Cluster Analysis
- Creating groups out of whole cloth
- Drawing circles around points scattered in n-dimensional space
57. What Is a Cluster?
- A set of objects, or points, that are relatively close to each other and relatively far from points in other clusters
- This view tends to favor spherical clusters over ones of other shapes
58. Steps to Cluster Analysis
- Selecting variables
- Selecting a similarity or distance measure
- Choosing a clustering algorithm
59. Selecting Variables
- The most popular forms are based on measures of "similarity" according to some combination of attributes
- The choice of variables is one of the most critical steps
- Should be guided by an explicit theory, or at least solid reasoning
- Higher education researchers typically have ready access to certain types of student characteristics
60. Choosing a Similarity Measure
- Distance measures: spatial relationship
- Association measures: similarities or dissimilarities, using measures of association (e.g., correlation, contingency tables)
- The type of variable constrains the choice
- Nominal variables require either association coefficients or a decision-tree technique
- Continuous variables lend themselves to distance-type measures
61. Distance-Type Measures
- Several are cases of what is called the Minkowski metric
- Euclidean distance (r = 2; the straight-line distance between two points in n-dimensional space)
- City-block metric (r = 1; the sum of absolute differences along each measure)
62. Distance-Type Measures
- Another common distance measure is Mahalanobis D^2, which takes into account correlations among the predictors
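A small sketch of these distances using SciPy, with made-up two-variable points:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, mahalanobis

x = np.array([3.2, 550.0])
y = np.array([2.8, 480.0])

print(euclidean(x, y))   # Minkowski metric with r = 2
print(cityblock(x, y))   # Minkowski metric with r = 1

# Mahalanobis distance needs the inverse covariance matrix of the parent data.
data = np.random.default_rng(2).normal([3.0, 500.0], [0.4, 60.0], (100, 2))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(x, y, VI))
```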
63. Standardized vs. Unstandardized Measures
- One must be careful about the implications of using standardized vs. unstandardized measures in computing these distances
- Variables on larger scales (e.g., SAT scores vs. GPA) will dominate an unstandardized distance
64. Matching-Type Measures
- Association coefficients
- The only game in town when the predictors are nominally scaled
- The predictor variables are usually converted to binary indicators
- Similarity coefficients are a form of matching-type measure based on a series of binary variables that represent the presence or absence of a trait
65. Contingency-Table-Based Similarity Coefficients
- Possible coefficients differ according to
- How negative matches (0,0) are incorporated
- Whether matched pairs are equally weighted or doubled
- Whether unmatched pairs carry twice the weight of matched pairs
- Whether negative matches are excluded altogether
66. Contingency-Table Measures
- For two binary profiles, let a = 1-1 matches, b and c = mismatches, d = 0-0 matches
- (a+d)/(a+b+c+d): matching coefficient
- a/(a+b+c+d): Russell/Rao index
- a/(a+b+c): Jaccard coefficient
- 2a/(2a+b+c): Dice's coefficient
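These four coefficients in a short helper function (plain Python; the two example profiles are invented):

```python
def binary_similarities(u: list[int], v: list[int]) -> dict:
    """Similarity coefficients for two equal-length binary profiles."""
    a = sum(x == 1 and y == 1 for x, y in zip(u, v))  # 1-1 matches
    b = sum(x == 1 and y == 0 for x, y in zip(u, v))  # mismatches
    c = sum(x == 0 and y == 1 for x, y in zip(u, v))  # mismatches
    d = sum(x == 0 and y == 0 for x, y in zip(u, v))  # 0-0 matches
    n = a + b + c + d
    return {
        "matching": (a + d) / n,
        "russell_rao": a / n,
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
    }

print(binary_similarities([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```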
67. Correlation Coefficients
- Pearson r, Spearman r, etc.
- Correlation is computed across variables and between each pair of objects
(Figure: across-variable correlation vs. standard across-person correlation)
68. The Distance Matrix
- Regardless of method, the first step in cluster analysis is to produce a distance matrix
- A row and column for each object
- Cells represent the distance or similarity measure between each pair
- Symmetric, with a diagonal of 0's for distance matrices or 1's for similarity measures
- This is what makes cluster analyses like these so computationally intensive
69. Choosing a Clustering Algorithm
- Hierarchical algorithms
- Agglomerative methods start with each object in its own cluster and then merge points and clusters until some criterion is reached
- Single linkage (nearest neighbor)
- Complete linkage (furthest neighbor)
- Average linkage
- Ward's error sum of squares
70. Choosing a Clustering Algorithm
- Hierarchical algorithms (continued)
- Divisive methods start with the whole as one group and partition objects into smaller clusters until some criterion is reached
- Splinter-average distance
- Decision tree methods
- Partitioning algorithms
- K-means clustering
- Trace-based methods
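A sketch of both algorithm families on made-up standardized data (SciPy and scikit-learn standing in for the SPSS procedures):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(50, 4))  # 50 objects, 4 z-scored variables

# Agglomerative: method can be "single", "complete", "average", or "ward",
# matching the linkage options listed above.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree at 4 clusters

# Partitioning: k-means with k = 4.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(hier_labels[:10], km_labels[:10])
```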
71. Peer Institution Example
- Variables derived from IPEDS
72. Create a Proximity Matrix
- Screen institutions to a manageable number (< 300)
- Select Classify > Hierarchical Cluster
- Place predictors in the Variables box
- Under Method, choose Z scores in the standardize box (by variable)
- Paste the syntax
- Erase the Cluster procedure
- Change the proximity matrix file name so you can find it
- Run it
73. Using the Proximity Matrix
- Find the target institution (sort by name)
- Identify its varname and find the target column
- Get rid of excess columns
- Sort (ascending) by the varname column
- VOILA! Institutions are now sorted by similarity to the target
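A rough Python equivalent of this SPSS workflow, with hypothetical IPEDS-style columns and institution names:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# One row per institution; the columns and values are invented for illustration.
df = pd.DataFrame(
    {"enrollment": [30000, 8000, 22000, 29000],
     "pct_parttime": [35.0, 10.0, 28.0, 33.0],
     "research_exp_musd": [250.0, 20.0, 180.0, 240.0]},
    index=["IUPUI", "Small College", "State U", "Urban U"])

z = (df - df.mean()) / df.std()            # standardize by variable (z scores)
dist = pd.DataFrame(squareform(pdist(z)),  # Euclidean proximity matrix
                    index=df.index, columns=df.index)

# Sort all institutions by distance to the target: smallest = most similar.
print(dist["IUPUI"].sort_values())
```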
74. Graphical Clustering Methods
- Glyphs, metroglyphs, Fourier series, and Chernoff faces
75. Decision Trees
- A hybrid between clustering and discriminant analysis
- The criterion variable does not define the groups
- But the groups are defined so as to maximize differences according to the criterion
- The purpose is to identify key variables for distinguishing among groups and to formulate group membership prediction rules
76. Functions of Decision Trees
- Derive decision rules from data
- Develop a classification system to predict future observations
- Illustrate these through a decision tree
- Discretize continuous variables
77. SPSS AnswerTree
- Three decision tree algorithms
- All use "brute force" methods
78. Common Features
- Merging categories of the predictor variables so that non-significantly different values are pooled together
- Splitting the variables at points that maximize differences
- Stopping the branching when further splits do not contribute significantly
- Pruning branches from an existing tree
- Validation and error estimation
79. CHAID
- Not just binary splits
- Handles nominal, ordinal, and continuous variables
- Useful for discretizing continuous variables
- Demo and sample output included within the session support materials
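CHAID itself is not in the common Python libraries, so as a rough stand-in the sketch below grows a CART tree (binary splits, where CHAID can split multiway) on invented retention-style data and prints its decision rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(2.9, 0.5, 400),   # semester GPA (hypothetical)
                     rng.integers(0, 2, 400)])    # full-time flag (hypothetical)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 3.3).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=["gpa", "fulltime"]))  # readable split rules
```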
81. Playing with CHAID, etc.
- Work with the retention dataset
- Use retention status as the criterion
- Use semester GPA as the criterion
- Try the newer institutional dataset
- Use graduation rate as the criterion
- Try a nearest neighbor analysis with the newer data
82. Final Thoughts
- The flexibility of logistic regression models makes them the coin of the realm
- E.g., multinomial logistic and HLM regression
- Cluster analysis is so data driven as to make its use fairly limited
- The threshold approach to peer identification is much more popular, but it's always good to run things multiple ways (see the IR Primer peer chapter)
- CHAID is fun to play with and informative