Title: Cluster Analysis
1Cluster Analysis
- Jay E. Aronson
- Professor of MIS
- MIS_at_Terry College of Business
- The University of Georgia
- Brooks Hall
- Athens, GA 30602-6273
- jaronson_at_terry.uga.edu
- 706/542-0991
2Cluster Analysis
- Classification problems
- Analysis methods
- Common in biology, medicine, genetics, social sciences, anthropology, archeology
- MIS Systems Development
- Sorting
- To target your best customers
- Identify GROUPS of your best customers
3Cluster Analysis
- Very important set of methods
- Only 476,000 Web sites found
- Related to
- Discriminant analysis
- Maximum diversity problem
- Data mining
4Cluster Analysis Definition
- Exploratory data analysis tool for solving classification problems.
- Objective: sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
- Each cluster describes the class to which its members belong
5Cluster Analysis
6Constellations
- Star Charts
- Natural groupings of stars
7Cluster Analysis
- A tool of discovery
- May reveal associations and structure in data which, though not previously evident, nevertheless are sensible and useful once found
8Cluster Analysis Results May
- Contribute to the definition of a formal classification scheme, such as a taxonomy for related customers, animals, insects or plants
- Suggest statistical models with which to describe populations
- Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
- Provide measures of definition, size and change in what previously were only broad concepts
- Find exemplars to represent classes
9Cluster Analysis Examples -1
- Character recognition logic in OCR readers
- Automated mass health screening of blood samples
- Typological models for predicting educational attainment
- Survey respondents by personality and attitude data
- Tourist behavior patterns on camping vacations
10Cluster Analysis Examples -2
- Consumer goods by content, brand loyalty or similarity
- Product market typology for tailoring sales strategies
- Retail store layouts and sales performances
- Data base relationships in management information systems
11Relationships to
- Data Mining
- KDD/KDT: Knowledge Discovery in Databases/Text
- OLAP (Online Analytical Processing)
- Customer Relationship Management (CRM)
- Other areas
12Quick Clustering Example 1
- Determine (3) U.S. Federal Tax Brackets that are fair, and generate the most revenue
[Figure: Potential Taxes Paid versus Income]
13Quick Clustering Example 2
- Establish Grades for My Class F 2001
- Find Breakpoints for A, B, C, D, F
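One simple way to look for grade breakpoints is to sort the scores and cut at the widest gaps. The sketch below is only an illustration of that idea: the scores are hypothetical, the function name `gap_breakpoints` is made up for this example, and the slides do not say which method was actually used for the class.

```python
def gap_breakpoints(scores, n_groups):
    """Split 1-D scores into n_groups by cutting at the widest gaps."""
    ordered = sorted(scores, reverse=True)
    # Gap between each score and the next lower one, paired with its position.
    gaps = [(ordered[i] - ordered[i + 1], i) for i in range(len(ordered) - 1)]
    # Positions of the (n_groups - 1) widest gaps, put back in rank order.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: n_groups - 1])
    groups, start = [], 0
    for pos in cuts:
        groups.append(ordered[start : pos + 1])
        start = pos + 1
    groups.append(ordered[start:])
    return groups


# Hypothetical class scores, not data from the slides.
scores = [97, 95, 94, 88, 86, 85, 77, 75, 74, 63, 61, 48, 45]
grade_groups = gap_breakpoints(scores, 5)
for letter, group in zip("ABCDF", grade_groups):
    print(letter, group)
```

Cutting at gaps ignores how many students land in each grade, which is one of the issues raised later in the deck (should clusters have nearly the same number of items?).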
14Clustering Methods
- Statistical
- Hierarchical
- Nonhierarchical
- Optimal
- Neural networks (fraud detection)
- Fuzzy Cluster Analysis
- Genetic Algorithms
- Magic Hat (Rowling, J.K., Harry Potter and the Goblet of Fire, Arthur A. Levine Books, New York, 2000; image from Harry Potter's World of Magic Theme Park Web Site, www.angelfire.com/tx5/worldofmagic, 2002.)
15Quick Clustering Example 3
- Sort the First Year students at Hogwarts School for Wizards into the four Houses (Gryffindor, Ravenclaw, Hufflepuff, and Slytherin): a Sorting Hat.
- Professor McGonagall now placed a three-legged stool on the ground before the first years and, on top of it, an extremely old, dirty patched wizard's hat.
- Then a long tear near the brim opened like a mouth, and the hat broke into song:
- 'Twas Gryffindor who found the way,
- He whipped me off his head
- The founders put some brains in me
- So I could choose instead!
- Now slip me snug about your ears,
- I've never yet been wrong,
- I'll have a look inside your mind
- And tell where you belong!
16Clustering Methods
- Divisive
- All items start in ONE cluster
- Break the clusters apart
- Agglomerative
- All items start in individual clusters
- Join the items together
17Distance (Similarity)
- Similarity or distance matrix
- Rows and columns are the items
- Cell entries are a similarity or distance measure for any pair of cases
- Euclidean distance - most common
- PROBLEMS
- Qualitative measures
- May be more than 2-dimensions
- Scaling
- May not be a Euclidean problem (no metric)
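The scaling problem can be seen in a small sketch. The cases below are hypothetical (age in years, income in dollars), and standardizing each column is just one common choice of rescaling, not something the slides prescribe. The helper names `standardize` and `distance_matrix` are made up for this example.

```python
import math


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Guard against a constant column (standard deviation 0).
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def distance_matrix(rows):
    """Rows and columns are the items; cells are Euclidean distances."""
    return [[math.dist(a, b) for b in rows] for a in rows]


# Hypothetical cases: (age in years, income in dollars).
cases = [[25, 30000], [27, 33000], [60, 31000], [58, 40000]]
raw = distance_matrix(cases)
scaled = distance_matrix(standardize(cases))

# Unscaled, income dominates: case 0 looks closest to the 60-year-old (case 2).
print(raw[0][1] > raw[0][2])
# After standardizing, case 0 is closest to the 27-year-old (case 1).
print(scaled[0][1] < scaled[0][2])
```

Which answer is "right" depends on how much each dimension should count, which is the weighting question the deck returns to later.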
18Euclidean Distance
[Figure: points (X1,Y1) = (2,2) and (X2,Y2) = (4,5) plotted on X and Y axes]
$d = \sqrt{(Y_2-Y_1)^2 + (X_2-X_1)^2}$
$d = \sqrt{(5-2)^2 + (4-2)^2} = \sqrt{9+4} = \sqrt{13} \approx 3.606$
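The same calculation can be checked in a few lines of Python. This is a minimal sketch; the function name `euclidean` is an illustrative choice, and it is written so that it also works in more than two dimensions.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d = euclidean((2, 2), (4, 5))
print(round(d, 3))  # 3.606
```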
19Banking Cluster Analysis Example
- The Co-opérative Desjardins' Movement
- Largest banking institution in Québec
- 1,329 branches, 4.2 million members
- Combined assets in excess of C$80bn
- Reducing tellers' service, increasing ATM use and
other IT methods, and reducing its staff
20More Than Just a Bank
- Current accounts, loans and mortgages
- Other financial products through subsidiaries
- Life and property insurance
- Money transmission, etc.
- Each branch is independent and can decide which of the bank's products and policies to adopt
- The Confédération has to market its products to its own branches and members.
21The Bank Wants a
- Typology of its members
- To retain members' loyalty by designing the best possible financial products to meet their needs
- To capture more market share by identifying profitable services which satisfy members' needs and improve market penetration
22Hierarchical Cluster Analysis at the Bank
- Sample of 16,000 members
- 16 variables that reflect the characteristics of financial transaction patterns
- 30 member types were identified
23Next
- Single-pass identification model
- Classify all 4.2m Québec members
- Compared each with the classification
- Found the cluster of best fit (after 10-12 comparisons)
- Provided similarity between each member and all 30 clusters
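The idea of assigning each member to its best-fitting cluster can be sketched as a nearest-profile lookup. This is an assumption-laden illustration, not the bank's method: the cluster names and profiles are hypothetical, only two standardized variables are used instead of 16, and every cluster is compared directly, whereas the 10-12 comparisons reported above suggest the real model searched more cleverly.

```python
import math


def classify(member, profiles):
    """Return the best-fitting cluster and the distance to every cluster."""
    distances = {name: math.dist(member, p) for name, p in profiles.items()}
    best = min(distances, key=distances.get)
    return best, distances


# Hypothetical cluster profiles on two standardized variables.
profiles = {
    "savers": (0.2, 1.5),
    "borrowers": (1.4, -0.3),
    "transactors": (-1.1, -0.8),
}
best, distances = classify((1.0, 0.1), profiles)
print(best)
for name, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(name, round(dist, 2))
```

Keeping the full set of distances, not just the winner, is what lets a manager see members who sit between two or more types.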
24Bank Marketing
- Financial managers could identify members whose financial transactions were clearly of one type, or possibly a combination of two or more types
- Estimated the profitability of each transaction cluster and of individual customer accounts
- Portfolio management by branch managers
- Provided valuable information for market segmentation and marketing
25Bank Results 1
- Members with large transaction volumes through one account with capital or a loan elsewhere
- Can suggest a more economical consolidating approach (keep the customers satisfied)
- Can suggest better diversification of members' capital
- (Utilizing a C$60,000 guarantee by the Québec government available to all of the bank's members)
- Generates MORE INCOME
26Bank Results 2
- Results useful for marketing
- Bank can focus on products with the best financial performance
- Reduce direct mailing costs
- Increase response rates by targeting product promotions at those customer types most likely to respond
- Achieve better branding and customer retention
- The bank can retain and win the business of more profitable customers at lower costs
27Generalized Hierarchical Clustering Method (Sequential Clustering)
- 1. Decide which data to record from your cases
- 2. Calculate the distance between all initial clusters. Store the results in a distance matrix
- 3. Search through the distance matrix and find the two most similar clusters
28Continued
- 4. Fuse these two clusters to produce a cluster that now has at least 2 cases
- 5. Calculate the distances between this new cluster and all other clusters (which may contain only one case)
- 6. Repeat step 3 until all cases are in one cluster
291. Decide which data to record from your cases
- Think carefully about data types
- It is very difficult to mix data types
- Interval data (weight) cannot be easily combined
with attribute data (blood group)
302. Calculate the distance between all initial
clusters. Store the results in a distance matrix
- Typically, initial clusters are individual items
313. Search through the distance matrix and find
the two most similar clusters
- Break ties with a pre-determined rule
324. Fuse these two clusters to produce a cluster
that now has at least 2 cases
- The number of clusters decreases by one
335. Calculate the distances between this new
cluster and all other clusters (which may contain
only one case)
- No need to recalculate all the distances
- Only those involving the new cluster will have
changed
346. Repeat step 3 until all cases are in one
cluster
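Steps 2 through 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the distance matrix is hypothetical, the function name `sequential_clustering` is made up, and the linkage rule (how the distance between two multi-item clusters is defined) is a choice the slides leave open; single linkage (`min`) is used here by default. For brevity the sketch recomputes cluster distances from the original matrix instead of updating a stored matrix as step 5 describes.

```python
def sequential_clustering(dist, n_clusters=1, linkage=min):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the final clusters (lists of 0-based item indices) and the
    merge history as (distance, merged cluster) pairs.
    """
    clusters = [[i] for i in range(len(dist))]  # step 2: one item per cluster
    history = []

    def cluster_distance(a, b):
        # Linkage rule: min = single linkage, max = complete linkage.
        return linkage(dist[i][j] for i in a for j in b)

    while len(clusters) > n_clusters:  # step 6: repeat
        # Step 3: find the two most similar clusters.
        # Tie-break rule: keep the first pair found.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        # Step 4: fuse the pair; the number of clusters drops by one.
        merged = sorted(clusters[x] + clusters[y])
        history.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, history


# Hypothetical distance matrix for five items.
dist = [
    [0, 2, 9, 10, 8],
    [2, 0, 8, 9, 7],
    [9, 8, 0, 3, 6],
    [10, 9, 3, 0, 5],
    [8, 7, 6, 5, 0],
]
clusters, history = sequential_clustering(dist, n_clusters=2)
print(sorted(clusters))
for d, merged in history:
    print(d, merged)
```

The merge history records the distance at which each fusion happened, which is one place to look when deciding where to stop.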
35Question/Issue
- When do you stop?
- If you cluster everything, which stage indicates what the real groupings are?
- There is a need to identify optimal clusters!
36Hands-On Exercise 1
- Using sequential clustering
- (start with 8 clusters of 1 item each)
- Find 3 clusters in the following data
37Hands-On Exercise 1Similarity (Distance) Matrix
38Solution
- 1 3 6
- 2 8
- 4 5 7
- Note: No objective VALUE of the solution is possible
- Cannot draw a picture/graph with just the matrix; a graph is needed, but what if there are more than 2 dimensions?
39Issues
- Should the clusters all have (nearly) the same number of items?
- What about dimensional scaling?
- Can distance really be measured?
- How do you determine how many clusters to find? (When to stop?)
40Optimal Clustering
- Initially developed for use in MIS systems development
- Involves a mathematical model that describes the clustering situation
- Does NOT require a Euclidean metric
- Does require a similarity matrix
- The items in each group interact pairwise
- Maximizes the total similarity measure of the items in the groups
41Optimal Clustering Applications
- MIS Development/Design
- Assembly Line Balancing
- Seating wedding guests
42Optimal Cluster Analysis
43Optimal Cluster Analysis
44Optimal Cluster Analysis Model
- Maximize the sum of the pairwise interactions of all items in each group
- subject to
- Each item must be in exactly one group
- Each item may not be split (integer 0-1)
- Solved by an efficient integer branch-and-bound algorithm (Aronson and Klein)
45May Also Consider
- The number of clusters to be used
- Cluster size lower and upper limits
- Cluster weight lower and upper limits
- Precedence relationships
- item 4 must be in a cluster with a lower number
than item 7, and must precede it in assignment
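The optimal clustering model, with a fixed number of clusters and an upper limit on cluster size, can be sketched as a brute-force search. This is not the branch-and-bound algorithm of Aronson and Klein: it simply tries every assignment, so it is only usable for a handful of items. The similarity matrix and the function names are hypothetical.

```python
from itertools import product


def partition_value(groups, sim):
    """Sum of pairwise similarities between items that share a group."""
    return sum(sim[i][j] for g in groups for a, i in enumerate(g) for j in g[a + 1:])


def best_partition(sim, n_groups, max_size):
    """Exhaustive search over all assignments of items to groups."""
    n = len(sim)
    best_value, best_groups = None, None
    # Each labels tuple puts every item in exactly one group (no splitting).
    for labels in product(range(n_groups), repeat=n):
        groups = [[i for i in range(n) if labels[i] == k] for k in range(n_groups)]
        # Skip assignments with an empty or oversized group.
        if any(len(g) == 0 or len(g) > max_size for g in groups):
            continue
        value = partition_value(groups, sim)
        if best_value is None or value > best_value:
            best_value, best_groups = value, groups
    return best_value, best_groups


# Hypothetical similarity matrix for five items (larger = more similar).
sim = [
    [0, 9, 1, 2, 1],
    [9, 0, 2, 1, 3],
    [1, 2, 0, 8, 7],
    [2, 1, 8, 0, 6],
    [1, 3, 7, 6, 0],
]
value, groups = best_partition(sim, n_groups=2, max_size=3)
print(value, groups)  # 30 [[0, 1], [2, 3, 4]]
```

Without a size limit, non-negative similarities would push nearly every item into a single group, which is one reason the size limits matter.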
46Hands-On Exercise 2 Optimal Clustering
Similarity Matrix
47Solution Trial 1
- 1 2 3, Value 3+9+4 = 16
- 4 5 6, Value 10+2+4 = 16
- 7 8, Value 5
- TOTAL 37
48Solution Trial 2
- 1 3 4, Value 9+2+5 = 16
- 3 5 6, Value 7+10+4 = 21
- 7 8, Value 5
- TOTAL 42
49Solution (Optimal)
- 1 3 6, Value 9+6+10 = 25
- 2 8, Value 8
- 4 5 7, Value 10+8+9 = 27
- TOTAL 60
- Note: Has a measurable objective VALUE
50Issues
- The weighting scheme: how to determine scaling of the dimensions
51Cluster Analysis Software Sampler
- ClustanGraphics 5: Graphical display of clusters, and others (www.clustan.com)
- DecisionWORKS Suite, Advanced Software Applications (www.asacorp.com)
- SPSS (www.spss.com)
- PolyAnalyst (Cluster Engine) by Megaputer (www.megaputer.com)
- Sokal (see Hand, 1981)
52ClustanGraphics 5 -1
- Cluster: hierarchical cluster analysis on a data matrix
- Hierarchy: hierarchical cluster analysis on a proximity matrix
- Centroid: centroid forming method using a proximity matrix
- Density: hierarchical density-seeking method
- Divide: hierarchical divisive clustering on binary variables
53ClustanGraphics 5 -2
- Classify: identifies new cases by traversing a tree
- Normix: maximum likelihood estimation of multivariate normal mixture
- Invariant: iterative optimization of Wilks' Lambda or Hotelling's Trace
- Mode: finds the modes in a sample density
- Relocate: iterative reallocation to clusters (k-means)
- Kdend: seeks Bk overlapping clusters
- Dndrite: division of minimum spanning tree to minimize sum of squares
- Euclid: fuzzy clustering to minimize sum of squares
54ClustanGraphics 5 -3
- Read similarity matrix: reads a proximity matrix
- Calculate similarity matrix: calculates a proximity matrix
- Print results: output of cluster diagnostic results
- Scatter plots: scatter and cluster diagrams
- Plink: plots hierarchical clustering trees
- Rules: significance tests for best partition
- Compare: compares hierarchical classifications
55CRM 1 Customer Resource/Relationship Management
- Organize your customer base hierarchically into nested sales channels
- Identify types of customers, and their critical needs
- Focus on your most profitable clusters
- Identify their critical needs, and why they are profitable
- Hone marketing more specifically in these directions
56CRM -2
- Talk to your best customers in each cluster
- Understand how they view your company's products and services
- Identify strengths and weaknesses
- Determine how to capitalize on key strengths
57CRM -3
- Look at the customer clusters which represent poor performance or high costs
- Does profitability justify the investment?
- Convert to more profitable customers?
- Drop some? (BE CAREFUL)
58Fraud Detection (Your Worst Customers)
- Credit card fraud
- E-commerce fraud
- How to detect?
- Look for unusual patterns/clusters in sales
- If a purchase (data point) falls outside of a cluster by a certain amount, contact the customer
- Neural networks
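The rule "outside of a cluster by a certain amount" can be sketched as a distance test against a customer's past purchases. Everything specific here is an assumption for illustration: the purchase history, the two variables (amount in dollars, hour of day), the function name `is_suspicious`, and the factor of 2 are all hypothetical, and the variables are left unscaled to keep the example short.

```python
import math


def is_suspicious(purchase, history, factor=2.0):
    """Flag a purchase that lies far outside the cluster of past purchases."""
    # Center of the customer's past purchases.
    centroid = [sum(col) / len(col) for col in zip(*history)]
    # Cluster radius: distance from the center to the farthest past purchase.
    radius = max(math.dist(p, centroid) for p in history)
    return math.dist(purchase, centroid) > factor * radius


# Hypothetical history for one customer: (amount in dollars, hour of day).
history = [(20, 12), (35, 13), (25, 18), (40, 19), (30, 12)]
print(is_suspicious((900, 3), history))   # True
print(is_suspicious((45, 14), history))   # False
```

A flagged purchase is only a prompt to contact the customer, not proof of fraud; the threshold trades missed fraud against false alarms.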
59Cluster Analysis Summary
- Many methods
- Many important applications, especially for Revenue Management
60Some Source Materials -1
- Aldenderfer, Mark S. and Roger K. Blashfield, Cluster Analysis, Sage Publications, Thousand Oaks, CA, Quantitative Applications in the Social Sciences Series No. 44, 1984.
- Aronson, Jay E. and Gary Klein, "A Clustering Algorithm for Computer-Assisted Process Organization," Decision Sciences, 20, 4, Fall 1989, 730-745.
- Aronson, Jay E. and Lakshmi S. Iyer, "Cluster Analysis," Encyclopedia of Operations Research and Management Science, 2nd ed., Gass, Saul I. and Carl M. Harris (eds.), Kluwer Academic Publishers, Norwell, MA, 2001.
- Clustan Web Site (www.clustan.com), Clustan Ltd.
- Corter, James E., Tree Models of Similarity and Association, Sage Publications, Thousand Oaks, CA, Quantitative Applications in the Social Sciences Series No. 112, 1996.
- Duda, R., P. Hart, and D. Stork, Pattern Classification, 2nd edition, John Wiley & Sons, Inc., New York, 1998.
- Fielding, Alan, Cluster Analysis Web Pages, Department of Biological Sciences, Manchester Metropolitan University, Manchester, UK, 149.170.199.144
- Fuzzy Cluster Analysis Web Pages (www.fuzzy-clustering.com.de)
61Some Source Materials -2
- Garson, David, Cluster Analysis Web Pages, Department of Public Administration, North Carolina State University, Raleigh, NC (www2.chass.ncsu.edu/garson/pa765)
- Goulet, Michel and David Wishart, "Classifying a bank's customers to improve their financial services," Conference of the Classification Society of North America (CSNA), University of Massachusetts, Amherst, MA, June 1996.
- Hand, D., Discrimination and Classification, Wiley, New York, 1981.
- Kachigan, Sam K., Multivariate Statistical Analysis, Radius Press, New York, 1982. See Chapter 8.
- Kaufman, Leonard and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
- Klein, Gary and Jay E. Aronson, "Optimal Clustering: A Model and Method," Naval Research Logistics, 38, 1, 1991, 1-15.
- Mulvey, J. and H. Crowder, "Cluster Analysis: An Application of Lagrangian Relaxation," Management Science, Vol. 25, 1979, 329-340.
- Romesburg, H., Cluster Analysis for Researchers, Lifetime Learning Publications, Belmont, CA, 1984.
- Swift, Ronald S., Accelerating Customer Relationships, Prentice Hall PTR, Upper Saddle River, NJ, 2001.
- Zupan, J., Clustering of Large Data Sets, Research Studies Press, New York, 1982.
62Thank You