Cluster Analysis - PowerPoint PPT Presentation

Slides: 63
Provided by: jayear
Transcript and Presenter's Notes

Title: Cluster Analysis


1
Cluster Analysis
  • Jay E. Aronson
  • Professor of MIS
  • MIS_at_Terry College of Business
  • The University of Georgia
  • Brooks Hall
  • Athens, GA 30602-6273
  • jaronson_at_terry.uga.edu
  • 706/542-0991

2
Cluster Analysis
  • Classification problems
  • Analysis methods
  • Common in biology, medicine, genetics, social
    sciences, anthropology, archeology
  • MIS Systems Development
  • Sorting
  • To target your best customers
  • Identify GROUPS of your best customers

3
Cluster Analysis
  • Very important set of methods
  • Only 476,000 Web sites found
  • Related to
  • Discriminant analysis
  • Maximum diversity problem
  • Data mining

4
Cluster Analysis Definition
  • Exploratory data analysis tool for solving
    classification problems.
  • Objective: to sort cases (people, things, events,
    etc.) into groups, or clusters, so that the
    degree of association is strong between members
    of the same cluster and weak between members of
    different clusters.
  • Each cluster describes the class to which its
    members belong

5
Cluster Analysis
6
Constellations
  • Star Charts
  • Natural groupings of stars

7
Cluster Analysis
  • A tool of discovery
  • May reveal associations and structure in data
    which, though not previously evident,
    nevertheless are sensible and useful once found

8
Cluster Analysis Results May
  • Contribute to the definition of a formal
    classification scheme, such as a taxonomy for
    related customers, animals, insects or plants
  • Suggest statistical models with which to describe
    populations
  • Indicate rules for assigning new cases to classes
    for identification, targeting, and diagnostic
    purposes
  • Provide measures of definition, size and change
    in what previously were only broad concepts
  • Find exemplars to represent classes

9
Cluster Analysis Examples -1
  • Character recognition logic in OCR readers
  • Automated mass health screening of blood samples
  • Typological models for predicting educational
    attainment
  • Survey respondents by personality and attitude
    data
  • Tourist behavior patterns on camping vacations

10
Cluster Analysis Examples -2
  • Consumer goods by content, brand loyalty or
    similarity
  • Product market typology for tailoring sales
    strategies
  • Retail store layouts and sales performances
  • Database relationships in management information
    systems

11
Relationships to
  • Data Mining
  • KDD/KDT: Knowledge Discovery in Databases/Text
  • OLAP (Online Analytical Processing)
  • Customer Relationship Management (CRM)
  • Other areas

12
Quick Clustering Example 1
  • Determine (3) U.S. Federal Tax Brackets that are
    fair, and generate the most revenue

[Figure: potential taxes paid plotted against income]

13
Quick Clustering Example 2
  • Establish grades for my class (Fall 2001)
  • Find Breakpoints for A, B, C, D, F
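
One simple way to look for such breakpoints, offered here as an illustrative assumption rather than the presenter's method, is to sort the scores and cut at the widest gaps between neighbouring scores. The function name and the class scores below are invented for the sketch.

```python
def breakpoints_by_gaps(scores, n_groups):
    """Split scores into n_groups by cutting at the largest gaps (a heuristic)."""
    ordered = sorted(scores)
    # Gap between each pair of neighbouring scores, remembered by position.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Keep the n_groups - 1 widest gaps, then restore left-to-right order.
    widest = sorted(gaps, reverse=True)[:n_groups - 1]
    cut_positions = sorted(i for _, i in widest)
    groups, start = [], 0
    for pos in cut_positions:
        groups.append(ordered[start:pos + 1])
        start = pos + 1
    groups.append(ordered[start:])
    return groups


# Hypothetical class scores, invented for illustration.
scores = [95, 93, 91, 84, 82, 81, 72, 70, 61, 45]
grade_groups = breakpoints_by_gaps(scores, 5)  # lowest group (F) first
```

The groups come back from lowest to highest, so the first list would be the F range and the last the A range. Nothing forces the groups to be similar in size, which is one of the issues raised later in the deck.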

14
Clustering Methods
  • Statistical
  • Hierarchical
  • Nonhierarchical
  • Optimal
  • Neural networks (fraud detection)
  • Fuzzy Cluster Analysis
  • Genetic Algorithms
  • Magic Hat (Rowling, J.K., Harry Potter and the
    Goblet of Fire, Arthur A. Levine Books, New York,
    2000; image from Harry Potter's World of Magic
    Theme Park Web Site,
    www.angelfire.com/tx5/worldofmagic, 2002.)

15
Quick Clustering Example 3
  • Sort the First Year students at Hogwarts School
    for Wizards into the four Houses (Gryffindor,
    Ravenclaw, Hufflepuff, and Slytherin) with a
    Sorting Hat.
  • Professor McGonagall now placed a three-legged
    stool on the ground before the first years and,
    on top of it, an extremely old, dirty patched
    wizard's hat.
  • Then a long tear near the brim opened like a
    mouth, and the hat broke into song
  • 'Twas Gryffindor who found the way,
  • He whipped me off his head
  • The founders put some brains in me
  • So I could choose instead!
  • Now slip me snug about your ears,
  • I've never yet been wrong,
  • I'll have a look inside your mind
  • And tell where you belong!

16
Clustering Methods
  • Divisive
  • All items start in ONE cluster
  • Break the clusters apart
  • Agglomerative
  • All items start in individual clusters
  • Join the items together

17
Distance (Similarity)
  • Similarity or distance matrix
  • Rows and columns are the items
  • Cell entries are a similarity or distance measure
    for any pair of cases
  • Euclidean distance - most common
  • PROBLEMS
  • Qualitative measures
  • May have more than two dimensions
  • Scaling
  • May not be a Euclidean problem (no metric)
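
As a minimal sketch of the matrix described above, the code below builds a Euclidean distance matrix with Python's standard `math.dist` and adds an optional min-max rescaling step as one possible answer to the scaling problem. The helper names, the rescaling choice, and the three sample points are assumptions made for illustration.

```python
import math


def rescale(points):
    """Min-max rescale every dimension to [0, 1] so no single unit dominates."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    # A constant column has zero span; divide by 1.0 instead to avoid an error.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def distance_matrix(points):
    """Rows and columns are the items; each cell is a pairwise Euclidean distance."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


# Hypothetical cases measured on two variables.
points = [(2, 2), (4, 5), (2, 6)]
D = distance_matrix(points)
D_scaled = distance_matrix(rescale(points))
```

The matrix is symmetric with zeros on the diagonal, so only the cells above (or below) the diagonal carry information. Qualitative variables would need a different measure altogether; this sketch only covers numeric data.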

18
Euclidean Distance
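  • Distance between $(X_1, Y_1)$ and $(X_2, Y_2)$:
    $\sqrt{(Y_2 - Y_1)^2 + (X_2 - X_1)^2}$
  • Example with $(X_1, Y_1) = (2, 2)$ and $(X_2, Y_2) = (4, 5)$:
    $\sqrt{(5 - 2)^2 + (4 - 2)^2} = \sqrt{9 + 4} = \sqrt{13} \approx 3.606$
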
19
Banking Cluster Analysis Example
  • The Co-opérative Desjardins' Movement
  • Largest banking institution in Québec
  • 1,329 branches, 4.2 million members
  • Combined assets in excess of C$80bn
  • Reducing tellers' service, increasing ATM use and
    other IT methods, and reducing its staff

20
More Than Just a Bank
  • Current accounts, loans and mortgages
  • Other financial products through subsidiaries
  • Life and property insurance
  • Money transmission, etc.
  • Each branch is independent and can decide which
    of the bank's products and policies to adopt
  • The Confédération has to market its products to
    its own branches and members.

21
The Bank Wants a
  • Typology of its members
  • To retain members' loyalty by designing the best
    possible financial products to meet their needs
  • To capture more market share by identifying
    profitable services which satisfy members' needs
    and improve market penetration

22
Hierarchical Cluster Analysis at the Bank
  • Sample of 16,000 members
  • 16 variables that reflect the characteristics of
    financial transaction patterns
  • 30 member types were identified

23
Next
  • Single-pass identification model
  • Classify all 4.2m Québec members
  • Compared each with the classification
  • Found the cluster of best fit (after 10-12
    comparisons)
  • Provided similarity between each member and all
    30 clusters
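
The identification step can be sketched as a nearest-cluster lookup. This version compares a case with every cluster centre, which is simpler than the 10-12 comparisons reported on the slide (the transcript does not say how that shortcut was achieved). The function name, the use of Euclidean distance to cluster centres, and the sample numbers are assumptions for illustration.

```python
import math


def classify(case, centroids):
    """Return (index of best-fit cluster, distance from the case to every cluster)."""
    distances = [math.dist(case, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances


# Hypothetical cluster centres on two transaction variables (the bank used 16).
centroids = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
member = (4.0, 4.5)
best, distances = classify(member, centroids)
```

Keeping the full list of distances, not just the winner, is what lets a manager see members who sit between two or more types.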

24
Bank Marketing
  • Financial managers could identify members whose
    financial transactions were clearly of one type,
    or possibly a combination of two or more types
  • Estimated the profitability of each transaction
    cluster and of individual customer accounts
  • Portfolio management by branch managers
  • Provided valuable information for market
    segmentation and marketing

25
Bank Results 1
  • Members with large transaction volumes through
    one account with capital or a loan elsewhere
  • Can suggest a more economical consolidating
    approach (keep the customers satisfied)
  • Can suggest better diversification of members'
    capital
  • (Utilizing a C$60,000 guarantee by the Québec
    government available to all of the bank's
    members)
  • Generates MORE INCOME

26
Bank Results 2
  • Results useful for marketing
  • Bank can focus on products with the best
    financial performance
  • Reduce direct mailing costs
  • Increase response rates by targeting product
    promotions at those customer types most likely to
    respond
  • Achieve better branding and customer retention
  • The bank can retain and win the business of more
    profitable customers at lower costs

27
Generalized Hierarchical Clustering Method
(Sequential Clustering)
  • 1. Decide which data to record from your cases
  • 2. Calculate the distance between all initial
    clusters. Store the results in a distance matrix
  • 3. Search through the distance matrix and find
    the two most similar clusters

28
Continued
  • 4. Fuse these two clusters to produce a cluster
    that now has at least 2 cases
  • 5. Calculate the distances between this new
    cluster and all other clusters (which may contain
    only one case)
  • 6. Repeat steps 3 to 5 until all cases are in one
    cluster
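
The six steps can be sketched in a few lines of Python. The slides do not say how the distance between two multi-item clusters is defined, so this sketch assumes single linkage (distance between the closest pair of members) and breaks ties by lowest cluster position; both choices, the function name, and the sample data are assumptions.

```python
def sequential_clustering(dist, n_clusters=1):
    """Agglomerative clustering on a symmetric distance matrix (single linkage)."""
    # Step 2: every item starts as its own cluster.
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def linkage(a, b):
        # Assumed rule: distance between the closest pair of members.
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > n_clusters:
        # Step 3: find the two closest clusters; tuple order breaks ties.
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        # Step 4: fuse them; the number of clusters drops by one.
        merged = clusters[p] + clusters[q]
        merges.append((d, sorted(merged)))
        # Step 5 is implicit: linkage() recomputes distances on demand.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return sorted(sorted(c) for c in clusters), merges


# Hypothetical data: five items placed on a line.
positions = [0, 1, 5, 6, 20]
dist = [[abs(p - q) for q in positions] for p in positions]
final, merges = sequential_clustering(dist, n_clusters=3)
```

Calling the function with `n_clusters=1` runs step 6 to completion and returns the whole merge history, which is the information a dendrogram would display.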

29
1. Decide which data to record from your cases
  • Think carefully about data types
  • It is very difficult to mix data types
  • Interval data (weight) cannot be easily combined
    with attribute data (blood group)

30
2. Calculate the distance between all initial
clusters. Store the results in a distance matrix
  • Typically, initial clusters are individual items

31
3. Search through the distance matrix and find
the two most similar clusters
  • Break ties with a pre-determined rule

32
4. Fuse these two clusters to produce a cluster
that now has at least 2 cases
  • The number of clusters decreases by one

33
5. Calculate the distances between this new
cluster and all other clusters (which may contain
only one case)
  • No need to recalculate all the distances
  • Only those involving the new cluster will have
    changed
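
A small sketch of this update, assuming single linkage (the new cluster's distance to any other cluster is the smaller of the two old distances) and a dict-of-dicts distance table; the linkage rule, the function name, and the numbers are illustrative assumptions.

```python
def merge_and_update(dist, a, b):
    """Fuse clusters a and b in place and return the new cluster's label."""
    new = a + "+" + b
    others = [k for k in dist if k not in (a, b)]
    # Only distances that involve the new cluster are computed.
    dist[new] = {k: min(dist[a][k], dist[b][k]) for k in others}
    for k in others:
        dist[k][new] = dist[new][k]
        # Entries between untouched clusters stay exactly as they were.
        del dist[k][a], dist[k][b]
    del dist[a], dist[b]
    return new


# Hypothetical distances between four single-item clusters.
dist = {
    "A": {"B": 2, "C": 6, "D": 10},
    "B": {"A": 2, "C": 5, "D": 9},
    "C": {"A": 6, "B": 5, "D": 4},
    "D": {"A": 10, "B": 9, "C": 4},
}
label = merge_and_update(dist, "A", "B")
```

After the call, the C-to-D entry is untouched and only the row and column for the fused cluster are new.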

34
6. Repeat steps 3 to 5 until all cases are in one
cluster
  • Process summary

35
Question/Issue
  • When do you stop?
  • If you cluster everything, which stage indicates
    what the real groupings are?
  • There is a need to identify optimal clusters!
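
One common rule of thumb, not taken from the slides, is to stop just before the largest jump in fusion distance, on the reasoning that a big jump means two dissimilar groups were forced together. A minimal sketch, with a hypothetical function name and invented fusion distances:

```python
def suggest_cluster_count(fusion_distances):
    """Suggest how many clusters to keep, given merge distances in merge order."""
    n_items = len(fusion_distances) + 1  # n items need n - 1 merges
    jumps = [
        fusion_distances[i + 1] - fusion_distances[i]
        for i in range(len(fusion_distances) - 1)
    ]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    # Stop after merge number biggest + 1, i.e. just before the big jump.
    return n_items - (biggest + 1)


# Hypothetical fusion distances for five items (four merges).
k = suggest_cluster_count([1.0, 1.2, 9.0, 10.5])
```

This is only a heuristic. It can mislead when distances grow steadily, which is part of why the question on this slide has no single answer.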

36
Hands-On Exercise 1
  • Using sequential clustering
  • (start with 8 clusters of 1 item each)
  • Find 3 clusters in the following data

37
Hands-On Exercise 1: Similarity (Distance) Matrix

38
Solution
  • Items 1, 3, 6
  • Items 2, 8
  • Items 4, 5, 7
  • Note: No objective VALUE of the solution is
    possible
  • Cannot draw a picture/graph with just the matrix;
    a graph is needed, but what if there are more
    than two dimensions?

39
Issues
  • Should the clusters all have (nearly) the same
    number of items?
  • What about dimensional scaling?
  • Can distance really be measured?
  • How do you determine how many clusters to find?
    (When to stop?)

40
Optimal Clustering
  • Initially developed for use in MIS systems
    development
  • Involves a mathematical model that describes the
    clustering situation
  • Does NOT require a Euclidean metric
  • Does require a similarity matrix
  • The items in each group interact pairwise
  • Maximizes the total similarity measure of the
    items in the groups

41
Optimal Clustering Applications
  • MIS Development/Design
  • Assembly Line Balancing
  • Seating wedding guests

42
Optimal Cluster Analysis
43
Optimal Cluster Analysis
44
Optimal Cluster Analysis Model
  • Maximize the sum of the pairwise interactions of
    all items in each group
  • subject to
  • Each item must be in exactly one group
  • Each item may not be split (integer 0-1)
  • Solved by an efficient integer branch-and-bound
    algorithm (Aronson and Klein)
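
Written out, the model above might look as follows. The notation is an assumption made for this sketch and is not necessarily that of Aronson and Klein: $s_{ij}$ is the similarity between items $i$ and $j$, $K$ is the number of groups, and $x_{ik} = 1$ when item $i$ is placed in group $k$.

```latex
\begin{aligned}
\max \quad & \sum_{k=1}^{K} \sum_{i<j} s_{ij}\, x_{ik}\, x_{jk} \\
\text{subject to} \quad & \sum_{k=1}^{K} x_{ik} = 1 \quad \text{for every item } i, \\
& x_{ik} \in \{0, 1\} \quad \text{for every item } i \text{ and group } k.
\end{aligned}
```

The product $x_{ik} x_{jk}$ equals 1 only when items $i$ and $j$ share group $k$, so the objective counts exactly the within-group pairs.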

45
May Also Consider
  • The number of clusters to be used
  • Cluster size lower and upper limits
  • Cluster weight lower and upper limits
  • Precedence relationships
  • Example: item 4 must be in a cluster with a lower
    number than item 7, and must precede it in
    assignment

46
Hands-On Exercise 2: Optimal Clustering
Similarity Matrix

47
Solution Trial 1
  • 1 2 3, Value 3 + 9 + 4 = 16
  • 4 5 6, Value 10 + 2 + 4 = 16
  • 7 8, Value 5
  • TOTAL 37

48
Solution Trial 2
  • 1 3 4, Value 9 + 2 + 5 = 16
  • 3 5 6, Value 7 + 10 + 4 = 21
  • 7 8, Value 5
  • TOTAL 42

49
Solution (Optimal)
  • 1 3 6, Value 9 + 6 + 10 = 25
  • 2 8, Value 8
  • 4 5 7, Value 10 + 8 + 9 = 27
  • TOTAL 60
  • Note: Has a measurable objective VALUE
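
The objective value of any trial grouping is just the sum of the pairwise similarities inside each group. The sketch below computes it and, for very small problems, finds the best grouping by brute force. This is not the branch-and-bound algorithm cited in the deck, and because the exercise's similarity matrix is not in the transcript, the 5-item matrix, the function names, and the `max_size` limit are all invented for illustration.

```python
from itertools import combinations, product


def partition_value(sim, groups):
    """Sum of pairwise similarities between items that share a group."""
    return sum(sim[i][j] for g in groups for i, j in combinations(sorted(g), 2))


def best_partition(sim, n_groups, max_size):
    """Exhaustive search over assignments; only practical for a handful of items."""
    n = len(sim)
    best_value, best_groups = float("-inf"), None
    for labels in product(range(n_groups), repeat=n):
        groups = [[i for i in range(n) if labels[i] == k] for k in range(n_groups)]
        # Skip assignments with an empty or oversized group.
        if any(not g or len(g) > max_size for g in groups):
            continue
        value = partition_value(sim, groups)
        if value > best_value:
            best_value, best_groups = value, groups
    return best_value, best_groups


# Hypothetical symmetric similarity matrix for five items (0-based indices).
sim = [
    [0, 9, 1, 2, 1],
    [9, 0, 2, 1, 8],
    [1, 2, 0, 7, 1],
    [2, 1, 7, 0, 3],
    [1, 8, 1, 3, 0],
]
value, groups = best_partition(sim, n_groups=2, max_size=3)
```

The size limit matters: with non-negative similarities and no upper bound, the search would simply pile almost every item into one group.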

50
Issues
  • The weighting scheme: how to determine scaling
    of the dimensions

51
Cluster Analysis Software Sampler
  • ClustanGraphics 5: graphical display of
    clusters, and others (www.clustan.com)
  • DecisionWORKS Suite, Advanced Software
    Applications (www.asacorp.com)
  • SPSS (www.spss.com)
  • PolyAnalyst (Cluster Engine) by Megaputer
    (www.megaputer.com)
  • Sokal (see Hand, 1981)

52
ClustanGraphics 5 -1
  • Cluster: hierarchical cluster analysis on a data
    matrix
  • Hierarchy: hierarchical cluster analysis on a
    proximity matrix
  • Centroid: centroid forming method using a
    proximity matrix
  • Density: hierarchical density-seeking method
  • Divide: hierarchical divisive clustering on
    binary variables

53
ClustanGraphics 5 -2
  • Classify: identifies new cases by traversing a
    tree
  • Normix: maximum likelihood estimation of
    multivariate normal mixture
  • Invariant: iterative optimization of Wilks'
    Lambda or Hotelling's Trace
  • Mode: finds the modes in a sample density
  • Relocate: iterative reallocation to clusters
    (k-means)
  • Kdend: seeks Bk overlapping clusters
  • Dndrite: division of minimum spanning tree to
    minimize sum of squares
  • Euclid: fuzzy clustering to minimize sum of
    squares

54
ClustanGraphics 5 -3
  • Read similarity matrix: reads a proximity matrix
  • Calculate similarity matrix: calculates a
    proximity matrix
  • Print results: output of cluster diagnostic
    results
  • Scatter plots: scatter and cluster diagrams
  • Plink: plots hierarchical clustering trees
  • Rules: significance tests for best partition
  • Compare: compares hierarchical classifications

55
CRM -1: Customer Resource/Relationship Management
  • Organize your customer base hierarchically
    into nested sales channels
  • Identify types of customers, and their critical
    needs
  • Focus on your most profitable clusters
  • Identify their critical needs, and why they are
    profitable
  • Hone marketing more specifically in these
    directions

56
CRM -2
  • Talk to your best customers in each cluster,
  • Understand how they view your company's products
    and services
  • Identify strengths and weaknesses
  • Determine how to capitalize on key strengths

57
CRM -3
  • Look at the customer clusters which represent
    poor performance or high costs
  • Does profitability justify the investment?
  • Convert to more profitable customers?
  • Drop some? (BE CAREFUL)

58
Fraud Detection (Your Worst Customers)
  • Credit card fraud
  • E-commerce fraud
  • How to detect?
  • Look for unusual patterns/clusters in sales
  • If a purchase (data point) falls outside of a
    cluster by a certain amount, contact the
    customer
  • Neural networks
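
The "falls outside of a cluster by a certain amount" test can be sketched as a distance check against the cluster's centre. The rule used here (flag a purchase whose distance from the centroid exceeds the typical distance by more than `threshold` standard deviations), the function name, and the purchase data are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def is_unusual(purchase, cluster, threshold=3.0):
    """True if the purchase lies far outside the customer's usual cluster."""
    centroid = tuple(mean(dim) for dim in zip(*cluster))
    # How far the customer's normal purchases sit from their own centre.
    radii = [math.dist(p, centroid) for p in cluster]
    cutoff = mean(radii) + threshold * stdev(radii)
    return math.dist(purchase, centroid) > cutoff


# Hypothetical purchase history: (amount in dollars, hour of day).
history = [(20, 12), (25, 13), (22, 11), (18, 14), (24, 12)]
flag_big = is_unusual((400, 3), history)
flag_normal = is_unusual((23, 12), history)
```

In practice the variables would need rescaling first, since a dollar amount and an hour of day are not on comparable scales.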

59
Cluster Analysis Summary
  • Many methods
  • Many important applications,
    especially for Revenue Management

60
Some Source Materials -1
  • Aldenderfer, Mark S. and Roger K. Blashfield,
    Cluster Analysis. Sage Publications, Thousand
    Oaks, CA, Quantitative Applications in the Social
    Sciences Series No. 44, 1984.
  • Aronson, Jay E. and Gary Klein, "A Clustering
    Algorithm for Computer-Assisted Process
    Organization," Decision Sciences, 20, 4, Fall
    1989, 730-745.
  • Aronson, Jay E. and Lakshmi S. Iyer, "Cluster
    Analysis," Encyclopedia of Operations Research
    and Management Science, 2nd ed., Gass, Saul I.
    and Carl M. Harris (eds.), Kluwer Academic
    Publishers, Norwell, MA, 2001.
  • Clustan Web Site (www.clustan.com), Clustan Ltd.
  • Corter, James E., Tree Models of Similarity and
    Association, Sage Publications, Thousand Oaks,
    CA, Quantitative Applications in the Social
    Sciences Series No. 112, 1996.
  • Duda, R., P. Hart, and D. Stork, Pattern
    Classification, 2nd edition. John Wiley & Sons,
    Inc., New York, 1998.
  • Fielding, Alan, Cluster Analysis Web Pages,
    Department of Biological Sciences, Manchester
    Metropolitan University, Manchester, UK,
    149.170.199.144
  • Fuzzy Cluster Analysis Web Pages
    (www.fuzzy-clustering.com.de)

61
Some Source Materials -2
  • Garson, David, Cluster Analysis Web Pages,
    Department of Public Administration North
    Carolina State University, Raleigh, NC
    (www2.chass.ncsu.edu/garson/pa765)
  • Goulet, Michel and David Wishart, Classifying a
    bank's customers to improve their financial
    services, Conference of the Classification
    Society of North America (CSNA), University of
    Massachusetts, Amherst, MA, June 1996.
  • Kachigan, Sam K., Multivariate statistical
    analysis, Radius Press, New York, 1982. See
    Chapter 8.
  • Kaufman, Leonard and Peter J. Rousseeuw, Finding
    Groups in Data: An Introduction to Cluster
    Analysis, Wiley, New York, 1990.
  • Klein, Gary and Jay E. Aronson, "Optimal
    Clustering: A Model and Method," Naval Research
    Logistics, 38, 1, 1991, 1-15.
  • Hand, D., Discrimination and Classification,
    Wiley, New York, 1981.
  • Mulvey, J. and H. Crowder, "Cluster Analysis: An
    Application of Lagrangian Relaxation," Management
    Science, Vol. 25, 1979, 329-340.
  • Romesburg, H., Cluster Analysis for Researchers,
    Lifetime Learning Publications, Belmont, CA,
    1984.
  • Swift, Ronald S., Accelerating Customer
    Relationships, Prentice Hall PTR, Upper Saddle
    River, NJ, 2001.
  • Zupan, J., Clustering of Large Data Sets,
    Research Studies Press, New York, 1982.

62
Thank You
  • Questions/Comments?