Title: Data Mining Techniques: Clustering
1. Data Mining Techniques: Clustering
2. Purpose
- In clustering analysis, there is no pre-classified data.
- Instead, clustering analysis is a process in which a set of objects is partitioned into several clusters.
- All members of one cluster are similar to each other and different from the members of other clusters, according to some similarity metric (e.g., the opposite of the distance between objects).
3. Cluster Analysis
[Figure: customers (objects) plotted on two variables, X (Income) and Y (Age), with a group of nearby points marked as a cluster]
4. Cluster Analysis
Data Matrix (n objects × p variables)
Dissimilarity Matrix (n × n)
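The step from the data matrix to the dissimilarity matrix can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the dissimilarity; the function name `dissimilarity_matrix` and the three sample customers are hypothetical, not taken from the slides.

```python
import math

def dissimilarity_matrix(data):
    """Return the n x n matrix of pairwise Euclidean distances.

    `data` is an n x p data matrix: a list of n rows (objects),
    each holding p numeric variables.
    """
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(data[i], data[j])
            d[i][j] = d[j][i] = dist  # symmetric, with a zero diagonal
    return d

# Hypothetical data matrix: 3 customers described by (income, age).
customers = [(30, 25), (33, 29), (70, 60)]
D = dissimilarity_matrix(customers)
for row in D:
    print([round(v, 2) for v in row])
```

Because the matrix is symmetric with zeros on the diagonal, only the n(n-1)/2 entries below (or above) the diagonal carry information.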
5. Attribute Types Involved in Cluster Analysis
- Interval Variables
  - An interval variable contains continuous measurements (e.g., height, weight, temperature, cost) that follow a linear scale.
  - It is essential that intervals keep the same importance throughout the scale.
- Nominal Variables
  - A nominal variable takes on more than two states. For example, a person's eye color can be blue, brown, green, or grey.
  - These states may be coded as 1, 2, ..., M; however, their order and the interval between any two states have no meaning.
6. Attribute Types Involved in Cluster Analysis
- Ordinal Variables
  - An ordinal variable takes on more than two states. For example, you may ask someone to rate his/her appreciation of some paintings using the following categories: 1 = detest, 2 = dislike, 3 = indifferent, 4 = like, and 5 = admire.
  - In an ordinal variable, the states are ordered in a meaningful sequence. However, the intervals between consecutive states are not necessarily equal.
- Binary Variables
  - Binary variables have only two possible states. For example, the gender of a person is either female or male.
7. Dissimilarity (Distance) Measure
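As an illustration of how a dissimilarity might be computed for each attribute type, here is a minimal sketch. The specific measures are assumptions chosen because they are common, not necessarily the ones these slides present: Euclidean and Manhattan distance for interval variables, simple matching for nominal (or symmetric binary) variables, and rank rescaling for ordinal variables. All function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors of interval variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences between two vectors of interval variables."""
    return sum(abs(x - y) for x, y in zip(a, b))

def simple_matching(a, b):
    """Fraction of nominal (or symmetric binary) variables on which a and b differ."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

def ordinal_to_interval(rank, num_states):
    """Rescale a rank in 1..M to [0, 1] so it can be treated like an interval value."""
    return (rank - 1) / (num_states - 1)

print(euclidean((0, 0), (3, 4)))                                   # interval variables
print(manhattan((0, 0), (3, 4)))                                   # interval variables
print(simple_matching(("blue", "female"), ("brown", "female")))    # nominal variables
print(ordinal_to_interval(4, 5))                                   # "like" on the 5-state scale
```

Interval variables measured in very different units (for example income and age) are often standardized before a distance is computed, so that no single variable dominates the result.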
16. Categorization of Clustering Methods
- Exclusive vs. Non-Exclusive (Overlapping)
- Hierarchical Methods vs. Partitioning Methods
- Hierarchical Methods
  - Single Link Method
  - Complete Link Method
- Partitioning Methods
  - Kohonen Self-Organizing Feature Maps
  - K-Means Methods
  - K-Medoids Methods (PAM, CLARA, CLARANS)
- Density-Based Methods
17. Hierarchical Methods
Dissimilarity Matrix (5 × 5)
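The single link and complete link methods can both be sketched as agglomerative merging over a dissimilarity matrix. The code below is a minimal, unoptimized illustration; the function name `agglomerative` and the 5 × 5 matrix (built from five hypothetical positions on a line) are assumptions, not the matrix from the slide.

```python
def agglomerative(dist, num_clusters, linkage="single"):
    """Merge the two closest clusters until num_clusters remain.

    dist: symmetric n x n dissimilarity matrix (list of lists).
    linkage: "single" scores two clusters by their smallest pairwise
             distance, "complete" by their largest.
    """
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index a, index b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = pick(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical 5 x 5 dissimilarity matrix: five objects on a line
# with slowly growing gaps (20, 21, 22, 23).
positions = [0, 20, 41, 63, 86]
D = [[abs(p - q) for q in positions] for p in positions]

print(agglomerative(D, 2, "single"))    # [[0, 1, 2, 3], [4]]
print(agglomerative(D, 2, "complete"))  # [[0, 1], [2, 3, 4]]
```

On this data single link chains neighbouring objects into one long cluster, while complete link produces two more compact groups, which is the usual qualitative difference between the two methods.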
18. K-Means Methods
Sensitive to outliers!
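A minimal sketch of K-Means, written here as the standard assign-then-update iteration (Lloyd's algorithm), shows why the method reacts strongly to outliers: each centroid is a mean, so one extreme point can drag it far away. The function name `k_means`, the fixed initial centroids, and the sample points are all hypothetical.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and mean-update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated hypothetical groups, then the same data plus one outlier.
clean = [(1, 1), (2, 1), (3, 1), (8, 1), (9, 1), (10, 1)]
start = [(1, 1), (10, 1)]

centroids_clean, _ = k_means(clean, start)
centroids_outlier, clusters_outlier = k_means(clean + [(100, 1)], start)

print(centroids_clean)    # [(2.0, 1.0), (9.0, 1.0)]
print(centroids_outlier)  # [(5.5, 1.0), (100.0, 1.0)]
```

With the outlier added, one centroid ends up serving the outlier alone and the two natural groups collapse into a single cluster. K-Medoids methods reduce this effect by using an actual object, rather than a mean, as the cluster representative.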
22. Exercise 7
Number of clusters: 2. Use Single Link, Complete Link, and K-Means to cluster the following data.
Object  X   Y
1       22  60
2       40  25
3       60  30
4       64  66
5       80  30
6       82  55
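As a starting point for the exercise, the pairwise dissimilarities of the six objects can be tabulated. This sketch assumes Euclidean distance, which the exercise does not specify; the variable names are hypothetical.

```python
import math

# Exercise data: object id -> (X, Y).
objects = {1: (22, 60), 2: (40, 25), 3: (60, 30),
           4: (64, 66), 5: (80, 30), 6: (82, 55)}

ids = sorted(objects)
# Pairwise Euclidean distances, keyed by (i, j) with i < j.
distances = {(i, j): math.dist(objects[i], objects[j])
             for i in ids for j in ids if i < j}

# The closest pair is the first merge under both single and complete link.
closest = min(distances, key=distances.get)

for pair, d in sorted(distances.items()):
    print(pair, round(d, 2))
print("closest pair:", closest)
```

From this table, single link and complete link differ only in how the distance between two merged groups is scored (smallest versus largest pairwise entry), while K-Means additionally needs two initial centroids to be chosen.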