Clustering of Non-Quantitative Data

Transcript and Presenter's Notes

1
Nanjing University of Science and Technology
Clustering of Non-Quantitative Data
Lonnie C. Ludeman
New Mexico State University, Las Cruces, New Mexico, USA
Nov 23, 2005
2
Like to Thank
Nanjing University of Science and Technology
Department of Computer Science
especially
Lu Jian-feng Yang Jing-yu Wang Han
for inviting me to come to NUST
3
Also like to Thank two special students
Wang Qiong Wang Huan
for their helpfulness and kindness during my
tenure at NUST
4
Consider the problem of clustering a standard
American deck of playing cards
Two Possible Solutions
5
Two More Possible Solutions
Clustering of Non-Quantitative Data
6
Clustering is the art of grouping together
pattern vectors that in some sense belong
together because they have similar
characteristics and are somehow different from
other pattern vectors.
In the most general problem the number of
clusters or subgroups is unknown as are the
properties that make them similar.
7
Lecture Topics
  1. General Concept of Clustering
  2. Define Types of Data
  3. Review the K-Means Clustering Algorithm
  4. Describe Clustering Method for Non-Quantitative
    data
  5. Present two examples illustrating the method
  6. Discuss advantages and disadvantages of the
    method

8
Mathematical Formalization of the Clustering
Problem
Given a set S of $N_S$ n-dimensional pattern
vectors
$S = \{ x_j ;\ j = 1, 2, \ldots, N_S \}$
Clustering is the process of partitioning S into
M subsets, $Cl_k$, $k = 1, 2, \ldots, M$, called
clusters, that satisfy the following conditions.
9
Properties of a Clustering Partition
1. The members in each subset are in some sense
similar and not similar to members in the other
subsets.
2. $Cl_k \neq \Phi$ (Not empty)
3. $Cl_k \cap Cl_j = \Phi$ for $k \neq j$ (Pairwise disjoint)
4. $\bigcup_{k=1}^{M} Cl_k = S$ (Exhaustive)
$\Phi$ is the Null Set
10
Illustration of Clusters and Cluster centers
11
For quantitative data we use measures of
similarity and dissimilarity between pattern
samples and clusters
Euclidean Distance between two n-dimensional pattern
vectors x and y
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
The smaller the distance, the larger the similarity
12
A few Methods for Clustering
of Quantitative Data are
1. K-Means Clustering Algorithm
2. Hierarchical Clustering Algorithm
3. ISODATA Clustering Algorithm
4. Fuzzy Clustering Algorithm

13
K-Means Clustering Algorithm
Basic Procedure
1. Randomly select K cluster centers from the
pattern space or data set.
2. Distribute the set of patterns to the cluster
centers using minimum distance.
3. Compute new cluster centers for each cluster.
4. Continue this process until the cluster
centers do not change.
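The basic procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the lecture: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to keep an old center when a cluster becomes empty are all assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-Means on tuples of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # select K cluster centers from the data set
    labels = []
    for _ in range(max_iter):
        # distribute each pattern to the nearest center (squared Euclidean distance)
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(x, centers[c])))
            for x in points
        ]
        # compute the new center of each cluster as the mean of its members
        new_centers = []
        for c in range(k):
            members = [x for x, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # assumption: keep the old center if empty
        # stop when the cluster centers do not change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

Note that the centers returned here are averages and so are generally not members of the data set.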
14
Three Main Types of Data
1. Quantitative
2. Qualitative or categorical
3. Mixed Quantitative and Qualitative
The first two types can be further broken down
into special categories as follows
15
Quantitative Data
16
Qualitative or Categorical Data
17
Nominal Non-Binary: Examples
  Color of eyes: { blue, brown, black, green, gray }
  Car Owned: { Ford, Toyota, Kia, Renault, Mercedes }
Nominal Binary: Examples
  Answer to a true/false question: { True, False }
  Position of a switch: { on, off }
18
Linearly Ordered (Ordinal) Qualitative
  Answer to a question about a person's health:
  { excellent, very good, average, fair, poor }
Hierarchical or Structurally Ordered Qualitative
  Answer to type of figure:
  { rectangle, triangle, hexagon, circle, ellipse }
19
Hierarchical Structured Qualitative Data- Answer
to type of figure
20
Lattice Structurally Ordered Qualitative
  Answer to type of education:
  { Elementary school, High school, Apprenticeship,
  Undergraduate school, On-the-job training,
  Graduate school, Post-graduate school }
21
Measure of Performance for Clustering of
Categorical Data
Overall performance measure J for a given set of
clusters $Cl_k$, $k = 1, 2, \ldots, K$
$J = \sum_{k=1}^{K} \sum_{x_i \in Cl_k} d(x_i, M_k)$
where $M_k$ is the kth cluster Representative Vector
and $d(x_i, M_k)$ is the measure of distance for data
vectors
22
We wish to minimize J by the selection of the
representative element of each cluster and the
elements of each cluster (the partition of the
data set). This overall performance measure J
can be minimized by a two-stage iterative process
similar to the steps given in the standard
K-Means algorithm.
23
Proposed Modified K-Means Clustering
for Qualitative Data
- Basic Procedure
1. Randomly select K cluster centers from the
pattern space or data set.
2. Distribute the set of patterns to the cluster
centers using minimum distance.
3. Compute new cluster centers for each cluster.
4. Continue this process until the cluster
centers do not change.
24
Proposed Modified K-Means Algorithm
(Step 0) Selection of Initial Cluster Centers.
There are many ways to select the initial
cluster centers. Perhaps the
simplest way is to select a set of
sequences in the data set randomly.
25
(Step 1) Redistribution of Sequences to Cluster
Centers.
We have chosen to redistribute each
sequence to the cluster center that is its
nearest neighbor. Thus, each vector is assigned
to the closest cluster center, where closest is
with respect to some predefined distance measure.
26
(Step 2) Selection of Cluster Centers
Choose the sequence in the cluster that has
the smallest sum of distances from the
sequence to all other sequences in the
cluster. Resolve ties randomly
The fact that the cluster center is selected in
this way, always a member of the data set,
contrasts with the standard K-means algorithm for
numerical data, where the cluster center is not
necessarily a member of the original data set
because it represents the average of the points
in each cluster.
Steps (1) and (2) are repeated until the cluster
centers do not change
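Steps 0 to 2 can be sketched as follows, working only from a precomputed table of pairwise distances. This is a hedged illustration rather than the author's implementation: the name `modified_kmeans`, the `seed` and `max_iter` parameters, and the rule of keeping the current center when it is among the tied candidates (so that the loop can settle) are assumptions of this sketch; the toy distance table is invented for the example.

```python
import random

def modified_kmeans(dist, k, seed=0, max_iter=100):
    """dist[i][j] is the distance between samples i and j.
    Returns (centers, labels); centers are indices of data-set members."""
    rng = random.Random(seed)
    n = len(dist)
    centers = rng.sample(range(n), k)  # Step 0: random members as initial centers
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each sample to its nearest cluster center
        labels = [min(range(k), key=lambda c: dist[i][centers[c]]) for i in range(n)]
        # Step 2: new center = member with the smallest sum of distances
        new_centers = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_centers.append(centers[c])  # assumption: keep center if cluster is empty
                continue
            sums = {i: sum(dist[i][j] for j in members) for i in members}
            best = min(sums.values())
            ties = [i for i in members if sums[i] == best]
            # ties are resolved randomly, except that a tied current center is kept
            new_centers.append(centers[c] if centers[c] in ties else rng.choice(ties))
        if new_centers == centers:  # centers did not change: stop
            break
        centers = new_centers
    return centers, labels

# Toy distance table (hypothetical): samples 0,1 are close, samples 2,3 are close.
toy = [[0, 1, 5, 5],
       [1, 0, 5, 5],
       [5, 5, 0, 1],
       [5, 5, 1, 0]]
centers, labels = modified_kmeans(toy, k=2)
```

Because the centers are always sample indices, the distance table is the only input the procedure needs; the samples themselves can be sequences, categories, or any other non-quantitative data.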
27
Examples Using the Proposed Clustering Algorithm
Example 1: Structural data with missing
components
Example 2: Archaeological Sequential Data
28
Example 1 Missing data sequential
Letting b equal an unknown symbol the above can
be written as
Use the modified K-Means clustering algorithm to
obtain two clusters of the data set.
29
Solution First define a measure of distance
between members of the data sets as
Subjective assignment
Next randomly select cluster centers
30
Redistribute the samples to the cluster centers
Using the defined distance measure assign to the
nearest cluster center
This yields the following new trial clusters
31
Determine new cluster Center for Cluster 1 using
minimum row sum
Therefore since row three has the smallest sum
the new Cluster 1 Center becomes
32
Determine new cluster Center for Cluster 2 using
minimum row sum
Therefore since row two has the smallest sum the
new Cluster 2 center becomes
33
Redistribute to obtain
34
Determine new cluster Center for Cluster 1 using
minimum row sum
Therefore since row three has the smallest sum
the new Cluster 1 center becomes
35
Determine new cluster Center for Cluster 2 using
minimum row sum
These are the same cluster centers as in the previous
iteration; thus the final clustering becomes
36
Tree of possible sequences for Example 1 (figure)
37
Example 2: Clustering of Archaeological data
Depositional Sequence
Sample 1           Sample 2
sherds             general fill
general fill       sherds
whole pots         roof fall
roof fall          wall fall
wall fall          whole pots
floor artifacts    floor artifacts
38
Archaeological Categorical Sequential Data
Euclidean distance is no longer a meaningful, or
for that matter a computable, measure of distance.
Thus, new intra-set distance measures must be
defined, as well as different methods for
selecting representative elements or cluster
centers. Techniques for clustering vectors of
different sizes, or vectors containing sequential
relationships, have not received the attention of
researchers, perhaps because software that
conveniently handles this type of problem is limited.
39
The following code was set up for this example
N  Floors with few or no artifacts (<10)
M  Floors with many artifacts (>10)
U  Layer of unburned roofing material
B  Layer of burned roofing
T  Refuse
S  Deposits of aeolian sand
D  Detritus from the cave roof
?  Unknown deposits
40
Given the Broken Flute Cave Strata Data Set
ID    Seq     ID    Seq     ID    Seq
x1    NB      x5    NSBT    x8a   NSB
x2    NB      x6    MBT     x9    MBTD
x3    NBT     x7    MBT     x11   NSUT
x4    NBT     x8    MBT     x12   NBT
Find four clusters that characterize the data
41
Solution
Define a distance measure as the minimum weighted
number of changes or steps required to transform
one stratigraphic sequence into another.
Transformation rules: (1) addition or
deletion of a stratum, (2) changes in
kind of stratum, and (3) changes in the order of
strata (reversal of order). Use differentially
weighted measures for the transformations, as follows
42
Weights for various transformations
  • 1) The addition or deletion of a stratum, e.g.,
    adding sand (S) or deleting trash (T), was
    weighted the least. Such transformations were
    assigned a distance of 1 unit.
  • 2) Changes in kind, e.g., burned roofing (B) vs.
    unburned roofing (U) strata, were more heavily
    weighted. Each was assigned a distance of 1.5
    units.

43
  • 3) Had we encountered reversals of order, e.g.,
    burned roofing over sand (B S) vs. sand deposited
    over burned roofing (S B), we would have weighted
    them heaviest, assigning each a distance of 2
    units.
  • The reason for weighting reversals most heavily is
    that they represent a significantly different
    behavioral and depositional sequence.
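One way to turn the three weighted transformation rules into a computable distance is a weighted edit distance with adjacent reversals. The sketch below is an assumption about how such a measure could be coded, not the procedure used in the lecture: the function name `strat_distance` and its parameter names are hypothetical, each stratum is one character of a string, and the result is not guaranteed to match every distance the author assigned for the Broken Flute Cave data.

```python
def strat_distance(a, b, indel=1.0, change=1.5, reversal=2.0):
    """Minimum weighted cost to transform sequence a into sequence b.
    indel: add or delete a stratum; change: change in kind of stratum;
    reversal: swap of two adjacent strata (weights as stated on the slides)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel          # delete all of a's first i strata
    for j in range(1, n + 1):
        d[0][j] = j * indel          # add all of b's first j strata
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else change
            best = min(d[i - 1][j] + indel,      # deletion
                       d[i][j - 1] + indel,      # addition
                       d[i - 1][j - 1] + sub)    # same stratum or change in kind
            if (i > 1 and j > 1 and a[i - 1] != a[i - 2]
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + reversal)  # reversal of order
            d[i][j] = best
    return d[m][n]

add_trash = strat_distance("NB", "NBT")     # one addition
floor_kind = strat_distance("NBT", "MBT")   # one change in kind
swapped = strat_distance("BS", "SB")        # one reversal of order
```

Keeping the weights as parameters makes the subjective part of the measure explicit, so different weightings can be tried on the same data.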

44
Distances Between Stratigraphic Sequences at
Broken Flute Cave
      x1   x2   x3   x4   x5   x6   x7   x8   x8a  x9   x11  x12  sum
x1    0    0    1    1    3    2.5  2.5  2.5  2    3.5  3    1    21.5
x2    0    0    1    1    3    2.5  2.5  2.5  2    3.5  3    1    21.5
x3    1    1    0    0    3    1.5  1.5  1.5  3    2.5  3    0    18
x4    1    1    0    0    3    1.5  1.5  1.5  3    2.5  3    0    18
x5    3    3    3    3    0    4.5  4.5  4.5  1    4.5  1    3    35
x6    2.5  2.5  1.5  1.5  4.5  0    0    0    3.5  1    4.5  1.5  22
x7    2.5  2.5  1.5  1.5  4.5  0    0    0    3.5  1    4.5  1.5  22
x8    2.5  2.5  1.5  1.5  4.5  0    0    0    3.5  1    4.5  1.5  22
x8a   2    2    3    2    1    3.5  3.5  3.5  0    4.5  2.5  2    29.5
x9    3.5  3.5  2.5  2.5  4.5  1    1    1    4.5  0    4.5  2.5  31
x11   3    3    3    3    1    4.5  4.5  4.5  2.5  4.5  0    3    35.5
x12   1    1    0    0    3    1.5  1.5  1.5  3    2.5  3    0    18
45
Selection of Initial Cluster Centers
We chose the four initial cluster centers randomly
as x11, x5, x9, x8a.
      cc1(1)=x11  cc2(1)=x5  cc3(1)=x9  cc4(1)=x8a  Assign
x1    3           3          3.5        2           cl4(1)
x2    3           3          3.5        2           cl4(1)
x3    3           3          2.5        3           cl3(1)
x4    3           3          2.5        3           cl3(1)
x5    1           0          4.5        1           cl2(1)
x6    4.5         4.5        1          3.5         cl3(1)
x7    4.5         4.5        1          3.5         cl3(1)
x8    4.5         4.5        1          3.5         cl3(1)
x8a   2.5         1          4.5        0           cl4(1)
x9    4.5         4.5        0          4.5         cl3(1)
x11   0           1          4.5        2.5         cl1(1)
x12   3           3          2.5        3           cl3(1)
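The assignment column above is just a nearest-center lookup. The short sketch below repeats that step using the distances exactly as listed on this slide; the variable names (`to_centers`, `assignment`, `clusters`) are chosen for the illustration only.

```python
# Initial cluster centers, in the order cc1(1), cc2(1), cc3(1), cc4(1).
centers = ["x11", "x5", "x9", "x8a"]

# Distance from each sample to each center, copied from the slide.
to_centers = {
    "x1":  [3, 3, 3.5, 2],
    "x2":  [3, 3, 3.5, 2],
    "x3":  [3, 3, 2.5, 3],
    "x4":  [3, 3, 2.5, 3],
    "x5":  [1, 0, 4.5, 1],
    "x6":  [4.5, 4.5, 1, 3.5],
    "x7":  [4.5, 4.5, 1, 3.5],
    "x8":  [4.5, 4.5, 1, 3.5],
    "x8a": [2.5, 1, 4.5, 0],
    "x9":  [4.5, 4.5, 0, 4.5],
    "x11": [0, 1, 4.5, 2.5],
    "x12": [3, 3, 2.5, 3],
}

# Assign each sample to the cluster (numbered 1 to 4) of its nearest center.
assignment = {s: min(range(4), key=lambda c: d[c]) + 1 for s, d in to_centers.items()}

# Group the samples by assigned cluster.
clusters = {}
for sample, c in assignment.items():
    clusters.setdefault(c, []).append(sample)
```

The resulting groups are the first trial clusters cl1(1) to cl4(1); the next step would pick a new center in each group by minimum row sum.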
46
Continuing with the iterations until convergence
gives the final four clusters as
Cl1: x11 (NSUT)
Cl2: x5, x8a (NSBT, NSB)
Cl3: x6, x7, x8, x9 (MBT, MBT, MBT, MBTD)
Cl4: x1, x2, x3, x4, x12 (NB, NB, NBT, NBT, NBT)
We will not take time to discuss the
archaeological significance of this result.
47
Advantages of the proposed method
1. Can obtain clusters for non-quantitative data
typical of archaeological data obtained in field
work.
2. Since the new cluster center of each
cluster is always a member of the data set, distances
between samples need only be computed once and
recalled from memory when needed, reducing
computation.
3. Rerunning the algorithm provides
different interpretations of the data.
4. Some structural information is provided by the
resulting cluster centers.
48
Disadvantages of the proposed method
1. Can converge to a local minimum.
2. Must be run several times with different random
initial cluster centers.
3. Results are dependent on the
subjective distance measures used.
4. Will always find clusters whether there are any
physical clusters existing or not.
5. With a very large data set (N data points), requires
storage of N(N-1)/2 distances or recalculation of
distances.
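The compute-once advantage and the N(N-1)/2 storage cost are two sides of the same table. A minimal sketch, assuming a symmetric distance measure: the helper names `build_distance_cache` and `lookup` are hypothetical, and the length-difference distance is only a stand-in for a real measure.

```python
from itertools import combinations

def build_distance_cache(samples, distance):
    """Compute each unordered pair once; stores N(N-1)/2 entries keyed by (i, j), i < j."""
    return {(i, j): distance(samples[i], samples[j])
            for i, j in combinations(range(len(samples)), 2)}

def lookup(cache, i, j):
    """Recall a stored distance; assumes distance(a, b) == distance(b, a)."""
    if i == j:
        return 0.0
    return cache[(i, j) if i < j else (j, i)]

# The twelve stratigraphic sequences from the Broken Flute Cave data set.
sequences = ["NB", "NB", "NBT", "NBT", "NSBT", "MBT",
             "MBT", "MBT", "NSB", "MBTD", "NSUT", "NBT"]

# Placeholder distance (difference in length), used only to fill the cache.
cache = build_distance_cache(sequences, lambda a, b: abs(len(a) - len(b)))
stored = len(cache)  # 12 * 11 / 2 pairs
```

For N = 12 this is a small table, but the quadratic growth is what makes storage a concern for very large data sets.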
49
Lecture Summary
  1. Discussed the General Concept of Clustering
  2. Presented definitions of Types of Data
  3. Reviewed the K-Means Clustering Algorithm
  4. Described Clustering Method for Non-Quantitative
    data
  5. Presented two examples illustrating the method.
  6. Discussed advantages and disadvantages of the
    proposed clustering method

50
Thank you for your attention and I am happy to
answer any questions you might have regarding
this presentation.
51
End of Lecture