Title: Clustering.
1. Lecture 5
COMP4044 Data Mining and Machine Learning
COMP5318 Knowledge Discovery and Data Mining
- Clustering.
- K-means. Nearest Neighbor.
- Hierarchical clustering.
- Reference: Dunham, pp. 125-142
2. Outline
- Introduction to clustering
- Examples
- Taxonomy of clustering algorithms
- What is a good clustering
- Characteristics of a cluster
- Distance between clusters
- K-means clustering algorithm
- Nearest Neighbor clustering algorithm
- Hierarchical clustering
- Agglomerative hierarchical algorithms
- Single link, complete link, average link
- Divisive hierarchical algorithm
3. What is Clustering?
- Clustering is the process of grouping the data into classes (clusters) so that the data objects (examples) are
  - similar to one another within the same cluster
  - dissimilar to the objects in other clusters
- Clustering is unsupervised classification: there are no predefined classes
- Given: a set of unlabeled examples (input vectors) p_i and k, the desired number of clusters
- Task: cluster (group) the examples into k clusters
4. Clustering: Formal Definition
- DEF: Given a database P = {p1, ..., pn} of tuples (items, records, examples, instances) and an integer k, the clustering problem is to define a mapping f: P -> {1, ..., k} where each pi is assigned to one cluster Kj, 1 <= j <= k
- Result of solving a clustering problem: a set of clusters K = {K1, K2, ..., Kk}
5. Typical Clustering Applications
- As a stand-alone tool to
- get insight into data distribution
- find the characteristics of each cluster
- assign the cluster of a new example
- As a preprocessing step for other algorithms
- e.g. dimensionality reduction using cluster
centers to represent data in clusters
6. Clustering Example: Stars
- Star clustering based on temperature and brightness (Hertzsprung-Russell diagram)
- The 3 clusters represent stars in 3 different phases of their life
- Astronomers had to perform clustering to identify these categories
- Well-defined clusters
From: Data Mining Techniques, M. Berry, G. Linoff, John Wiley and Sons Publ.
7. Clustering Example: Houses
- A given dataset may be clustered on different attributes
8. Clustering Example: Animals
- 16 animals described with 13 binary attributes
9. Clustering Example: Fitting the Troops
- Fitting the troops: re-design of uniforms for female soldiers in the US army
- Goal: reduce the number of uniform sizes to be kept in inventory while still providing a good fit
- Researchers from Cornell University used clustering and designed a new set of sizes
- Traditional clothing size system: an ordered set of graduated sizes where all dimensions increase together
- The new system: sizes that fit body types
  - E.g. one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks
10. Other Examples of Clustering Applications
- Marketing
  - help discover distinct groups of customers, and then use this knowledge to develop targeted marketing programs
- Biology
  - derive plant and animal taxonomies
  - find genes with similar function
- Land use
  - identify areas of similar land use in an earth observation database
- Insurance
  - identify groups of motor insurance policy holders with a high average claim cost
- City planning
  - identify groups of houses according to their house type, value, and geographical location
11. Clustering: Important Features
- The best number of clusters is not known
- There is no one correct answer to a clustering problem
  - a domain expert may be required
- Interpreting the semantic meaning of each cluster is difficult
  - What are the characteristics that the items have in common?
  - A domain expert is needed
- Cluster results are dynamic (change over time) if the data is dynamic
  - e.g. clustering web logs for patterns of usage
12. Taxonomy of Clustering Algorithms
13. Classification of Clustering Algorithms (cont.)
- Hierarchical clustering
  - creates a nested set of clusters
  - each level in the hierarchy has a separate set of clusters
  - lowest level: each item is in its own cluster
  - highest level: all items form one cluster
  - The desired number of clusters k is not an input
  - Agglomerative: bottom-up creation of the clusters
  - Divisive: top-down
From: Empirical Evaluation of Clustering Algorithms, A. Rauber, J. Paralic, E. Pampalk, JIOS, 24(2), 2000.
14. Classification of Clustering Algorithms (cont. 2)
- Partitional
  - creates only one set of clusters
  - Requires the number of clusters k to be pre-specified
  - Examples: k-means, nearest neighbor, Self-Organising Maps (SOM)
Clustering using SOM (figure)
From: Empirical Evaluation of Clustering Algorithms, A. Rauber, J. Paralic, E. Pampalk, JIOS, 24(2), 2000.
15. Classification of Clustering Algorithms (cont. 3)
- Categorical and large-database algorithms
  - Traditional algorithms do not deal with categorical (nominal) data and are typically applied to small datasets that fit in memory
  - Some of the more recent clustering algorithms address these issues (typically by sampling the data or using efficient data structures)
- Other criteria to classify clustering algorithms
  - Produce overlapping or non-overlapping clusters
  - Serial (incremental) or simultaneous: items are examined one by one or all together
  - Monothetic or polythetic: examine one or many attribute values at a time
16. What is a Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The similarity is measured using a distance function
- e.g. Davies-Bouldin (DB) index: a heuristic measure of the quality of the clustering; clusters are compared in pairs (see the formula sketch below)
  - c: number of clusters
  - D(x_i): mean squared distance from the points in cluster i to its center
  - D(x_i, x_j): distance between the centers of clusters i and j
- What is the DB index for a good clustering: big or small?
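The slide's formula itself is not reproduced in this text; a standard way to write the Davies-Bouldin index in the notation above (a reconstruction, worth checking against the original slide) is:

DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \frac{D(x_i) + D(x_j)}{D(x_i, x_j)}

Compact clusters (small D(x_i)) and well-separated centers (large D(x_i, x_j)) give a small value, so a good clustering has a small DB index.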
17. Characteristics of a Cluster
- Consider a cluster K of N points {p1, ..., pN}
- Centroid: the "middle" of the cluster
  - it does not need to be an actual data point in the cluster
- Medoid M: the centrally located data point (object) in the cluster
- Radius: square root of the mean squared distance from the points in the cluster to the centroid (see the formulas below)
- Diameter: square root of the mean squared distance between all pairs of points in the cluster
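Following the definitions above, a compact way to write these (a reconstruction; the slide's own formulas are not reproduced in this text):

centroid:  C_m = \frac{1}{N} \sum_{i=1}^{N} p_i

radius:    R_K = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - C_m)^2}

diameter:  D_K = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} (p_i - p_j)^2}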
18. Distance Between Clusters
Many interpretations (a small code sketch follows the list):
- Single link: the distance between 2 clusters is the smallest distance between an element in one cluster and an element in the other
- Complete link: the largest distance between an element in one cluster and an element in the other
- Average link: the average distance between each element in one cluster and each element in the other
- Centroid: the distance between the centroids
- Medoid: the distance between the medoids
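A minimal Python/NumPy sketch of these inter-cluster distances, assuming each cluster is given as a list of NumPy vectors and Euclidean distance is used (the function names are illustrative):

import numpy as np

def single_link(K1, K2):
    # smallest distance between an element of K1 and an element of K2
    return min(np.linalg.norm(p - q) for p in K1 for q in K2)

def complete_link(K1, K2):
    # largest distance between an element of K1 and an element of K2
    return max(np.linalg.norm(p - q) for p in K1 for q in K2)

def average_link(K1, K2):
    # average distance over all pairs (one element from each cluster)
    return float(np.mean([np.linalg.norm(p - q) for p in K1 for q in K2]))

def centroid_link(K1, K2):
    # distance between the centroids of the two clusters
    return np.linalg.norm(np.mean(K1, axis=0) - np.mean(K2, axis=0))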
19. Different Ways of Visualizing Clusters
- One way: a table with the degree of membership of each item (a-h) in each cluster (1, 2, 3)

Item    1     2     3
a       0.4   0.1   0.5
b       0.1   0.8   0.1
c       0.3   0.3   0.4
d       0.1   0.1   0.8
e       0.4   0.2   0.4
f       0.1   0.4   0.5
g       0.7   0.2   0.1
h       0.5   0.4   0.1
20. K-Means Clustering Algorithm
- Simple and very popular clustering algorithm
- An iterative, distance-based, partitional clustering method
- Requires the number of clusters k to be specified in advance
- Can be implemented in 4 steps (a code sketch follows the list)
- 1. Choose k seeds (vectors with the same dimensionality as the input examples; typically the first k examples are selected as seeds)
- 2. Take an example, calculate the distance from it to all seeds and assign it to the cluster with the nearest seed point
- 3. At the end of each epoch, compute the centroid (mean) of each cluster
- 4. If the stopping criterion is satisfied (no changes in the assignment of the examples, or the maximum number of epochs is reached), stop. Otherwise, repeat steps 2 and 3 with the new centroids taking the role of the seeds.
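A minimal NumPy sketch of these 4 steps, assuming Euclidean distance and the first k examples as seeds (names and default values are illustrative, not from the lecture):

import numpy as np

def k_means(X, k, max_epochs=100):
    # X: array of shape (n_examples, n_features); k: desired number of clusters
    centroids = X[:k].astype(float)             # step 1: first k examples as seeds
    assignment = np.full(len(X), -1)

    for epoch in range(max_epochs):
        # step 2: assign each example to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)

        # step 4: stop if no assignment changed, otherwise continue with new centroids
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment

        # step 3: recompute the centroid (mean) of each cluster
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return centroids, assignment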
21. K-Means Algorithm: Example
- What is the output of k-means?
- How can we use it to find the cluster of a new
example?
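A small follow-on to these questions, using the hypothetical k_means sketch above: the output is the set of cluster centroids (plus the assignment of the training examples), and a new example is assigned to the cluster with the nearest centroid.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])   # toy data
centroids, assignment = k_means(X, k=2)

x_new = np.array([8.5, 7.5])
cluster_of_new = int(np.linalg.norm(centroids - x_new, axis=1).argmin())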
22. K-Means Algorithm: Pseudo Code
23. K-means: Issues
- Different distance measures can be used
  - typically Euclidean distance is used
- Data should be normalized
- Typically produces good results
- Computationally expensive, does not scale well
  - Involves finding the distance from each example to each cluster center at each iteration
  - Time complexity: O(tkn), t: number of iterations, k: number of clusters, n: number of examples
- Not optimal: finds a local optimum, may miss the global one
- Standard k-means does not work on nominal data
  - Calculating distance for nominal feature vectors is problematic
  - Defining a mean for nominal attributes is problematic
  - There are variations of k-means that handle nominal data (e.g. k-modes)
24. K-means: Issues (cont.)
- What type of clusters does k-means produce: convex-shaped or non-convex-shaped?
- Convex region (hull): a region in which any point can be connected to any other point by a straight line that does not cross the boundary of the region
25. K-means: Variations
- Improving the chances of k-means to find the global minimum
  - Different ways to initialize the seeds
  - Careful selection of the number of clusters
  - Using weights based on how close the example is to the cluster center (Gaussian mixture models)
  - Allowing clusters to split and merge
    - Split if the variance within a cluster is large
    - Merge if the distance between cluster centers is smaller than a threshold
- Making it scale better
  - Save distance information from one iteration to the next, thus reducing the number of calculations
- Typical values of k: 2 to 10
- K-means can be used for hierarchical clustering (see the sketch below)
  - Start with k=2 and repeat recursively within each cluster
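A sketch of using k-means recursively in this way (sometimes called bisecting k-means), reusing the hypothetical k_means function from the earlier sketch; the depth parameter is illustrative:

import numpy as np

def bisecting_k_means(X, depth):
    # split the data with k=2 and recurse within each cluster until depth is 0
    if depth == 0 or len(X) < 2:
        return [X]
    centroids, assignment = k_means(X, k=2)
    clusters = []
    for j in range(2):
        members = X[assignment == j]
        if len(members) > 0:
            clusters.extend(bisecting_k_means(members, depth - 1))
    return clusters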
26. Nearest Neighbor Clustering Algorithm
- A new instance forms a new cluster or is merged into an existing one depending on how close it is to the existing clusters
  - a threshold t determines whether to merge or to create a new cluster (a code sketch follows below)
- // t1 is placed in a cluster by itself
- // t2..tn: is each item added to an existing cluster or placed in a new cluster?
- Time complexity: O(n^2), n: number of items
  - Each item is compared to each item already placed in a cluster
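A minimal sketch of this procedure, assuming the items are NumPy vectors and Euclidean distance is used (names are illustrative):

import numpy as np

def nearest_neighbor_clustering(items, t):
    clusters = [[items[0]]]                    # t1 is placed in a cluster by itself
    for item in items[1:]:                     # t2..tn
        # find the closest item among those already placed in clusters
        best_cluster, best_dist = None, float('inf')
        for cluster in clusters:
            for member in cluster:
                d = np.linalg.norm(item - member)
                if d < best_dist:
                    best_dist, best_cluster = d, cluster
        if best_dist <= t:
            best_cluster.append(item)          # merge into the nearest cluster
        else:
            clusters.append([item])            # otherwise start a new cluster
    return clusters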
27. Nearest Neighbor Clustering: Example
- Given: 5 items A-E with the distances between them
- Task: cluster them using the Nearest Neighbor algorithm with a threshold t = 2
- A: K1 = {A}
- B: d(B,A) = 1 <= t  =>  K1 = {A,B}
- C: d(C,A) = d(C,B) = 2 <= t  =>  K1 = {A,B,C}
- D: d(D,A) = 2, d(D,B) = 4, d(D,C) = 1; dmin = 1 <= t  =>  K1 = {A,B,C,D}
- E: d(E,A) = 3, d(E,B) = 3, d(E,C) = 5, d(E,D) = 3; dmin = 3 > t  =>  K2 = {E}
28. Hierarchical Clustering
- Creates not one set of clusters but several sets of clusters
- The desired number of clusters k is not an input
- The hierarchy of clusters can be represented as a tree structure called a dendrogram
- Leaves of the dendrogram consist of 1 item each
  - each item is in its own cluster
- The root of the dendrogram contains all items
  - all items form one cluster
- Internal nodes represent clusters formed by merging the clusters of their children
- Each level is associated with the distance threshold that was used to merge the clusters
  - If the distance between 2 clusters was smaller than the threshold, they were merged
29. Dendrogram Representation
- A set of ordered triples (d, k, K)
  - d: threshold value
  - k: number of clusters
  - K: the set of clusters
- Example (see the sketch below):
  - (0, 5, {{A}, {B}, {C}, {D}, {E}})
  - (1, 3, {{A,B}, {C,D}, {E}})
  - (2, 2, {{A,B,C,D}, {E}})
  - (3, 1, {{A,B,C,D,E}})
- Thus, the output is not one set of clusters but several. One can determine which of the sets to use.
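A minimal Python representation of this dendrogram as a list of (d, k, K) triples, with a small hypothetical helper for picking the level with a desired number of clusters:

dendrogram = [
    (0, 5, [{'A'}, {'B'}, {'C'}, {'D'}, {'E'}]),
    (1, 3, [{'A', 'B'}, {'C', 'D'}, {'E'}]),
    (2, 2, [{'A', 'B', 'C', 'D'}, {'E'}]),
    (3, 1, [{'A', 'B', 'C', 'D', 'E'}]),
]

def clusters_at(dendrogram, k):
    # return the set of clusters at the level that has exactly k clusters
    for d, num_clusters, K in dendrogram:
        if num_clusters == k:
            return K
    return None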
30. Agglomerative vs Divisive Clustering
- Agglomerative
  - Start with each item in its own cluster; iteratively merge clusters until all items belong to one cluster
  - Merging is based on how close the clusters are to each other
    - Calculating the distance between clusters: single link, complete link, average link
  - Distance threshold d: if the distance between two clusters is smaller than or equal to d, merge them
  - Initially d is set to a small value that is incremented at each level
- Divisive
  - Place all items in one cluster; iteratively split clusters in two until all items are in their own cluster
  - Splitting is based on the distance between clusters: split if the distance is smaller than or equal to the threshold d
  - Initially d is set to a big value that is decremented at each level
31. Agglomerative Algorithms: Pseudo Code
- Different algorithms merge clusters at each level differently (procedure NewClusters); a generic sketch follows the list
  - Merge only 2 clusters, or more?
  - If there are several clusters with identical distances, which ones to merge?
  - How to determine the distance between clusters?
    - single link
    - complete link
    - average link
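A sketch of a generic agglomerative algorithm in this spirit, parameterised by an inter-cluster distance (e.g. the single_link / complete_link / average_link functions sketched earlier); the step by which d is incremented is illustrative:

def agglomerative(items, cluster_distance, d_step=1):
    # start with each item in its own cluster
    clusters = [[item] for item in items]
    d = 0
    dendrogram = [(d, len(clusters), [list(c) for c in clusters])]
    while len(clusters) > 1:
        d += d_step                            # next distance level
        merged = True
        while merged:                          # NewClusters: merge clusters within distance d
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if cluster_distance(clusters[i], clusters[j]) <= d:
                        clusters[i] = clusters[i] + clusters[j]
                        del clusters[j]
                        merged = True
                        break
                if merged:
                    break
        dendrogram.append((d, len(clusters), [list(c) for c in clusters]))
    return dendrogram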
32. NewClusters Procedure
- NewClusters typically finds all the clusters that are within distance d from each other (according to the distance measure used), merges them, and updates the adjacency matrix
- Example
  - Given: 5 items with the distances between them
  - Task: cluster them using agglomerative single link clustering
33. Example: Solution 1
- Distance level 1: merge {A,B} and {C,D}; update the adjacency matrix
- Distance level 2: merge {A,B,C,D}; update the adjacency matrix
- Distance level 3: merge {A,B,C,D,E}; all items are in one cluster, stop
- Dendrogram (figure)
34. Single Link Algorithm as a Graph Problem
- NewClusters can be replaced with a procedure for finding connected components in a graph (a sketch follows below)
  - two vertices of a graph are connected if there exists a path between them
- Examples
  - A and B are connected; A, B, C and D are connected
  - C and D are connected
- Show the graph edges with a distance of d or below
- Merge 2 clusters if there is at least 1 edge that connects them (i.e. if the minimum distance between any 2 points is <= d)
- Increment d
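A sketch of NewClusters as a connected-components search, assuming the input is a symmetric distance (adjacency) matrix and the current threshold d (names are illustrative):

def connected_components(dist_matrix, d):
    # vertices i and j are joined by an edge if dist_matrix[i][j] <= d;
    # each connected component of the resulting graph is one cluster
    n = len(dist_matrix)
    unvisited = set(range(n))
    components = []
    while unvisited:
        start = unvisited.pop()
        stack, component = [start], {start}
        while stack:
            v = stack.pop()
            for u in list(unvisited):
                if dist_matrix[v][u] <= d:
                    unvisited.remove(u)
                    component.add(u)
                    stack.append(u)
        components.append(sorted(component))
    return components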
35. Example: Solution 2
- Procedure NewClusters
  - Input: a graph defined by a set of vertices and a vertex adjacency matrix
  - Output: a set of connected components, defined by the number of these components (i.e. the number of clusters k) and an array with the membership of these components (i.e. K, the set of clusters)
Single link dendrogram (figure)
36. Single Link vs. Complete Link Algorithm
- Single link suffers from the so-called chain effect
  - 2 clusters are merged if only 2 of their points are close to each other
  - there may be points in the 2 clusters that are far from each other, but this has no effect on the algorithm
  - Thus the clusters may contain points that are not related to each other but simply happen to be near points that are close to each other
- Complete link: the distance between 2 clusters is the largest distance between an element in one cluster and an element in the other
  - Generates more compact clusters
- Dendrogram for the example
37. Average Link
- Average link: the distance between 2 clusters is the average distance between an element in one cluster and an element in the other
- For our example (the threshold increment can be set arbitrarily, e.g. to 1)
38. Divisive Clustering
- All items are initially placed in one cluster
- Clusters are iteratively split in two until all items are in their own cluster
- In reverse order: from e to b
39. Applicability and Complexity
- Hierarchical clustering algorithms are suitable for domains with natural nesting relationships between clusters
  - Biology: plant and animal taxonomies can be viewed as a hierarchy of clusters
- Space complexity of the algorithm: O(n^2), n: number of items
  - the space required to store the adjacency (distance) matrix
- Space complexity of the dendrogram: O(kn), k: number of levels
- Time complexity of the algorithm: O(kn^2), 1 iteration for each level of the dendrogram
- Not incremental: assumes all the data is present