Title: Clustering.
1. Lecture 5
COMP4044 Data Mining and Machine Learning
COMP5318 Knowledge Discovery and Data Mining
- Clustering.
- K-means. Nearest Neighbor.
- Hierarchical clustering.
- Reference: Dunham, pp. 125-142
2. Outline
- Introduction to clustering
- Examples
- Taxonomy of clustering algorithms
- What is a good clustering
- Characteristics of a cluster
- Distance between clusters
- K-means clustering algorithm
- Nearest Neighbor clustering algorithm
- Hierarchical clustering
- Agglomerative hierarchical algorithms
- Single link, complete link, average link
- Divisive hierarchical algorithm
3. What is Clustering?
- Clustering is the process of grouping the data into classes (clusters) so that the data objects (examples) are
  - similar to one another within the same cluster
  - dissimilar to the objects in other clusters
- Clustering is unsupervised classification: there are no predefined classes
- Given: a set of unlabeled examples (input vectors) p_i and k, the desired number of clusters
- Task: cluster (group) the examples into k clusters
4. Clustering: Formal Definition
- DEF: Given a database P = {p1, ..., pn} of tuples (items, records, examples, instances) and an integer k, the clustering problem is to define a mapping f: P -> {1, ..., k} where each pi is assigned to one cluster Kj, 1 <= j <= k
- Result of solving a clustering problem: a set of clusters K = {K1, K2, ..., Kk}
5. Typical Clustering Applications
- As a stand-alone tool to
- get insight into data distribution
- find the characteristics of each cluster
- assign the cluster of a new example
- As a preprocessing step for other algorithms
- e.g. dimensionality reduction using cluster
centers to represent data in clusters
6. Clustering Example: Stars
- Star clustering based on temperature and brightness (Hertzsprung-Russell diagram)
- The 3 clusters represent stars in 3 different phases of their life
- Astronomers had to perform clustering to identify these categories
- Well-defined clusters
From: Data Mining Techniques, M. Berry, G. Linoff, John Wiley and Sons Publ.
7. Clustering Example: Houses
- A given dataset may be clustered on different attributes
8. Clustering Example: Animals
- 16 animals described with 13 binary attributes
9. Clustering Example: Fitting the Troops
- Fitting the troops: re-design of uniforms for female soldiers in the US army
- Goal: reduce the number of uniform sizes to be kept in inventory while still providing a good fit
- Researchers from Cornell University used clustering and designed a new set of sizes
- Traditional clothing size system: an ordered set of graduated sizes where all dimensions increase together
- The new system: sizes that fit body types
  - E.g. one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks
10. Other Examples of Clustering Applications
- Marketing
  - help discover distinct groups of customers, and then use this knowledge to develop targeted marketing programs
- Biology
  - derive plant and animal taxonomies
  - find genes with similar function
- Land use
  - identify areas of similar land use in an earth observation database
- Insurance
  - identify groups of motor insurance policy holders with a high average claim cost
- City planning
  - identify groups of houses according to their house type, value, and geographical location
11. Clustering: Important Features
- The best number of clusters is not known
- There is no one correct answer to a clustering problem
  - a domain expert may be required
- Interpreting the semantic meaning of each cluster is difficult
  - What are the characteristics that the items have in common?
  - A domain expert is needed
- Cluster results are dynamic (change over time) if the data is dynamic
  - e.g. clustering web logs for patterns of usage
12. Taxonomy of Clustering Algorithms
13. Classification of Clustering Algorithms (cont.)
- Hierarchical clustering
  - creates a nested set of clusters
  - each level in the hierarchy has a separate set of clusters
  - lowest level: each item is in its own cluster
  - highest level: all items form one cluster
  - The desired number of clusters k is not an input
  - Agglomerative: bottom-up creation of the clusters
  - Divisive: top-down
From: Empirical Evaluation of Clustering Algorithms, A. Rauber, J. Paralic, E. Pampalk, JIOS, 24(2), 2000.
14. Classification of Clustering Algorithms (cont. 2)
- Partitional
  - creates only one set of clusters
  - Requires the number of clusters k to be pre-specified
  - Examples: k-means, nearest neighbor, Self-Organising Maps (SOM)
Clustering using SOM (figure)
From: Empirical Evaluation of Clustering Algorithms, A. Rauber, J. Paralic, E. Pampalk, JIOS, 24(2), 2000.
15. Classification of Clustering Algorithms (cont. 3)
- Categorical and large-database algorithms
  - Traditional algorithms do not deal with categorical (nominal) data and are typically applied to small datasets that fit in memory
  - Some of the more recent clustering algorithms address these issues (typically by sampling the data or using efficient data structures)
- Other criteria to classify clustering algorithms
  - Produce overlapping or non-overlapping clusters
  - Serial (incremental) or simultaneous: items are examined one by one or all together
  - Monothetic or polythetic: examine one or many attribute values at a time
16. What is a Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The similarity is measured using a distance function
- e.g. Davies-Bouldin (DB) index: a heuristic measure of the quality of the clustering; clusters are compared in pairs (see the formula sketch below)
  - c: number of clusters
  - D(x_i): mean squared distance from the points in cluster i to its center
  - D(x_i, x_j): distance between the centers of clusters i and j
- What is the DB index for a good clustering: big or small?
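The slide's formula itself is not reproduced in this text; a standard way to write the Davies-Bouldin index in the notation above (a reconstruction, worth checking against the original slide) is:

DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \frac{D(x_i) + D(x_j)}{D(x_i, x_j)}

Compact clusters (small D(x_i)) and well-separated centers (large D(x_i, x_j)) give a small value, so a good clustering has a small DB index.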
17. Characteristics of a Cluster
- Consider a cluster K of N points {p1, ..., pN}
- Centroid: the "middle" of the cluster
  - it does not need to be an actual data point in the cluster
- Medoid M: the centrally located data point (object) in the cluster
- Radius: square root of the mean squared distance from the points in the cluster to the centroid (see the formulas below)
- Diameter: square root of the mean squared distance between all pairs of points in the cluster
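Following the definitions above, a compact way to write these (a reconstruction; the slide's own formulas are not reproduced in this text):

centroid:  C_m = \frac{1}{N} \sum_{i=1}^{N} p_i

radius:    R_K = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - C_m)^2}

diameter:  D_K = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} (p_i - p_j)^2}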
18. Distance Between Clusters
Many interpretations (a small code sketch follows the list):
- Single link: the distance between 2 clusters is the smallest distance between an element in one cluster and an element in the other
- Complete link: the largest distance between an element in one cluster and an element in the other
- Average link: the average distance between each element in one cluster and each element in the other
- Centroid: the distance between the centroids
- Medoid: the distance between the medoids
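A minimal Python/NumPy sketch of these inter-cluster distances, assuming each cluster is given as a list of NumPy vectors and Euclidean distance is used (the function names are illustrative):

import numpy as np

def single_link(K1, K2):
    # smallest distance between an element of K1 and an element of K2
    return min(np.linalg.norm(p - q) for p in K1 for q in K2)

def complete_link(K1, K2):
    # largest distance between an element of K1 and an element of K2
    return max(np.linalg.norm(p - q) for p in K1 for q in K2)

def average_link(K1, K2):
    # average distance over all pairs (one element from each cluster)
    return float(np.mean([np.linalg.norm(p - q) for p in K1 for q in K2]))

def centroid_link(K1, K2):
    # distance between the centroids of the two clusters
    return np.linalg.norm(np.mean(K1, axis=0) - np.mean(K2, axis=0))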
19. Different Ways of Visualizing Clusters
- One way: a table with the degree of membership of each item (a-h) in each cluster (1, 2, 3)

Item    1     2     3
a       0.4   0.1   0.5
b       0.1   0.8   0.1
c       0.3   0.3   0.4
d       0.1   0.1   0.8
e       0.4   0.2   0.4
f       0.1   0.4   0.5
g       0.7   0.2   0.1
h       0.5   0.4   0.1
20. K-Means Clustering Algorithm
- Simple and very popular clustering algorithm
- An iterative, distance-based, partitional clustering method
- Requires the number of clusters k to be specified in advance
- Can be implemented in 4 steps (a code sketch follows the list)
- 1. Choose k seeds (vectors with the same dimensionality as the input examples; typically the first k examples are selected as seeds)
- 2. Take an example, calculate the distance from it to all seeds and assign it to the cluster with the nearest seed point
- 3. At the end of each epoch, compute the centroid (mean) of each cluster
- 4. If the stopping criterion is satisfied (no changes in the assignment of the examples, or the maximum number of epochs is reached), stop. Otherwise, repeat steps 2 and 3 with the new centroids taking the role of the seeds.
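A minimal NumPy sketch of these 4 steps, assuming Euclidean distance and the first k examples as seeds (names and default values are illustrative, not from the lecture):

import numpy as np

def k_means(X, k, max_epochs=100):
    # X: array of shape (n_examples, n_features); k: desired number of clusters
    centroids = X[:k].astype(float)             # step 1: first k examples as seeds
    assignment = np.full(len(X), -1)

    for epoch in range(max_epochs):
        # step 2: assign each example to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)

        # step 4: stop if no assignment changed, otherwise continue with new centroids
        if np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment

        # step 3: recompute the centroid (mean) of each cluster
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return centroids, assignment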
21. K-Means Algorithm: Example
- What is the output of k-means?
- How can we use it to find the cluster of a new
example?
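A small follow-on to these questions, using the hypothetical k_means sketch above: the output is the set of cluster centroids (plus the assignment of the training examples), and a new example is assigned to the cluster with the nearest centroid.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])   # toy data
centroids, assignment = k_means(X, k=2)

x_new = np.array([8.5, 7.5])
cluster_of_new = int(np.linalg.norm(centroids - x_new, axis=1).argmin())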
22. K-Means Algorithm: Pseudo Code
23. K-means: Issues
- Different distance measures can be used
  - typically Euclidean distance is used
- Data should be normalized
- Typically produces good results
- Computationally expensive, does not scale well
  - Involves finding the distance from each example to each cluster center at each iteration
  - Time complexity: O(tkn), t: number of iterations, k: number of clusters, n: number of examples
- Not optimal: finds a local optimum, may miss the global one
- Standard k-means does not work on nominal data
  - Calculating distance for nominal feature vectors is problematic
  - Defining a mean for nominal attributes is problematic
  - There are variations of k-means that handle nominal data (e.g. k-modes)
24. K-means: Issues (cont.)
- What type of clusters does k-means produce: convex-shaped or non-convex-shaped?
- Convex region (hull): a region in which any point can be connected to any other point by a straight line that does not cross the boundary of the region
25. K-means: Variations
- Improving the chances of k-means to find the global minimum
  - Different ways to initialize the seeds
  - Careful selection of the number of clusters
  - Using weights based on how close the example is to the cluster center (Gaussian mixture models)
  - Allowing clusters to split and merge
    - Split if the variance within a cluster is large
    - Merge if the distance between cluster centers is smaller than a threshold
- Making it scale better
  - Save distance information from one iteration to the next, thus reducing the number of calculations
- Typical values of k: 2 to 10
- K-means can be used for hierarchical clustering (see the sketch below)
  - Start with k=2 and repeat recursively within each cluster
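A sketch of using k-means recursively in this way (sometimes called bisecting k-means), reusing the hypothetical k_means function from the earlier sketch; the depth parameter is illustrative:

import numpy as np

def bisecting_k_means(X, depth):
    # split the data with k=2 and recurse within each cluster until depth is 0
    if depth == 0 or len(X) < 2:
        return [X]
    centroids, assignment = k_means(X, k=2)
    clusters = []
    for j in range(2):
        members = X[assignment == j]
        if len(members) > 0:
            clusters.extend(bisecting_k_means(members, depth - 1))
    return clusters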
26. Nearest Neighbor Clustering Algorithm
- A new instance forms a new cluster or is merged into an existing one depending on how close it is to the existing clusters
  - a threshold t determines whether to merge or to create a new cluster (a code sketch follows below)
- // t1 is placed in a cluster by itself
- // t2..tn: is each item added to an existing cluster or placed in a new cluster?
- Time complexity: O(n^2), n: number of items
  - Each item is compared to each item already placed in a cluster
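A minimal sketch of this procedure, assuming the items are NumPy vectors and Euclidean distance is used (names are illustrative):

import numpy as np

def nearest_neighbor_clustering(items, t):
    clusters = [[items[0]]]                    # t1 is placed in a cluster by itself
    for item in items[1:]:                     # t2..tn
        # find the closest item among those already placed in clusters
        best_cluster, best_dist = None, float('inf')
        for cluster in clusters:
            for member in cluster:
                d = np.linalg.norm(item - member)
                if d < best_dist:
                    best_dist, best_cluster = d, cluster
        if best_dist <= t:
            best_cluster.append(item)          # merge into the nearest cluster
        else:
            clusters.append([item])            # otherwise start a new cluster
    return clusters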
27. Nearest Neighbor Clustering: Example
- Given: 5 items A-E with the distances between them
- Task: cluster them using the Nearest Neighbor algorithm with a threshold t = 2
- A: K1 = {A}
- B: d(B,A) = 1 <= t  =>  K1 = {A,B}
- C: d(C,A) = d(C,B) = 2 <= t  =>  K1 = {A,B,C}
- D: d(D,A) = 2, d(D,B) = 4, d(D,C) = 1; dmin = 1 <= t  =>  K1 = {A,B,C,D}
- E: d(E,A) = 3, d(E,B) = 3, d(E,C) = 5, d(E,D) = 3; dmin = 3 > t  =>  K2 = {E}
28. Hierarchical Clustering
- Creates not one set of clusters but several sets of clusters
- The desired number of clusters k is not an input
- The hierarchy of clusters can be represented as a tree structure called a dendrogram
- Leaves of the dendrogram consist of 1 item each
  - each item is in its own cluster
- The root of the dendrogram contains all items
  - all items form one cluster
- Internal nodes represent clusters formed by merging the clusters of their children
- Each level is associated with the distance threshold that was used to merge the clusters
  - If the distance between 2 clusters was smaller than the threshold, they were merged
29. Dendrogram Representation
- A set of ordered triples (d, k, K)
  - d: threshold value
  - k: number of clusters
  - K: the set of clusters
- Example (see the sketch below):
  - (0, 5, {{A}, {B}, {C}, {D}, {E}})
  - (1, 3, {{A,B}, {C,D}, {E}})
  - (2, 2, {{A,B,C,D}, {E}})
  - (3, 1, {{A,B,C,D,E}})
- Thus, the output is not one set of clusters but several. One can determine which of the sets to use.
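A minimal Python representation of this dendrogram as a list of (d, k, K) triples, with a small hypothetical helper for picking the level with a desired number of clusters:

dendrogram = [
    (0, 5, [{'A'}, {'B'}, {'C'}, {'D'}, {'E'}]),
    (1, 3, [{'A', 'B'}, {'C', 'D'}, {'E'}]),
    (2, 2, [{'A', 'B', 'C', 'D'}, {'E'}]),
    (3, 1, [{'A', 'B', 'C', 'D', 'E'}]),
]

def clusters_at(dendrogram, k):
    # return the set of clusters at the level that has exactly k clusters
    for d, num_clusters, K in dendrogram:
        if num_clusters == k:
            return K
    return None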
30. Agglomerative vs Divisive Clustering
- Agglomerative
  - Start with each item in its own cluster; iteratively merge clusters until all items belong to one cluster
  - Merging is based on how close the clusters are to each other
    - Calculating the distance between clusters: single link, complete link, average link
  - Distance threshold d: if the distance between two clusters is smaller than or equal to d, merge them
  - Initially d is set to a small value that is incremented at each level
- Divisive
  - Place all items in one cluster; iteratively split clusters in two until all items are in their own cluster
  - Splitting is based on the distance between clusters: split if the distance is smaller than or equal to the threshold d
  - Initially d is set to a big value that is decremented at each level
31. Agglomerative Algorithms: Pseudo Code
- Different algorithms merge clusters at each level differently (procedure NewClusters); a generic sketch follows the list
  - Merge only 2 clusters, or more?
  - If there are several clusters with identical distances, which ones to merge?
  - How to determine the distance between clusters?
    - single link
    - complete link
    - average link
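A sketch of a generic agglomerative algorithm in this spirit, parameterised by an inter-cluster distance (e.g. the single_link / complete_link / average_link functions sketched earlier); the step by which d is incremented is illustrative:

def agglomerative(items, cluster_distance, d_step=1):
    # start with each item in its own cluster
    clusters = [[item] for item in items]
    d = 0
    dendrogram = [(d, len(clusters), [list(c) for c in clusters])]
    while len(clusters) > 1:
        d += d_step                            # next distance level
        merged = True
        while merged:                          # NewClusters: merge clusters within distance d
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if cluster_distance(clusters[i], clusters[j]) <= d:
                        clusters[i] = clusters[i] + clusters[j]
                        del clusters[j]
                        merged = True
                        break
                if merged:
                    break
        dendrogram.append((d, len(clusters), [list(c) for c in clusters]))
    return dendrogram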
32. NewClusters Procedure
- NewClusters typically finds all the clusters that are within distance d from each other (according to the distance measure used), merges them, and updates the adjacency matrix
- Example
  - Given: 5 items with the distances between them
  - Task: cluster them using agglomerative single link clustering
33. Example: Solution 1
- Distance level 1: merge {A,B} and {C,D}; update the adjacency matrix
- Distance level 2: merge {A,B,C,D}; update the adjacency matrix
- Distance level 3: merge {A,B,C,D,E}; all items are in one cluster, stop
- Dendrogram (figure)
34. Single Link Algorithm as a Graph Problem
- NewClusters can be replaced with a procedure for finding connected components in a graph (a sketch follows below)
  - two vertices of a graph are connected if there exists a path between them
- Examples
  - A and B are connected; A, B, C and D are connected
  - C and D are connected
- Show the graph edges with a distance of d or below
- Merge 2 clusters if there is at least 1 edge that connects them (i.e. if the minimum distance between any 2 points is <= d)
- Increment d
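A sketch of NewClusters as a connected-components search, assuming the input is a symmetric distance (adjacency) matrix and the current threshold d (names are illustrative):

def connected_components(dist_matrix, d):
    # vertices i and j are joined by an edge if dist_matrix[i][j] <= d;
    # each connected component of the resulting graph is one cluster
    n = len(dist_matrix)
    unvisited = set(range(n))
    components = []
    while unvisited:
        start = unvisited.pop()
        stack, component = [start], {start}
        while stack:
            v = stack.pop()
            for u in list(unvisited):
                if dist_matrix[v][u] <= d:
                    unvisited.remove(u)
                    component.add(u)
                    stack.append(u)
        components.append(sorted(component))
    return components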
35. Example: Solution 2
- Procedure NewClusters
  - Input: a graph defined by a set of vertices and a vertex adjacency matrix
  - Output: a set of connected components, defined by the number of these components (i.e. the number of clusters k) and an array with the membership of these components (i.e. K, the set of clusters)
Single link dendrogram (figure)
36. Single Link vs. Complete Link Algorithm
- Single link suffers from the so-called chain effect
  - 2 clusters are merged if only 2 of their points are close to each other
  - there may be points in the 2 clusters that are far from each other, but this has no effect on the algorithm
  - Thus the clusters may contain points that are not related to each other but simply happen to be near points that are close to each other
- Complete link: the distance between 2 clusters is the largest distance between an element in one cluster and an element in the other
  - Generates more compact clusters
- Dendrogram for the example
37. Average Link
- Average link: the distance between 2 clusters is the average distance between an element in one cluster and an element in the other
- For our example (the threshold increment can be set arbitrarily, e.g. to 1)
38. Divisive Clustering
- All items are initially placed in one cluster
- Clusters are iteratively split in two until all items are in their own cluster
- In reverse order: from e to b
39. Applicability and Complexity
- Hierarchical clustering algorithms are suitable for domains with natural nesting relationships between clusters
  - Biology: plant and animal taxonomies can be viewed as a hierarchy of clusters
- Space complexity of the algorithm: O(n^2), n: number of items
  - the space required to store the adjacency (distance) matrix
- Space complexity of the dendrogram: O(kn), k: number of levels
- Time complexity of the algorithm: O(kn^2), 1 iteration for each level of the dendrogram
- Not incremental: assumes all the data is present