1
Lecture 5: COMP4044 Data Mining and Machine
Learning / COMP5318 Knowledge Discovery and Data
Mining
  • Clustering.
  • K-means. Nearest Neighbor.
  • Hierarchical clustering.
  • Reference: Dunham, pp. 125-142

2
Outline
  • Introduction to clustering
  • Examples
  • Taxonomy of clustering algorithms
  • What is a good clustering?
  • Characteristics of a cluster
  • Distance between clusters
  • K-means clustering algorithm
  • Nearest Neighbor clustering algorithm
  • Hierarchical clustering
  • Agglomerative hierarchical algorithms
  • Single link, complete link, average link
  • Divisive hierarchical algorithm

3
What is Clustering?
  • Clustering: the process of grouping the data
    into classes (clusters) so that the data objects
    (examples) are
  • similar to one another within the same cluster
  • dissimilar to the objects in other clusters
  • Clustering is unsupervised classification: no
    predefined classes
  • Given: a set of unlabeled examples (input
    vectors) pi, and k, the desired number of clusters
  • Task: cluster (group) the examples into k clusters

4
Clustering Formal Definition
  • DEF: Given a database P = {p1, ..., pn} of tuples
    (items, records, examples, instances) and an
    integer k, the clustering problem is to define a
    mapping f: P -> {1, ..., k} where each pi is assigned
    to one cluster Kj, 1 <= j <= k
  • Result of solving a clustering problem: a set of
    clusters K = {K1, K2, ..., Kk}
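  • For example (an illustrative assignment, not from
    the original slides): with P = {p1, ..., p5} and k = 2,
    the mapping f(p1) = f(p3) = 1, f(p2) = f(p4) = f(p5) = 2
    yields the clusters K1 = {p1, p3} and K2 = {p2, p4, p5}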

5
Typical Clustering Applications
  • As a stand-alone tool to
  • get insight into data distribution
  • find the characteristics of each cluster
  • assign the cluster of a new example
  • As a preprocessing step for other algorithms
  • e.g. dimensionality reduction using cluster
    centers to represent data in clusters

6
Clustering Example - Stars
  • Star clustering based on temperature and
    brightness (Hertzsprung-Russell diagram)
  • The 3 clusters represent stars in 3 different
    phases of their life
  • Astronomers had to perform clustering to identify
    these categories
  • Well-defined clusters

From Data Mining Techniques, M. Berry, G.
Linoff, John Wiley and Sons Publ.
7
Clustering Example - Houses
  • A given dataset may be clustered on different
    attributes

8
Clustering Example - Animals
  • 16 animals described with 13 binary attributes

9
Clustering Example Fitting Troops
  • Fitting the troops: re-design of uniforms for
    female soldiers in the US Army
  • Goal: reduce the number of uniform sizes to be
    kept in inventory while still providing a good fit
  • Researchers from Cornell University used
    clustering and designed a new set of sizes
  • Traditional clothing size system: an ordered set
    of graduated sizes where all dimensions increase
    together
  • The new system: sizes that fit body types
  • E.g. one size for short-legged, small-waisted
    women with wide and long torsos, average arms,
    broad shoulders, and skinny necks

10
Other Examples of Clustering Applications
  • Marketing
  • help discover distinct groups of customers, and
    then use this knowledge to develop targeted
    marketing programs
  • Biology
  • derive plant and animal taxonomies
  • find genes with similar function
  • Land use
  • identify areas of similar land use in an earth
    observation database
  • Insurance
  • identify groups of motor insurance policy holders
    with a high average claim cost
  • City-planning
  • identify groups of houses according to their
    house type, value, and geographical location

11
Clustering Important Features
  • The best number of clusters is not known
  • There is no one correct answer to a clustering
    problem
  • domain expert may be required
  • Interpreting the semantic meaning of each cluster
    is difficult
  • What are the characteristics that the items have
    in common?
  • Domain expert is needed
  • Cluster results are dynamic (change over time) if
    data is dynamic
  • e.g. clustering web logs for patterns of usage

12
Taxonomy of Clustering Algorithms
13
Classification of Clustering Algorithms cont.
  • Hierarchical clustering
  • create a nested set of clusters
  • each level in the hierarchy has a separate set of
    clusters
  • lowest level: each item is in its own cluster
  • highest level: all items form one cluster
  • The desired number of clusters k is not an input
  • Agglomerative: bottom-up creation of the
    clusters
  • Divisive: top-down creation of the clusters

From Empirical Evaluation of Clustering
Algorithms, A. Rauber, J. Paralic, E. Pampalk,
JIOS, 24(2), 2000.
14
Classification of Clustering Algorithms cont.2
  • Partitional
  • create only one set of clusters
  • Require the number of clusters k to be
    pre-specified
  • Examples: k-means, nearest neighbor,
    Self-Organising Maps (SOM)

Clustering using SOM
From Empirical Evaluation of Clustering
Algorithms, A. Rauber, J. Paralic, E. Pampalk,
JIOS, 24(2), 2000.
15
Classification of Clustering Algorithms cont.3
  • Categorical and large DB algorithms
  • Traditional algorithms do not deal with
    categorical (nominal) data and are typically
    applied to small data sets that fit in memory
  • Some of the recent clustering algorithms address
    these issues (typically by sampling the data or
    using efficient data structures)
  • Other criteria to classify clustering algorithms
  • Produce overlapping or non-overlapping clusters
  • Serial (incremental) or simultaneous
  • items are examined one by one or all together
  • Monothetic or polythetic
  • examine one or many attribute values at a time

16
What is a Good Clustering?
  • A good clustering method will produce high
    quality clusters with
  • high intra-class similarity
  • low inter-class similarity
  • The similarity is measured using a distance
    function
  • e.g. the Davies-Bouldin (DB) index: a heuristic
    measure of the quality of the clustering;
    clusters are compared in pairs (formula below)
  • c: number of clusters
  • D(xi): mean squared distance from the points in
    cluster i to its center
  • D(xi, xj): distance between the centers of
    clusters i and j
  • What is the DB index for a good clustering: big
    or small?
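The index formula itself is not reproduced in this transcript; with the notation above, the usual Davies-Bouldin formulation is

    DB = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \frac{D(x_i) + D(x_j)}{D(x_i, x_j)}

Compact, well-separated clusters make each ratio small, so lower values indicate better clusterings.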

17
Characteristics of a Cluster
  • Consider a cluster K of N points {p1, ..., pN}
  • Centroid: the "middle" of the cluster
  • need not be an actual data point in the cluster
  • Medoid M: the centrally located data point
    (object) in the cluster
  • Radius: square root of the average squared
    distance from any point in the cluster to the
    centroid
  • Diameter: square root of the average squared
    distance between all pairs of points in the
    cluster (see formulas below)
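The slide's own formulas are not in the transcript; a common way to write them, with C denoting the centroid (notation assumed here), is

    C = \frac{1}{N} \sum_{i=1}^{N} p_i
    R = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \lVert p_i - C \rVert^2 }
    D = \sqrt{ \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \lVert p_i - p_j \rVert^2 }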

18
Distance Between Clusters
Many interpretations (see the formulas below):
  • Single link: the distance between 2 clusters is
    the smallest distance between an element in one
    cluster and an element in the other
  • Complete link: the largest distance between an
    element in one cluster and an element in the
    other
  • Average link: the average distance between each
    element in one cluster and each element in the
    other
  • Centroid: the distance between the centroids
  • Medoid: the distance between the medoids
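In symbols (a standard notation, not shown on the slide), for clusters Ki and Kj and a point-to-point distance d:

    d_{single}(K_i, K_j)   = \min_{p \in K_i,\, q \in K_j} d(p, q)
    d_{complete}(K_i, K_j) = \max_{p \in K_i,\, q \in K_j} d(p, q)
    d_{average}(K_i, K_j)  = \frac{1}{|K_i|\,|K_j|} \sum_{p \in K_i} \sum_{q \in K_j} d(p, q)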

19
Different Ways of Visualizing Clusters
         1     2     3
    a   0.4   0.1   0.5
    b   0.1   0.8   0.1
    c   0.3   0.3   0.4
    d   0.1   0.1   0.8
    e   0.4   0.2   0.4
    f   0.1   0.4   0.5
    g   0.7   0.2   0.1
    h   0.5   0.4   0.1
20
K-Means Clustering Algorithm
  • Simple and very popular clustering algorithm
  • It is an iterative distance-based partitional
    clustering method
  • Requires the number of clusters k to be specified
    in advance
  • Can be implemented in 4 steps:
  • 1. Choose k seeds (vectors with the same
    dimensionality as the input examples; typically
    the first k examples are selected as seeds)
  • 2. Take each example, calculate the distance from
    it to all seeds, and assign it to the cluster with
    the nearest seed point
  • 3. At the end of each epoch compute the centroids
    (means) of the clusters
  • 4. If the stopping criterion is satisfied (no
    changes in the assignment of the examples or max
    number of epochs reached), stop. Otherwise,
    repeat 2 and 3 with the new centroids taking the
    role of the seeds.

21
K-Means Algorithm - Example
  • What is the output of k-means?
  • How can we use it to find the cluster of a new
    example?

22
K-Means Algorithm Pseudo Code
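The pseudo code on the original slide is not reproduced in this transcript. Below is a minimal Python sketch of the four steps described two slides earlier; NumPy and all names are illustrative assumptions, with the first k examples used as seeds as the slide suggests.

import numpy as np

def kmeans(examples, k, max_epochs=100):
    # Step 1: choose the first k examples as the initial seeds
    X = np.asarray(examples, dtype=float)
    centroids = X[:k].copy()
    assignment = np.zeros(len(X), dtype=int)
    for epoch in range(max_epochs):
        # Step 2: assign each example to the cluster with the nearest seed/centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        # Step 4: stop when no assignment changed in the last epoch
        if epoch > 0 and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its cluster
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

The output (the centroids and the cluster assignment of each example) also answers the questions on the previous example slide: a new example is assigned to the cluster whose centroid is nearest.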
23
K-means - Issues
  • Different distance measures can be used
  • typically Euclidean distance is used
  • Data should be normalized
  • Typically produces good results
  • Computationally expensive, does not scale well
  • Involves finding the distance from each example
    to each cluster center at each iteration
  • Time complexity O(tkn), where t is the number of
    iterations, k the number of clusters, and n the
    number of examples
  • Not optimal: finds a local optimum, may miss the
    global one
  • Standard k-means does not work on nominal data
  • calculating distances for nominal feature vectors
  • defining a mean for nominal attribute types
  • There are variations of k-means that handle
    nominal data (e.g. k-modes)

24
K-means Issues (cont.)
  • What type of clusters does k-means produce:
    convex-shaped or non-convex-shaped?
  • Convex region (hull): a region in which any point
    can be connected to any other by a straight line
    that does not cross the boundary of the region

25
K-means Variations
  • Improving the chances of k-means finding the
    global minimum
  • Different ways to initialize the seeds
  • Careful selection of the number of clusters
  • Using weights based on how close the example is
    to the cluster center (as in Gaussian mixture models)
  • Allowing clusters to split and merge
  • Split if the variance within a cluster is large
  • Merge if the distance between cluster centers is
    smaller than a threshold
  • Make it scale better
  • Save distance information from one iteration to
    the next, thus reducing the number of
    calculations
  • Typical values of k: 2 to 10
  • K-means can be used for hierarchical clustering
  • Start with k=2 and repeat recursively within each
    cluster (a sketch follows below)
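A minimal sketch of this recursive use, assuming the kmeans function sketched on the pseudo-code slide below; the function name, the levels parameter, and returning the leaf clusters as a flat list are illustrative assumptions.

import numpy as np

def bisecting_kmeans(examples, levels):
    # Recursively split each cluster in two with k = 2,
    # producing one level of the hierarchy per recursion step
    X = np.asarray(examples, dtype=float)
    if levels == 0 or len(X) < 2:
        return [X]                        # stop: return this cluster unsplit
    _, assignment = kmeans(X, 2)          # split the current cluster in two
    halves = [X[assignment == 0], X[assignment == 1]]
    return [c for half in halves if len(half) > 0
              for c in bisecting_kmeans(half, levels - 1)]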

26
Nearest Neighbor Clustering Algorithm
  • A new instance forms a new cluster or is merged
    into an existing one depending on how close it is
    to the existing clusters
  • a threshold t determines whether to merge or to
    create a new cluster (see the sketch below)

// t1 is placed in a cluster by itself
// t2 .. tn: add each item to an existing cluster
// or place it in a new one?
  • Time complexity O(n^2), n = number of items
  • Each item is compared to each item already
    placed in a cluster
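A minimal Python sketch of this procedure, assuming a precomputed symmetric distance matrix dist and 0-based item indices; the function and parameter names are illustrative.

def nearest_neighbor_clustering(dist, t):
    # dist[i][j]: distance between items i and j; t: merge threshold
    n = len(dist)
    cluster_of = [0]            # item 0 starts in a cluster by itself
    num_clusters = 1
    for i in range(1, n):
        # compare item i with every item already placed in a cluster
        nearest = min(range(i), key=lambda j: dist[i][j])
        if dist[i][nearest] <= t:
            cluster_of.append(cluster_of[nearest])   # merge into the nearest item's cluster
        else:
            cluster_of.append(num_clusters)          # start a new cluster
            num_clusters += 1
    return cluster_of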

27
Nearest Neighbor Clustering - Example
  • Given: 5 items with the distances between them
  • Task: cluster them using the Nearest Neighbor
    algorithm with a threshold t = 2

- A: K1 = {A}
- B: d(B,A) = 1 <= t  =>  K1 = {A,B}
- C: d(C,A) = d(C,B) = 2 <= t  =>  K1 = {A,B,C}
- D: d(D,A) = 2, d(D,B) = 4, d(D,C) = 1; dmin = 1 <= t  =>  K1 = {A,B,C,D}
- E: d(E,A) = 3, d(E,B) = 3, d(E,C) = 5, d(E,D) = 3; dmin = 3 > t  =>  K2 = {E}
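The same grouping can be reproduced with the nearest_neighbor_clustering sketch from the previous slide; the matrix below encodes the distances listed above (the A..E ordering and symmetric layout are assumptions).

dist = [
    [0, 1, 2, 2, 3],   # A
    [1, 0, 2, 4, 3],   # B
    [2, 2, 0, 1, 5],   # C
    [2, 4, 1, 0, 3],   # D
    [3, 3, 5, 3, 0],   # E
]
print(nearest_neighbor_clustering(dist, t=2))   # [0, 0, 0, 0, 1]: {A,B,C,D} and {E}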
28
Hierarchical Clustering
  • Creates not one set of clusters but several sets
    of clusters
  • The desired number of clusters k is not an input
  • The hierarchy of clusters can be represented as a
    tree structure called dendrogram
  • Leaves of the dendrogram consist of 1 item
  • each item is in its own cluster
  • Root of the dendrogram contains all items
  • all items form one cluster
  • Internal nodes represent clusters formed by
    merging the clusters of the children
  • Each level is associated with a distance
    threshold that was used to merge the clusters
  • If the distance between 2 clusters was smaller
    than the threshold they were merged

29
Dendrogram Representation
  • A set of ordered triples (d, k, K)
  • d: threshold value
  • k: number of clusters
  • K: the set of clusters
  • Example:
  • (0, 5, {{A}, {B}, {C}, {D}, {E}}),
  • (1, 3, {{A,B}, {C,D}, {E}}),
  • (2, 2, {{A,B,C,D}, {E}}),
  • (3, 1, {{A,B,C,D,E}})
  • Thus, the output is not one set of clusters but
    several. One can determine which of the sets to
    use.
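In code, this output can be kept as a simple list of (d, k, K) triples; the representation below is illustrative, not from the original slides.

dendrogram = [
    (0, 5, [{'A'}, {'B'}, {'C'}, {'D'}, {'E'}]),
    (1, 3, [{'A', 'B'}, {'C', 'D'}, {'E'}]),
    (2, 2, [{'A', 'B', 'C', 'D'}, {'E'}]),
    (3, 1, [{'A', 'B', 'C', 'D', 'E'}]),
]
# pick the level whose threshold or cluster count suits the application
clusters_at_k3 = next(K for d, k, K in dendrogram if k == 3)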

30
Agglomerative vs Divisive Clustering
  • Agglomerative
  • Start with each item in its own cluster;
    iteratively merge clusters until all items belong
    to one cluster
  • Merging is based on how close the clusters are to
    each other
  • Calculating distance between clusters: single
    link, complete link, average link
  • Distance threshold d: if the distance between two
    clusters is smaller than or equal to d, merge them
  • Initially d is set to a small value that is
    incremented at each level
  • Divisive
  • Place all items in one cluster; iteratively split
    clusters in two until all items are in their own
    cluster
  • Splitting is based on the distance between
    clusters: split if the distance is smaller than or
    equal to the threshold d
  • Initially d is set to a big value that is
    decremented at each level

31
Agglomerative Algorithms Pseudo Code
  • Different algorithms merge clusters at each level
    differently (procedure NewClusters)
  • Merge only 2 clusters at a time, or more?
  • If there are several clusters with identical
    distances, which ones to merge?
  • How to determine the distance between clusters?
  • single link
  • complete link
  • average link
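The pseudo code figure itself is not in this transcript. Below is a minimal Python sketch of the agglomerative scheme with single-link merging and an integer threshold incremented per level; the function name and the threshold step are illustrative assumptions.

def agglomerative_single_link(dist):
    # dist[i][j]: distance between items i and j (the adjacency matrix)
    n = len(dist)
    clusters = [{i} for i in range(n)]                 # level 0: every item in its own cluster
    dendrogram = [(0, n, [set(c) for c in clusters])]  # (d, k, K) triples, as on the dendrogram slide
    d = 1                                              # distance threshold, incremented per level
    while len(clusters) > 1:
        merged = True
        while merged:                                  # NewClusters: merge all clusters within distance d
            merged = False
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    single_link = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                    if single_link <= d:
                        clusters[a] |= clusters.pop(b)
                        merged = True
                        break
                if merged:
                    break
        dendrogram.append((d, len(clusters), [set(c) for c in clusters]))
        d += 1
    return dendrogram

Assuming the same 5-item distance matrix as in the nearest-neighbor example, this reproduces the (d, k, K) triples shown on the dendrogram representation slide.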

32
NewClusters Procedure
  • NewClusters typically finds all the clusters that
    are within distance d from each other (according
    to the distance measure used), merges them and
    updates the adjacency matrix
  • Example
  • Given: 5 items with the distances between them
  • Task: cluster them using agglomerative single
    link clustering

33
Example Solution 1
  • Distance level 1: merge {A,B} and {C,D}; update
    the adjacency matrix
  • Distance level 2: merge {A,B,C,D}; update the
    adjacency matrix
  • Distance level 3: merge {A,B,C,D,E}; all items
    are in one cluster, stop
  • Dendrogram

34
Single Link Algorithm as a Graph Problem
  • NewClusters can be replaced with a procedure for
    finding connected components in a graph
  • a connected component is a maximal set of vertices
    such that there exists a path between any 2 of them
  • Examples
  • A and B are connected; A, B, C and D are
    connected
  • C and D are connected
  • Show the graph edges with a distance of d or
    below
  • Merge 2 clusters if there is at least 1 edge that
    connects them (i.e. if the minimum distance
    between any 2 points is <= d)
  • Increment d
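A minimal sketch of such a procedure (depth-first search over the thresholded graph; the names are illustrative), matching the input/output described on the next slide:

def connected_components(dist, d):
    # vertices i and j are joined by an edge if dist[i][j] <= d;
    # each connected component of the resulting graph is one cluster
    n = len(dist)
    component = [None] * n
    k = 0
    for start in range(n):
        if component[start] is not None:
            continue
        stack = [start]                      # depth-first search from an unvisited vertex
        component[start] = k
        while stack:
            v = stack.pop()
            for u in range(n):
                if component[u] is None and dist[v][u] <= d:
                    component[u] = k
                    stack.append(u)
        k += 1
    return k, component                      # number of clusters and membership array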

35
Example Solution 2
  • Procedure NewClusters
  • Input: a graph defined by a set of vertices and a
    vertex adjacency (distance) matrix
  • Output: a set of connected components defined by
    the number of these components (i.e. the number of
    clusters k) and an array with the membership of
    these components (i.e. K, the set of clusters)

Single link dendrogram
36
Single Link vs. Complete Link Algorithm
  • Single link suffers from the so-called chain
    effect
  • 2 clusters are merged if only 2 of their points
    are close to each other
  • there may be points in the 2 clusters that are
    far from each other but this has no effect on the
    algorithm
  • Thus the clusters may contain points that are not
    related to each other but simply happen to be
    near points that are close to each other
  • Complete link: the distance between 2 clusters
    is the largest distance between an element in one
    cluster and an element in the other
  • Generates more compact clusters
  • Dendrogram for the example

37
Average Link
  • Average link: the distance between 2
    clusters is the average distance between an
    element in one cluster and an element in the other

For our example
38
Divisive Clustering
  • All items are initially placed in one cluster
  • Clusters are iteratively split in two until all
    items are in their own cluster
  • The splits occur in reverse order of the merges
    (from e to b in the figure)

39
Applicability and Complexity
  • Hierarchical clustering algorithms are suitable
    for domains with natural nesting relationships
    between clusters
  • Biology: plant and animal taxonomies can be
    viewed as a hierarchy of clusters
  • Space complexity of the algorithm: O(n^2), n =
    number of items
  • the space required to store the adjacency
    (distance) matrix
  • Space complexity of the dendrogram: O(kn), k =
    number of levels
  • Time complexity of the algorithm: O(kn^2), 1
    iteration for each level of the dendrogram
  • Not incremental: assumes all data is present