Title: Clustering methods
1Clustering methods
Part 1 Introduction
Pasi Fränti, 9.2.2017
Machine Learning, School of Computing, University of Eastern Finland, Joensuu, FINLAND
2Sample data
Sources of RGB vectors
Red-Green plot of the vectors
3Sample data
Employment statistics
4Application examples
5Color reconstruction
Image with original colors
Image with compression artifacts
6Speaker modeling for voice biometrics
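[Figure: training data from the speakers Tomi, Mikko and Matti passes through feature extraction and clustering to produce speaker models. An unknown speech sample (?) passes through feature extraction and is compared against the models. Best match: Matti.]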
7Speaker modeling
Speech data
Result of clustering
8Image segmentation
Image with 4 color clusters
Normalized color plots according to red and green components (axes: red, green).
9Signal quantization
Approximation of continuous range values (or a
very large set of possible discrete values) by a
small set of discrete symbols or integer values
Quantized signal
Original signal
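The definition above can be illustrated with a minimal sketch of uniform scalar quantization. The slides do not specify a particular quantizer, so the uniform levels, the value range, and the helper names `quantize` and `reconstruct` are assumptions made for illustration only.

```python
def quantize(value, lo, hi, levels):
    """Map a continuous value in [lo, hi] to an integer symbol 0..levels-1."""
    step = (hi - lo) / levels
    index = int((value - lo) / step)
    # Clamp so that out-of-range values and value == hi get a valid symbol.
    return min(max(index, 0), levels - 1)


def reconstruct(index, lo, hi, levels):
    """Return the midpoint of the interval that a symbol represents."""
    step = (hi - lo) / levels
    return lo + (index + 0.5) * step


# Hypothetical signal samples in the range [0, 1], quantized to 4 levels.
signal = [0.03, 0.26, 0.49, 0.74, 0.99]
symbols = [quantize(v, 0.0, 1.0, 4) for v in signal]
approx = [reconstruct(s, 0.0, 1.0, 4) for s in symbols]
```

With 4 levels each sample is replaced by one of four symbols, and the reconstruction error is at most half a step. Clustering-based quantization replaces the fixed, evenly spaced levels with levels fitted to the data.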
10Color quantization of images
Color image
RGB samples
Clustering
11Users on map
12Clustering the users
13Clustering of photos in two ways
Clustering timeline
Clustering of photos
14Photo clusters on map
Last known location of the user
User and date
Number of photos
Clusters
16Clusters in the timeline view
Number of photos
Clusters
Functions
Open cluster
Start slideshow
17Clustering GPS tracks: mobile users, taxi routes, fleet management
18Conclusions from clusters
Cluster 2 Home
Cluster 1 Office
19Clustering keywords
20Clustering text descriptions
21Home take care services
22Clustering user preferences
23Part I: Clustering problem
24Subproblems of clustering
- Where are the clusters? (Algorithmic problem)
- How many clusters? (Methodological problem: which criterion?)
- Selection of attributes (Application-related problem)
- Preprocessing the data (Practical problems: normalization, outliers)
25Definitions and data
Set of N data vectors
$X = \{x_1, x_2, \ldots, x_N\}$
Partition of the data
$P = \{p_1, p_2, \ldots, p_M\}$
Set of M cluster prototypes (centroids)
$C = \{c_1, c_2, \ldots, c_M\}$
26Distance and cost function
Euclidean distance of data vectors
Total square error
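The formulas for this slide did not survive in the transcript. As a minimal sketch, assuming the standard definitions, the Euclidean distance is the square root of the summed squared coordinate differences, and the total squared error sums the squared distance from every vector to the centroid of its cluster. The function names and the representation of the partition as a list of cluster indices (one per data vector, as in the K-means pseudocode) are assumptions.

```python
import math


def euclidean(x, c):
    """Euclidean distance between a data vector x and a centroid c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def total_squared_error(X, C, P):
    """Sum of squared distances from each vector to its assigned centroid.

    P[i] is assumed to hold the (0-based) cluster index of X[i].
    """
    return sum(euclidean(x, C[p]) ** 2 for x, p in zip(X, P))


# Hypothetical toy data: two clusters of two points each.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 1.0), (10.0, 1.0)]
P = [0, 0, 1, 1]

tse = total_squared_error(X, C, P)
mse = tse / len(X)  # dividing by N is one common normalization, assumed here
```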
27Clustering result as partition
Cluster prototypes
Partition of data
Illustrated by Voronoi diagram
Illustrated by Convex hulls
28Duality of partition and centroids
Cluster prototypes
Partition of data
Centroids as prototypes
Partition by nearest prototype mapping
29Dependency of data structures
- Centroid condition: for a given partition (P), the optimal cluster centroids (C) for minimizing MSE are the average vectors of the clusters.
- Optimal partition: for given centroids (C), the optimal partition is the one that maps each data vector to its nearest centroid.
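Both conditions follow from the total squared error, as the short derivation below sketches. It assumes that p_i denotes the cluster index of x_i (as in the K-means pseudocode) and introduces n_j, the number of vectors assigned to cluster j, which is not a symbol from the slides.

```latex
% Total squared error for centroids C and partition P
f(C, P) = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2

% Fixed P: set the gradient with respect to c_j to zero
\frac{\partial f}{\partial c_j} = -2 \sum_{i:\, p_i = j} (x_i - c_j) = 0
\quad\Longrightarrow\quad
c_j = \frac{1}{n_j} \sum_{i:\, p_i = j} x_i

% Fixed C: every term of the sum can be minimized independently
p_i = \arg\min_{1 \le j \le M} \lVert x_i - c_j \rVert^2
```

Neither step can increase f, which is why alternating the two steps converges to a local minimum.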
30K-means algorithm
31K-means algorithm
X = Data set
C = Cluster centroids
P = Partition

K-Means(X, C) → (C, P)
REPEAT
    Cprev ← C
    FOR all i ∈ [1, N] DO
        pi ← FindNearest(xi, C)
    FOR all j ∈ [1, k] DO
        cj ← Average of xi | pi = j
UNTIL C = Cprev
Optimal partition
Optimal centroids
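The K-means pseudocode can be sketched in plain Python as follows. This is an illustrative version, not the author's implementation: the `max_iter` safety cap and the rule that an empty cluster keeps its previous centroid are assumptions that the slides do not specify.

```python
def find_nearest(x, C):
    """Index of the centroid in C closest to x (squared Euclidean distance)."""
    return min(range(len(C)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, C[j])))


def k_means(X, C, max_iter=100):
    """Alternate the optimal partition and optimal centroid steps until C is stable."""
    C = [tuple(c) for c in C]
    P = []
    for _ in range(max_iter):  # safety cap (assumption, not in the pseudocode)
        C_prev = C
        # Optimal partition: map every vector to its nearest centroid.
        P = [find_nearest(x, C) for x in X]
        # Optimal centroids: average of the vectors in each cluster.
        C = []
        for j, old in enumerate(C_prev):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                C.append(old)  # empty cluster keeps its centroid (assumption)
        if C == C_prev:
            break
    return C, P


# Hypothetical toy data with two obvious groups and a poor initial guess.
X = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
C, P = k_means(X, C=[(0.0, 0.0), (5.0, 5.0)])
```

On this toy data the loop stops in its second iteration, because the partition, and therefore the centroids, no longer change.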
32Summary
33How to solve?
- Solve the clustering
  - Given input data (X) of N data vectors, and number of clusters (M), find the clusters.
  - Result given as a set of prototypes, or partition.
- Solve the number of clusters
  - Define appropriate cluster validity function f.
  - Repeat the clustering algorithm for several M.
  - Select the best result according to f.
- Solve the problem efficiently.
Algorithmic problem
Mathematical problem
Computer science problem
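The procedure for solving the number of clusters can be sketched as follows. The slides do not name a specific validity function f, so the mean silhouette coefficient is used here as one assumed choice. The evenly spaced initialization, the `max_iter` cap and the toy data are likewise assumptions for illustration.

```python
import math


def k_means(X, M, max_iter=100):
    """Plain k-means; initial centroids are evenly spaced samples of X (assumption)."""
    C = [X[i * len(X) // M] for i in range(M)]
    for _ in range(max_iter):
        P = [min(range(M), key=lambda j: math.dist(x, C[j])) for x in X]
        new_C = []
        for j in range(M):
            members = [x for x, p in zip(X, P) if p == j]
            new_C.append(tuple(sum(col) / len(members) for col in zip(*members))
                         if members else C[j])
        if new_C == C:
            break
        C = new_C
    return C, P


def silhouette(X, P):
    """Mean silhouette coefficient (needs at least two non-empty clusters).

    Higher values indicate compact, well-separated clusters.
    """
    labels = sorted(set(P))
    total = 0.0
    for x, p in zip(X, P):
        own = [y for y, q in zip(X, P) if q == p]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(x, y) for y, q in zip(X, P) if q == k) / P.count(k)
                for k in labels if k != p)
        total += (b - a) / max(a, b)
    return total / len(X)


# Hypothetical data: three well-separated groups of four points.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
     (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]

# Repeat the clustering for several M and select the best result according to f.
scores = {M: silhouette(X, k_means(X, M)[1]) for M in range(2, 6)}
best_M = max(scores, key=scores.get)
```

For this data the score peaks at M = 3. Other validity functions can be substituted for `silhouette` without changing the outer loop.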
34Challenges in clustering
Incorrect cluster allocation
Incorrect number of clusters
Too many clusters
Cluster missing
Clusters missing
35Taxonomy of clustering [Jain, Murty, Flynn, Data clustering: A review, ACM Computing Surveys, 1999]
- One possible classification based on cost function.
- MSE is well defined and most popular.
36Clustering method
- Clustering method defines the problem
- Clustering algorithm solves the problem
- Problem defined as cost function
- Goodness of one cluster
- Similarity vs. distance
- Global vs. local (merge cost, cut)
- Solution algorithm to solve the problem
37Complexity of clustering
- Number of possible clusterings
- Clustering problem is NP-complete [Garey et al., 1982]
- Optimal solution by branch-and-bound in exponential time.
- Practical solutions by heuristic algorithms.
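The formula for the number of possible clusterings did not survive in the transcript. A minimal sketch, assuming the count meant is the number of ways to split N data vectors into M non-empty clusters, is the Stirling number of the second kind, computed here with its standard recurrence. The function name `stirling2` is chosen for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty, unlabeled groups."""
    if n == k:
        return 1  # every item in its own group (also covers n = k = 0)
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing groups or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


small = stirling2(4, 2)      # 4 points into 2 clusters
larger = stirling2(10, 3)    # 10 points into 3 clusters
huge = stirling2(100, 5)     # already far too many to enumerate
```

Even 10 points and 3 clusters give 9330 possible clusterings, and the count grows roughly like M^N / M!, which is why exhaustive search is impractical.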
38Software
39Animator
http://cs.uef.fi/sipu/clustering/animator/
40Clusterator
http://cs.uef.fi/paikka/Radu/clusterator/
41Cluster software
http://cs.uef.fi/sipu/soft/cluster2009.exe
- Main area: working space for data
- Input area: inputs to be processed
- Output area: obtained results
- Menu Process: selection of operation
42Procedure to simulate k-means
Clustering image
Data set
Codebook
Partition
Open data set (file .ts), move it into Input area
Process: Random codebook, select number of clusters
REPEAT
    Move obtained codebook from Output area into Input area
    Process: Optimal partition, select Error function
    Move codebook into Main area, partition into Input area
    Process: Optimal codebook
UNTIL DESIRED CLUSTERING
43Conclusions
- Clustering is a fundamental tool needed everywhere in computer science and beyond.
- Failing to do clustering properly may compromise the application analysis.
- A good clustering tool is needed so that researchers can focus on application requirements.
44Literature
- S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 3rd edition, 2006.
- C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- A.K. Jain, M.N. Murty and P.J. Flynn, Data clustering: A review, ACM Computing Surveys, 31(3): 264-323, September 1999.
- M.R. Garey, D.S. Johnson and H.S. Witsenhausen, The complexity of the generalized Lloyd-Max problem, IEEE Transactions on Information Theory, 28(2): 255-256, March 1982.
- F. Aurenhammer, Voronoi diagrams: a survey of a fundamental geometric data structure, ACM Computing Surveys, 23(3): 345-405, September 1991.