Title: Clustering and Visual Data Analysis
1. Clustering and Visual Data Analysis
- Ata Kaban
- The University of Birmingham
2. The Clustering Problem
Unsupervised Learning
Data (input)
Interesting structure (output)
- "Interesting" structure:
- contains essential characteristics
- discards inessential details
- provides a compact summary of the data (e.g. to visualise it on the screen)
- is interpretable for humans
- etc.
An objective function expresses our notion of interestingness for this data.
3. One reason for clustering data
- Here is some data
- Assume you transmit the coordinates of points drawn randomly from this data set.
- You are only allowed to send a small number (say 2 or 3) of bits per point, so it will be a lossy transmission.
- Loss = sum of squared errors between the original and the decoded coordinates.
- Which encoder / decoder will lose the least information?
10. Formalising
- What objective does K-means optimise?
- Given an encoder function ENC : R^T → {1..K}
- (T is the dimension of the data, K is the number of clusters)
- Given a decoder function DEC : {1..K} → R^T
- DISTORTION = sum_n || x_n − DEC(ENC(x_n)) ||²
- where DEC(k) = µ_k are the cluster centres, k = 1..K
- So DISTORTION = sum_{n=1..N} || x_n − µ_ENC(x_n) ||², where N is the number of points (a numpy sketch of ENC, DEC and DISTORTION follows)
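A minimal numpy sketch of this objective (the function names encode, decode and distortion are my own, not from the slides): the encoder maps each point to the index of its nearest centre, the decoder maps an index back to that centre, and the distortion sums the squared reconstruction errors.

    import numpy as np

    def encode(X, centres):
        # ENC: R^T -> {1..K}; here, the index of the nearest centre (0-based)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        return np.argmin(dists, axis=1)

    def decode(codes, centres):
        # DEC: {1..K} -> R^T; DEC(k) = mu_k, the centre of cluster k
        return centres[codes]

    def distortion(X, centres):
        # DISTORTION = sum_n || x_n - DEC(ENC(x_n)) ||^2
        return np.sum((X - decode(encode(X, centres), centres)) ** 2)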
11. The minimal distortion
- DISTORTION = sum_n || x_n − µ_ENC(x_n) ||²
- We want this to be minimised.
- What properties must µ_1, ..., µ_K satisfy at the minimum?
- 1) Each point x_n must be encoded by its nearest centre; otherwise DISTORTION could be reduced by replacing ENC(x_n) with the index of the centre nearest to x_n.
- 2) Each µ_k must be the centroid of its own points (a one-line derivation follows).
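As a sketch of why condition 2 holds, in the notation above: setting the gradient of the distortion with respect to µ_k to zero forces µ_k to be the centroid of the points encoded to cluster k,

    \frac{\partial}{\partial \mu_k} \sum_{n:\,\mathrm{ENC}(x_n)=k} \lVert x_n - \mu_k \rVert^2
      = -2 \sum_{n:\,\mathrm{ENC}(x_n)=k} (x_n - \mu_k) = 0
      \;\Longrightarrow\;
      \mu_k = \frac{1}{N_k} \sum_{n:\,\mathrm{ENC}(x_n)=k} x_n ,

where N_k is the number of points encoded to cluster k.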
12. The K-means algorithm
- If N is the known number of points and K the desired number of clusters, the K-means algorithm is (a runnable numpy sketch follows this slide):
- Begin
- initialise µ_1, µ_2, ..., µ_K (randomly selected)
- do classify the N samples according to the nearest µ_i
-    recompute each µ_i as the centroid of the samples assigned to it
- until no change in any µ_i
- return µ_1, µ_2, ..., µ_K
- End
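A minimal, self-contained numpy sketch of this procedure (the function name kmeans and the initialisation from randomly chosen data points are assumptions of this sketch, not prescribed by the slides):

    import numpy as np

    def kmeans(X, K, seed=0):
        rng = np.random.default_rng(seed)
        # initialise mu_1..mu_K as K randomly selected data points
        centres = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        assignment = None
        while True:
            # classify each sample according to the nearest centre
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            new_assignment = np.argmin(dists, axis=1)
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break  # assignments unchanged, so the centres will not change either
            assignment = new_assignment
            # recompute each centre as the centroid of its own points
            for k in range(K):
                if np.any(assignment == k):
                    centres[k] = X[assignment == k].mean(axis=0)
        return centres, assignment

For example, on two well-separated blobs the two returned centres land near the blob means:

    X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
                   np.random.default_rng(2).normal(4, 0.5, (50, 2))])
    centres, labels = kmeans(X, K=2)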
16. Other forms of clustering
- Many times clusters are not disjoint: a cluster may have subclusters, in turn having sub-subclusters.
- → Hierarchical clustering
17.
- Given any two samples x and x′, they will be grouped together at some level, and if they are grouped at level k, they remain grouped at all higher levels.
- Hierarchical clustering → a tree representation called a dendrogram (sketched below)
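As a sketch of what a dendrogram looks like in practice (assuming SciPy and matplotlib are available; the library choice is mine, not the slides'):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])

    Z = linkage(X, method='average')  # the sequence of pairwise merges
    dendrogram(Z)                     # tree representation of the groupings
    plt.ylabel('merge level (distance)')
    plt.show()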
18.
- The similarity values may help to determine whether the groupings are natural or forced, but if they are evenly distributed no information can be gained.
- Another representation is based on sets, e.g., Venn diagrams.
19.
- Hierarchical clustering can be divided into agglomerative and divisive methods.
- Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters.
- Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.
20. Agglomerative hierarchical clustering
- The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than a mean or a representative vector for each cluster (see the sketch after this slide).
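A hedged sketch of this procedure using SciPy's agglomerative routines (my choice of library; the slides do not name one), cutting the merge tree once the specified number of clusters is reached and returning the clusters as sets of point indices:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

    Z = linkage(X, method='single')                  # bottom-up merging of singleton clusters
    labels = fcluster(Z, t=2, criterion='maxclust')  # stop once 2 clusters remain

    # each cluster is returned as the set of points it contains, not a centre vector
    clusters = {k: set(np.flatnonzero(labels == k)) for k in np.unique(labels)}
    print(clusters)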
21. The problem of the number of clusters
- Typically, the number of clusters is known.
- When it is not, we face a hard problem called model selection. There are several ways to proceed.
- A common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc. (a sketch follows below).
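One common instance of this, sketched with scikit-learn's KMeans (the library choice is an assumption, not from the slides): run the clustering for c = 1, 2, 3, ... and watch how the distortion (inertia_) decreases as c grows.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2)),
                   rng.normal((0, 4), 0.5, (50, 2))])

    for c in range(1, 7):
        km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X)
        print(c, km.inertia_)  # inertia_ is the DISTORTION of slide 10 for this c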
22. What did we learn today?
- Data clustering
- K-means algorithm in detail
- How K-means can get stuck, and how to take care of that
- The outline of hierarchical clustering methods
23. Pattern Classification: find out more here!
- Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000