EE3J2 Data Mining, Lecture 11: Clustering, Martin Russell

1
EE3J2 Data Mining
Lecture 11: Clustering
Martin Russell
2
Objectives
  • To explain the motivation for clustering
  • To introduce the ideas of distance and distortion
  • To describe agglomerative and divisive clustering
  • To explain the relationships between clustering
    and decision trees

3
Example from speech processing
  • Plot of high-frequency energy vs low-frequency
    energy, for 25 ms speech segments, sampled every
    10 ms

4
Structure of data
  • Typical real data is not uniformly distributed
  • It has structure
  • Variables might be correlated
  • The data might be grouped into natural clusters
  • The purpose of cluster analysis is to find this
    underlying structure automatically

5
Clusters and centroids
  • If we assume that the clusters are spherical,
    then they are determined by their centres
  • The cluster centres are called centroids
  • How many centroids do we need?
  • Where should we put them?

6
Distance
  • A function d(x,y) defined on pairs of points x
    and y is called a distance or metric if it
    satisfies:
  • $d(x,x) = 0$ for every point x
  • $d(x,y) = d(y,x)$ for all points x and y (d is
    symmetric)
  • $d(x,z) \le d(x,y) + d(y,z)$ for all points x, y and
    z (this is called the triangle inequality)
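
A minimal sketch of how the three conditions above could be checked numerically on a small set of points. The helper names `euclidean` and `satisfies_metric_axioms`, the sample points and the tolerance are illustrative assumptions, not part of the lecture. A check on a finite sample can only find counterexamples; it does not prove that a function is a metric.

```python
import itertools
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def satisfies_metric_axioms(d, points, tol=1e-12):
    """Test the three conditions on every pair and triple drawn from points."""
    # d(x, x) = 0 for every point x
    for x in points:
        if abs(d(x, x)) > tol:
            return False
    # d(x, y) = d(y, x): symmetry
    for x, y in itertools.product(points, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    # d(x, z) <= d(x, y) + d(y, z): triangle inequality
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True


def squared(x, y):
    """Squared Euclidean distance, used here as a counterexample."""
    return euclidean(x, y) ** 2


points = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.5), (6.0, -2.0)]
print(satisfies_metric_axioms(euclidean, points))                      # True
print(satisfies_metric_axioms(squared, [(0.0,), (1.0,), (2.0,)]))      # False
```

The squared distance fails because going from 0 to 2 directly costs 4, while going via 1 costs 1 + 1 = 2, which violates the triangle inequality.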

7
Example metrics
  • The most common metric is the Euclidean metric
  • In this case, if $x = (x_1, x_2, \ldots, x_N)$ and
    $y = (y_1, y_2, \ldots, y_N)$ then
    $d(x,y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$
  • This corresponds to the standard notion of
    distance in Euclidean space
  • There are many other metrics, but we focus on this one
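
As a small illustration of the Euclidean metric, the sketch below computes it directly from its definition and compares the result with `math.dist` from the Python standard library (available from Python 3.8). The function name `euclidean` and the sample points are assumptions chosen for this example.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of components")
    return math.sqrt(sum((xn - yn) ** 2 for xn, yn in zip(x, y)))


x = (1.0, 2.0)
y = (4.0, 6.0)
print(euclidean(x, y))   # 5.0, since sqrt(3^2 + 4^2) = 5
print(math.dist(x, y))   # standard-library equivalent
```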

8
Distortion
  • Distortion is a measure of how well a set of
    centroids models a set of data
  • Suppose we have
  • data points $y_1, y_2, \ldots, y_T$
  • centroids $c_1, \ldots, c_M$
  • For each data point $y_t$ let $c_{i(t)}$ be the closest
    centroid
  • In other words $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$
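
The closest-centroid index i(t) can be sketched as follows, assuming the Euclidean metric. The function name `nearest_centroid_index` and the sample data are illustrative assumptions. Indices are 0-based here, and ties go to the lowest index because `min` returns the first minimal element.

```python
import math


def nearest_centroid_index(y, centroids):
    """Return the index m that minimises the distance from y to centroids[m]."""
    return min(range(len(centroids)), key=lambda m: math.dist(y, centroids[m]))


data = [(1.0, 1.0), (9.0, 8.0), (2.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
assignments = [nearest_centroid_index(y, centroids) for y in data]
print(assignments)   # [0, 1, 0]
```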

9
Distortion
  • The distortion for the centroid set $C = \{c_1, \ldots, c_M\}$
    is defined by
    $\mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})$
  • In other words, the distortion is the sum of
    distances between each data point and its nearest
    centroid
  • The task of clustering is to find a centroid set
    C such that the distortion Dist(C) is minimised
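
A minimal sketch of the distortion as described above: the sum, over all data points, of the distance to the nearest centroid. It assumes the Euclidean metric; the function name `distortion` and the sample data are assumptions made for illustration.

```python
import math


def distortion(data, centroids):
    """Sum of distances between each data point and its nearest centroid."""
    return sum(min(math.dist(y, c) for c in centroids) for y in data)


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

two_centroids = [(0.0, 1.0), (10.0, 1.0)]
one_centroid = [(5.0, 1.0)]

print(distortion(data, two_centroids))   # 4.0: every point is at distance 1
print(distortion(data, one_centroid))    # larger: 4 * sqrt(26)
```

In this example the two-centroid set models the data better than the single central centroid, which is reflected in its lower distortion.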

10
Types of Clustering
  • Initially we will look at two types of cluster
    analysis
  • Agglomerative clustering, or bottom-up
    clustering
  • Divisive clustering, or top-down clustering

11
Agglomerative clustering
  • Agglomerative clustering begins by assuming that
    each data point belongs to its own unique
    one-point cluster
  • Clusters are then combined until the required
    number of clusters is obtained
  • The simplest agglomerative clustering algorithm
    is one which, at each stage, combines the two
    closest centroids into a single centroid
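
The simplest agglomerative scheme described above might be sketched like this. The slides do not say how two centroids are combined, so this sketch assumes the merged centroid is the mean of the two, weighted by the number of data points each one represents. The function name `agglomerate` and the sample data are also assumptions. The pairwise search is deliberately naive and is only suitable for small data sets.

```python
import math


def agglomerate(data, num_clusters):
    """Merge the two closest centroids repeatedly until num_clusters remain."""
    # Each entry is (centroid, number of data points it represents).
    centroids = [(tuple(p), 1) for p in data]
    while len(centroids) > num_clusters:
        # Find the closest pair of centroids (first pair wins on ties).
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                dist = math.dist(centroids[i][0], centroids[j][0])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = centroids[i], centroids[j]
        # Assumed merge rule: count-weighted mean of the two centroids.
        merged = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        del centroids[j]   # j > i, so delete j first to keep index i valid
        del centroids[i]
        centroids.append((merged, ni + nj))
    return [c for c, _ in centroids]


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(sorted(agglomerate(data, 2)))   # [(0.0, 0.5), (10.0, 0.5)]
```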

12
Original data (302 points)
13
252 centroids
14
152 centroids
15
52 centroids
16
12 centroids
17
Divisive Clustering
  • Divisive clustering begins by assuming that there
    is just one centroid, typically in the centre of
    the set of data points
  • That point is replaced with 2 new centroids
  • Then each of these is replaced with 2 new
    centroids
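
A rough sketch of the top-down procedure. The slides do not specify how a centroid is replaced by two new ones, so this example makes two assumptions: each centroid is split into two copies shifted by a small offset `eps` in opposite directions, and each copy is then moved to the mean of the data points nearest to it. The function names `mean`, `split_once` and `divisive`, the offset and the sample data are all hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def split_once(data, centroids, eps=0.01):
    """Replace every centroid with two new centroids (assumed splitting rule)."""
    candidates = []
    for c in centroids:
        candidates.append(tuple(v + eps for v in c))
        candidates.append(tuple(v - eps for v in c))
    # Assign each data point to its nearest candidate centroid.
    groups = [[] for _ in candidates]
    for y in data:
        m = min(range(len(candidates)), key=lambda k: math.dist(y, candidates[k]))
        groups[m].append(y)
    # Move each candidate to the mean of its points; keep it if it has none.
    return [mean(g) if g else c for g, c in zip(groups, candidates)]


def divisive(data, levels):
    """Start from one central centroid and split every centroid `levels` times."""
    centroids = [mean(data)]
    for _ in range(levels):
        centroids = split_once(data, centroids)
    return centroids


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(sorted(divisive(data, 1)))   # [(0.0, 0.5), (10.0, 0.5)]
print(len(divisive(data, 2)))      # 4
```

Each level doubles the number of centroids, so `levels` splits give 2 to the power `levels` centroids.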

18
Original data (302 points), with one position marked "?"
19
Original data (302 points), with two positions marked "?"
20
Decision tree interpretation
Single centroid: whole set
Multiple centroids: one per data point
21
Note on optimality
  • An optimal set of centroids is one which
    minimises the distortion
  • None of these methods necessarily gives an optimal
    set of centroids
  • Instead they give locally optimal sets of
    centroids
  • Why?

22
Summary
  • Distance metrics and distortion
  • Agglomerative clustering
  • Divisive clustering
  • Decision tree interpretation