

1
Clustering
2
Outline
  • Introduction
  • K-means clustering
  • Hierarchical clustering, COBWEB

3
Classification vs. Clustering
Classification (supervised learning): learns a
method for predicting the instance class from
pre-labeled (classified) instances
4
Clustering
Unsupervised learning: finds a natural grouping
of instances given unlabeled data
5
Clustering Methods
  • Many different methods and algorithms
  • For numeric and/or symbolic data
  • Deterministic vs. probabilistic
  • Exclusive vs. overlapping
  • Hierarchical vs. flat
  • Top-down vs. bottom-up

6
Clusters: exclusive vs. overlapping
Simple 2-D representation: non-overlapping clusters
Venn diagram: overlapping clusters


7
Clustering Evaluation
  • Manual inspection
  • Benchmarking on existing labels
  • Cluster quality measures
  • distance measures
  • high similarity within a cluster, low across
    clusters
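
One widely used quality measure of the within-vs-across kind described above is the silhouette coefficient (not named on the slide). A minimal sketch, assuming NumPy and scikit-learn are available:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: two well-separated 2-D blobs.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Silhouette ranges from -1 (poor) to +1 (tight, well-separated clusters).
print(silhouette_score(X, labels))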

8
The distance function
  • Simplest case: one numeric attribute A
  • Distance(X,Y) = |A(X) − A(Y)|
  • Several numeric attributes
  • Distance(X,Y) = Euclidean distance between X, Y
  • Nominal attributes: distance is set to 1 if
    values are different, 0 if they are equal
  • Are all attributes equally important?
  • Weighting the attributes might be necessary
    (a sketch follows)
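
A minimal sketch of this distance function, assuming instances are Python dicts mapping attribute names to values and that the numeric attributes are known in advance (the attribute names below are illustrative, not from the slides):

import math

def distance(x, y, numeric_attrs, weights=None):
    total = 0.0
    for a in x:
        w = weights.get(a, 1.0) if weights else 1.0
        if a in numeric_attrs:
            d = x[a] - y[a]                      # numeric: A(X) - A(Y)
        else:
            d = 0.0 if x[a] == y[a] else 1.0     # nominal: 0 if equal, 1 otherwise
        total += w * d * d                       # optional attribute weighting
    return math.sqrt(total)                      # Euclidean combination

x = {"temp": 30.0, "outlook": "sunny"}
y = {"temp": 25.0, "outlook": "rainy"}
print(distance(x, y, numeric_attrs={"temp"}))    # sqrt(5**2 + 1**2)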

9
Simple Clustering: K-means
  • Works with numeric data only
  1. Pick a number (K) of cluster centers (at random)
  2. Assign every item to its nearest cluster center
     (e.g. using Euclidean distance)
  3. Move each cluster center to the mean of its
     assigned items
  4. Repeat steps 2-3 until convergence (change in
     cluster assignments less than a threshold)
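
A from-scratch sketch of these four steps, assuming NumPy; initializing with K distinct data points is one common choice, not mandated by the slide:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K cluster centers at random (here, K distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                # step 4: assignments unchanged
        assign = new_assign
        # Step 3: move each center to the mean of its assigned items.
        for j in range(k):
            if np.any(assign == j):              # keep old center if cluster empties
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

centers, labels = kmeans(np.random.default_rng(1).normal(size=(100, 2)), k=3)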

10
K-means example, step 1
Pick 3 initial cluster centers (randomly)
11
K-means example, step 2
Assign each point to the closest cluster center
12
K-means example, step 3
Move each cluster center to the mean of each
cluster
13
K-means example, step 4
Reassign the points that are now closest to a
different cluster center. Q: Which points are reassigned?
14
K-means example, step 4
A: three points (highlighted in the slide animation)
15
K-means example, step 4b
re-compute cluster means
16
K-means example, step 5
move cluster centers to cluster means
17
Discussion
  • Result can vary significantly depending on
    initial choice of seeds
  • Can get trapped in a local minimum
  • To increase the chance of finding the global
    optimum: restart with different random seeds
    (a sketch follows)
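
A sketch of the restart strategy using scikit-learn (an assumed dependency): n_init re-runs K-means from that many random initializations and keeps the run with the lowest within-cluster sum of squares:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(km.inertia_)   # within-cluster sum of squares of the best of 25 runs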

18
K-means clustering summary
  • Advantages
  • Simple, understandable
  • Items are automatically assigned to clusters
  • Disadvantages
  • Must pick the number of clusters beforehand
  • All items are forced into a cluster
  • Too sensitive to outliers

19
K-means variations
  • K-medoids: instead of the mean, use the median of
    each cluster
  • Mean of 1, 3, 5, 7, 9 is 5
  • Mean of 1, 3, 5, 7, 1009 is 205
  • Median of 1, 3, 5, 7, 1009 is 5
  • Median advantage: not affected by extreme values
  • For large databases, use sampling
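
The arithmetic above, checked with the standard library:

from statistics import mean, median

print(mean([1, 3, 5, 7, 9]))        # 5
print(mean([1, 3, 5, 7, 1009]))     # 205
print(median([1, 3, 5, 7, 1009]))   # 5, unaffected by the outlier
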
20
Hierarchical clustering
  • Bottom up
  • Start with single-instance clusters
  • At each step, join the two closest clusters
  • Design decision: distance between clusters
  • E.g. two closest instances in clusters vs.
    distance between means
  • Top down
  • Start with one universal cluster
  • Find two clusters
  • Proceed recursively on each subset
  • Can be very fast
  • Both methods produce a dendrogram
    (a bottom-up sketch follows)
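
A bottom-up sketch using SciPy (an assumed dependency); method='single' takes the two closest instances as the cluster distance, method='centroid' the distance between means, matching the design decision above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
Z = linkage(X, method='single')                    # repeatedly join the two closest clusters
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the dendrogram into 3 clusters
print(labels)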

21
Incremental clustering
  • Heuristic approach (COBWEB/CLASSIT)
  • Form a hierarchy of clusters incrementally
  • Start: tree consists of an empty root node
  • Then: add instances one by one
  • update the tree appropriately at each stage
  • to update, find the right leaf for an instance
  • may involve restructuring the tree
  • Base update decisions on category utility

22
Clustering weather data
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
E Rainy Cool Normal False
F Rainy Cool Normal True
G Overcast Cool Normal True
H Sunny Mild High False
I Sunny Cool Normal False
J Rainy Mild Normal False
K Sunny Mild Normal True
L Overcast Mild High True
M Overcast Hot Normal False
N Rainy Mild High True
23
Clustering weather data
  • Merge the best host and runner-up
  • Consider splitting the best host if merging
    doesn't help
24
Final hierarchy
ID Outlook Temp. Humidity Windy
A Sunny Hot High False
B Sunny Hot High True
C Overcast Hot High False
D Rainy Mild High False
Oops! A and B are actually very similar.
25
Example: the iris data (subset)
26
Clustering with cutoff
27
Category utility
  • Category utility: a quadratic loss function defined
    on conditional probabilities:

    CU(C_1, ..., C_k) =
      (1/k) Σ_l Pr[C_l] Σ_i Σ_j ( Pr[a_i = v_ij | C_l]² − Pr[a_i = v_ij]² )

  • If every instance is in a different category, the
    numerator reaches its maximum value:
    n − Σ_i Σ_j Pr[a_i = v_ij]², where n is the total
    number of attributes
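
A sketch of this formula for nominal data, with clusters given as lists of attribute-value dicts (the helper names are illustrative, not from the slides):

from collections import Counter

def category_utility(clusters):
    instances = [x for c in clusters for x in c]
    n, attrs = len(instances), list(instances[0].keys())
    def sum_sq_probs(insts):
        # sum_i sum_j Pr[a_i = v_ij]^2 over the given instances
        return sum(sum((c / len(insts)) ** 2
                       for c in Counter(x[a] for x in insts).values())
                   for a in attrs)
    base = sum_sq_probs(instances)
    gain = sum(len(c) / n * (sum_sq_probs(c) - base) for c in clusters)
    return gain / len(clusters)      # division by k penalizes many small clusters

c1 = [{"outlook": "sunny"}, {"outlook": "sunny"}]
c2 = [{"outlook": "rainy"}, {"outlook": "overcast"}]
print(category_utility([c1, c2]))
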
28
Overfitting-avoidance heuristic
  • If every instance gets put into a different
    category, the numerator becomes maximal:
    n − Σ_i Σ_j Pr[a_i = v_ij]²
  • where n is the total number of attributes
  • So without k in the denominator of the
    CU formula, every cluster would consist of one
    instance!
29
Levels of Clustering
30
Hierarchical Clustering
  • Clusters are created in levels, creating a set
    of clusters at each level
  • Agglomerative
  • Initially each item is in its own cluster
  • Clusters are iteratively merged together
  • Bottom up
  • Divisive
  • Initially all items are in one cluster
  • Large clusters are successively divided
  • Top down

31
Dendrogram
  • Dendrogram: a tree data structure which
    illustrates hierarchical clustering techniques
  • Each level shows the clusters for that level
  • Leaf: individual clusters
  • Root: one cluster
  • A cluster at level i is the union of its children
    clusters at level i+1

32
Agglomerative Example
Distance matrix:

     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

(Figure: the points and the resulting single-link
dendrogram for thresholds 1-5: A-B and C-D merge at
distance 1, the two pairs merge at 2, and E joins at 3.)
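
The worked example can be checked with SciPy (an assumed dependency) by feeding the distance matrix to single-link clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)
Z = linkage(squareform(D), method='single')   # condensed form required by linkage
print(Z)   # A-B and C-D merge at 1, the pairs at 2, E joins at 3
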
33
Distance Between Clusters
  • Single link: smallest distance between points
  • Complete link: largest distance between points
  • Average link: average distance between points
  • Centroid: distance between centroids
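
The four choices written out for two clusters given as NumPy point arrays:

import numpy as np

def pairwise(a, b):
    # matrix of all point-to-point Euclidean distances between clusters a and b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single(a, b):   return pairwise(a, b).min()    # smallest distance
def complete(a, b): return pairwise(a, b).max()    # largest distance
def average(a, b):  return pairwise(a, b).mean()   # average distance
def centroid(a, b): return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))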

34
Single Link Clustering
35
Other Clustering Approaches
  • EM: probability-based clustering
  • Bayesian clustering
  • SOM: self-organizing maps

36
Self-Organizing Map
37
Self Organizing Map
  • Unsupervised learning
  • Competitive learning

(Figure: an n-dimensional input layer feeding an
output layer; the winning output neuron is marked.)
38
Self Organizing Map
  • Determine the winner (the neuron whose weight
    vector has the smallest distance to the input
    vector)
  • Move the weight vector w of the winning neuron
    towards the input i, i.e. w ← w + η (i − w)
    for a learning rate η

39
Self Organizing Map
  • Impose a topological order onto the competitive
    neurons (e.g., a rectangular map)
  • Let the neighbors of the winner share the prize
    (the postcode lottery principle)
  • After learning, neurons with similar weights tend
    to cluster on the map (a training sketch follows)
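
A compact training sketch pulling the three SOM slides together, assuming NumPy; the Gaussian map neighborhood and the linear decay schedules are common choices, not taken from the slides:

import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(rows, cols, data.shape[1]))   # one weight vector per neuron
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing='ij'), axis=-1)   # neuron map coordinates
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = t / steps
            lr = 0.5 * (1 - frac)                          # learning rate decays
            sigma = 0.5 + (max(rows, cols) / 2) * (1 - frac)   # neighborhood shrinks
            # Winner: neuron whose weight vector is closest to the input.
            d = np.linalg.norm(w - x, axis=2)
            win = np.array(np.unravel_index(d.argmin(), d.shape))
            # Neighbors share the prize: Gaussian falloff with map distance.
            h = np.exp(-((grid - win) ** 2).sum(axis=-1) / (2 * sigma ** 2))
            w += lr * h[..., None] * (x - w)               # move weights toward input
            t += 1
    return w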

40-41
Self Organizing Map
(Figures only.)
42
Self Organizing Map
  • Input: uniformly randomly distributed points
  • Output: map of 202 neurons
  • Training: starting with a large learning rate and
    neighborhood size, both are gradually decreased
    to facilitate convergence

43-48
Self Organizing Map
(Figures only.)
49
Discussion
  • Can interpret clusters by using supervised
    learning
  • learn a classifier based on the clusters
  • Decrease dependence between attributes?
  • pre-processing step
  • E.g. use principal component analysis
    (a sketch follows)
  • Can be used to fill in missing values
  • Key advantage of probabilistic clustering
  • Can estimate the likelihood of the data
  • Use it to compare different models objectively
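
A sketch of the PCA pre-processing step using scikit-learn (an assumed dependency), reducing correlated attributes before clustering:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 10))
X2 = PCA(n_components=2).fit_transform(X)        # decorrelated, reduced attributes
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)
print(labels[:10])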

50
Examples of Clustering Applications
  • Marketing: discover customer groups and use them
    for targeted marketing and re-organization
  • Astronomy: find groups of similar stars and
    galaxies
  • Earthquake studies: observed earthquake
    epicenters should be clustered along continental
    faults
  • Genomics: finding groups of genes with similar
    expression patterns

51
Clustering Summary
  • Unsupervised
  • Many approaches
  • K-means: simple, sometimes useful
  • K-medoids is less sensitive to outliers
  • Hierarchical clustering works for symbolic
    attributes
  • Evaluation is a problem