Cluster Analysis - PowerPoint PPT Presentation
1
Cluster Analysis
  • Craig A. Struble
  • Department of Mathematics, Statistics, and
    Computer Science
  • Marquette University

2
Overview
  • Background
  • Example K-Means
  • Dissimilarity
  • Tangent: K-Nearest Neighbor
  • Partitioning Methods
  • Hierarchical Methods
  • Probability Based Methods
  • Interpreting Clusters

3
Goals
  • Explore different clustering techniques
  • Understand distance measures
  • Define new distance measures
  • K-nearest neighbor classification and data
    imputation
  • Interpret clustering results
  • Applications of clustering

4
Background
  • Clustering is the process of identifying natural
    groupings in the data
  • Unsupervised learning technique
  • No predefined class labels
  • Classic text is Finding Groups in Data by Kaufman
    and Rousseeuw, 1990
  • Several implementations
  • R and Weka

5
Clustering
6
Visualizations
7
Example K-Means
Algorithm KMeans
Input:  k    - number of clusters
        t    - number of iterations
        data - the data
Output: C    - a set of k clusters

cent = arbitrarily select k objects as initial centers
for i = 1 to t do
    for each d in data do
        assign label x to d such that dist(d, cent[x]) is minimized
    for x = 1 to k do
        cent[x] = mean value of all data with label x
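
As a concrete illustration of the loop above (not on the original slide), a minimal sketch in R using the built-in kmeans(); the data points, k = 2, and t = 10 are made-up assumptions:

    # Made-up 2-D data; each row is one object d
    data <- matrix(c(1, 1,  2, 1,  10, 4,  5, 4,  9, 5,  6, 5),
                   ncol = 2, byrow = TRUE)
    km <- kmeans(data, centers = 2, iter.max = 10)   # k = 2 clusters, t = 10 iterations
    km$cluster    # the label x assigned to each object
    km$centers    # the final centers (cent in the pseudocode)
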
8
K-Means Example
  • Use Euclidean distance, dist(x,y) = sqrt( (x1-y1)^2 + ... + (xl-yl)^2 ),
    where l is the number of dimensions
  • Select (10,4) and (5,4) as the initial points

9
K-Means Clustering
[Plots of the resulting clusterings, one with 3 clusters and one with 2 clusters]
10
Cluster Centers
11
K-Means Summary
  • Very simple algorithm
  • Only works on data for which means can be
    calculated
  • Continuous data
  • O(knt) time complexity
  • Finds circular (spherical) cluster shapes only
  • Outliers can have a very negative impact

12
Outliers
13
Calculating Dissimilarity
  • Many clustering algorithms use a dissimilarity
    matrix as input
  • The matrix is instance-by-instance: entry (i,j) holds diss(i,j)

14
Continuous (Interval-scaled)
  • Euclidean distance: diss(i,j) = sqrt( sum over the attributes f of (xif - xjf)^2 )
  • Manhattan (or city block) distance: diss(i,j) = sum over f of |xif - xjf|
  • Minkowski distance: diss(i,j) = ( sum over f of |xif - xjf|^q )^(1/q)
    (a small R sketch follows)
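
A minimal R sketch of the three measures via dist(); the two vectors are made up for illustration:

    x <- rbind(c(1, 2, 3), c(4, 6, 8))
    dist(x, method = "euclidean")          # sqrt(sum((xi - xj)^2))
    dist(x, method = "manhattan")          # sum(|xi - xj|)
    dist(x, method = "minkowski", p = 3)   # (sum(|xi - xj|^3))^(1/3)
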

15
Boolean Attributes
  • Contingency table of counts over the boolean attributes:

                            Object j
                            1 (true)   0 (false)
    Object i   1 (true)        q           r
               0 (false)       s           t

  • Symmetric case (trues and falses carry equal
    information): diss(i,j) = (r + s) / (q + r + s + t)
16
Boolean Attributes
  • Asymmetric: true and false differ in importance
  • Example: a test for HIV positive
  • The number of matches on false, t, is not important
  • Jaccard coefficient: sim(i,j) = q / (q + r + s), giving
    diss(i,j) = (r + s) / (q + r + s)
    (a small R sketch follows)
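
A small R sketch of the asymmetric case; dist(..., method = "binary") ignores the false-false matches (the t cell) and returns one minus the Jaccard coefficient. The two test profiles are made up:

    i <- c(1, 1, 0, 0, 0, 1)
    j <- c(1, 0, 0, 0, 0, 1)
    dist(rbind(i, j), method = "binary")   # (r + s) / (q + r + s)
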

17
Boolean Example
18
Nominal Attributes
  • Count the number of matches m (attributes on which
    i and j take the same value)
  • The total number of attributes is p
  • diss(i,j) = (p - m) / p

19
Nominal Example
20
Ordinal Attributes
  • Use the ordering to rank the attribute values
  • e.g. none/1, some/2, full/3
  • Replace the rank rif with a standardized value in [0,1]:
    zif = (rif - 1) / (Mf - 1), where Mf is the number of ranks
  • Calculate dissimilarity on these values as for interval-scaled attributes

21
Ratio Attributes
  • Recall: a meaningful zero point
  • Generally only positive values, on a non-linear
    scale
  • Often follow exponential laws in time
  • e.g. decay of radiation intensity: x = A * e^(B*t)
    for some constants A and B

22
Options for Ratio Attributes
  • Treat them as interval values
  • Perform a log transformation
  • Treat as an ordinal value, transforming the
    values into rankings.

23
Example
  • Suppose we had calculated the radiation
    intensities of two samples after 10 seconds:
    x1 = 8.0, x2 = 10.0
  • Option 1 (Euclidean distance):
    diss(x1,x2) = 2.0
  • Option 2 (log2 transformation):
    y1 = log2 8.0 = 3, y2 = log2 10.0 = 3.32
    diss(x1,x2) = 0.32

24
Example (cont.)
  • Option 3 (Ordinal transformation):
  • Assume that 8 is the 4th highest, 10 is the 8th
    highest, and there are 10 total values
  • y1 = (4-1)/(10-1) = 0.33
  • y2 = (8-1)/(10-1) = 0.78
  • diss(x1,x2) = 0.45

25
Combining Multiple Types
  • Let xi be the ith instance
  • Let xif be the fth attribute of xi
  • Let deltaij(f) = 0 if xif or xjf is missing, and 1
    otherwise
  • Let dij(f) be the distance measure for feature f
    of xi and xj
  • Combined dissimilarity:
    diss(i,j) = sum over f of deltaij(f) * dij(f) / sum over f of deltaij(f)

26
Combining Multiple Types
  • If f is binary or nominal:
    dij(f) = 0 if xif = xjf, and 1 otherwise
  • If f is interval-based:
    dij(f) = |xif - xjf| / (maxh xhf - minh xhf),
    where h runs over the instances not missing f
  • If f is ordinal or ratio-scaled, calculate ranks and
    the standardized value zif, then treat as interval-scaled
    (an R sketch using daisy() follows)
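
In R, daisy() from the cluster package computes this kind of combined (Gower) dissimilarity directly from a data frame with mixed column types; the toy data frame below is an assumption for illustration:

    library(cluster)
    df <- data.frame(
      age    = c(23, 45, 31),                               # interval-scaled
      smoker = factor(c("yes", "no", "yes")),               # nominal/binary
      fill   = ordered(c("none", "full", "some"),
                       levels = c("none", "some", "full"))  # ordinal
    )
    daisy(df, metric = "gower")   # instance-by-instance dissimilarity matrix
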

27
Example
28
Standardization
  • When working with interval-scaled variables only,
    it's often necessary to standardize the values
  • Avoids bias towards one attribute
  • Calculate the mean mf of attribute f
  • Calculate the mean absolute deviation
    sf = (1/n) * sum over i of |xif - mf|
  • Calculate the z score zif = (xif - mf) / sf for the
    attribute for each instance (sketch below)
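
A small R sketch of these three steps; R's built-in scale() divides by the standard deviation, so the mean-absolute-deviation version from the slide is written out by hand (the attribute values are made up):

    x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # one interval-scaled attribute f
    mf <- mean(x)                     # mean of f
    sf <- mean(abs(x - mf))           # mean absolute deviation of f
    z  <- (x - mf) / sf               # z scores used in place of the raw values
    z
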

29
Tangent: K-Nearest Neighbor (KNN)
  • Simple classification technique
  • Let y be a new instance to classify
  • Find the K nearest instances (with labels) to y
  • Classify y with the label appearing most
    frequently
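
A minimal sketch with knn() from R's class package; the training points, their labels, and K = 3 are all made-up assumptions:

    library(class)
    train  <- matrix(c(1, 1,  1, 2,  8, 8,  9, 8), ncol = 2, byrow = TRUE)
    labels <- factor(c("a", "a", "b", "b"))
    y      <- matrix(c(2, 1), ncol = 2)   # the new instance to classify
    knn(train, y, cl = labels, k = 3)     # majority label of the 3 nearest neighbors
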

30
Data imputation with KNN
  • Apply the KNN technique described previously to the
    instance with the missing value
  • Take the mean (median, mode, etc.) of the K
    nearest neighbors
  • Use that value to fill in the missing value
    (sketch below)
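
A hand-rolled sketch of the idea; knn_impute() is a hypothetical helper (not from the slides) that fills one missing cell with the mean of its K nearest neighbors, measuring distance on the attributes that are present:

    knn_impute <- function(data, row, col, k = 3) {
      others <- data[-row, , drop = FALSE]              # instances with known values
      d <- apply(others[, -col, drop = FALSE], 1,
                 function(r) sqrt(sum((r - data[row, -col])^2)))   # Euclidean on known attrs
      nn <- order(d)[1:k]                               # the K nearest neighbors
      mean(others[nn, col])                             # imputed value for the missing cell
    }
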

31
Example (Data Imputation)
32
Assessing Data Imputation
  • Use the data imputation technique of choice on values
    that are actually known (held out on purpose)
  • Calculate the mean squared error
  • Use a statistical test to identify a confidence
    interval for the error
  • Need to identify the distribution of the error (e.g.
    normal, etc.)
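
A tiny sketch of the error calculation, assuming some known values were hidden and then imputed (the numbers are made up):

    truth   <- c(3.0, 7.5, 5.2)    # values hidden before imputation
    imputed <- c(2.8, 7.9, 5.0)    # values produced by the imputation of choice
    mean((truth - imputed)^2)      # mean squared error
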

33
Partitioning Methods
  • K-Means (already done)
  • K-Medoids (PAM)
  • CLARA
  • Fuzzy Clustering

34
Clustering
35
K-Medoids
  • K-Means is restricted to numeric attributes
  • Have to calculate the average object
  • A medoid is a representative object in a cluster.
  • The algorithm searches for K medoids in the set
    of objects.

36
K-Medoids
37
PAM Algorithm
  • PAM consists of two phases
  • BUILD - constructs an initial clustering
  • SWAP - refines the clustering
  • Goal: minimize the sum of dissimilarities to the K
    representative objects
  • Mathematically equivalent to minimizing the average
    dissimilarity

38
PAM Algorithm (BUILD Phase)
Algorithm PAM (BUILD Phase)
// Select k representative objects that appear to minimize dissimilarities
selected = {}                           // empty set
for x = 1 to k do
    maxgain = 0
    for each i in data - selected do
        gain = 0
        for each j in data - selected do
            // See if j is closer to i than to some previously selected object
            let Dj = min(diss(j,k) for each k in selected)
            let Cji = max(Dj - diss(j,i), 0)
            gain = gain + Cji           // total up improvements from i
        if gain > maxgain then
            maxgain = gain
            best = i
    selected = selected + {best}        // best representative object chosen

39
PAM Algorithm (SWAP Phase)
Algorithm PAM (SWAP Phase)
// Improve the partitioning; selected comes from the BUILD phase
repeat
    minswap = 0
    for each i in selected do
        for each h in data - selected do
            swap = 0
            for each j in data - selected, j != h do
                let Dj = min(diss(j,k) for each k in selected)
                if Dj < diss(j,i) and Dj < diss(j,h) then
                    // j is closer to some other selected object
                    swap = swap + 0                      // do nothing
                else if diss(j,i) = Dj then              // j is closest to i
                    let Ej = min(diss(j,k) for each k in selected - {i})
                    if diss(j,h) < Ej then
                        swap = swap + diss(j,h) - Dj     // j would follow h
                    else
                        swap = swap + Ej - Dj            // j would move to its 2nd-closest medoid
                else                                     // j is closer to h than to any medoid
                    swap = swap + diss(j,h) - Dj
            if swap < minswap then
                minswap = swap
                besti = i
                besth = h
    if minswap < 0 then
        selected = selected - {besti} + {besth}          // perform the best swap of i and h
until minswap >= 0
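
The next two slides show PAM output from R's cluster package; a call along these lines produces that kind of output (the data set used here is an assumption):

    library(cluster)
    pm <- pam(iris[, 1:4], k = 3)   # BUILD + SWAP
    pm$medoids                      # the K representative objects
    pm$clustering                   # cluster label for each instance
    plot(pm)                        # clusplot and silhouette plot
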

40
PAM Output (from R)
41
PAM Output (from R)
42
Silhouettes
  • These plots give an intuitive sense of how good
    the clustering is
  • Let diss(i,C) be the average dissimilarity
    between i and each element of cluster C
  • Let A be the cluster instance i is in
  • a(i) = diss(i,A)
  • Let B != A be the cluster such that diss(i,B) is
    minimized
  • b(i) = diss(i,B)
  • The silhouette number s(i) is
    s(i) = (b(i) - a(i)) / max(a(i), b(i))

43
Silhouettes
  • Let s(k) be the average silhouette width for k
    clusters
  • The silhouette coefficient of a data set is
    SC = max over k of s(k)
  • The k that maximizes this value is generally the
    best number of clusters to use (sketch below)
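
In R, silhouette() from the cluster package computes s(i) for every instance; averaging over i gives s(k). A sketch, with the data set assumed:

    library(cluster)
    d   <- dist(iris[, 1:4])
    pm  <- pam(d, k = 3)
    sil <- silhouette(pm)
    mean(sil[, "sil_width"])   # average silhouette width s(k) for k = 3
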

44
CLARA
  • Clustering LARge Applications
  • Modification of PAM
  • Draw a sample from the large data set
  • Sample size is typically 40 + 2k
  • Use PAM to find the medoids of the sample
  • Use those medoids to assign clusters to the entire data set
  • Can take multiple samples and choose the best
    resulting clustering (sketch below)
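
A sketch of the corresponding R call; clara() draws the samples, runs PAM on each, and keeps the best clustering (the data set is an assumption):

    library(cluster)
    k  <- 3
    cl <- clara(iris[, 1:4], k, samples = 5, sampsize = 40 + 2 * k)
    cl$medoids
    cl$clustering
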

45
Fuzzy Clustering (FANNY)
  • The previous partitioning methods are hard
  • Data can be in one and only one cluster
  • Instead of saying in or out, give a percentage of
    membership
  • This is the basis for fuzzy logic

46
Fuzzy Clustering (FANNY)
  • The algorithm is a bit too complex to cover
  • Idea: minimize the objective function
    sum over clusters v of
      [ sum over pairs i,j of uiv^2 * ujv^2 * diss(i,j) ] / [ 2 * sum over j of ujv^2 ]
  • where uiv is the unknown membership of object i
    in cluster v (R sketch below)
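
A sketch of the R call; fanny() returns the membership matrix directly (the data set and k are assumptions):

    library(cluster)
    fz <- fanny(iris[, 1:4], k = 2)
    head(fz$membership)   # uiv: membership of object i in cluster v
    fz$clustering         # hardened (nearest crisp) assignment, for comparison
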

47
FANNY Results
48
FANNY Results
49
Huh?
50
Whoa!
51
Hierarchical Methods
  • Top-down vs. bottom-up
  • Agglomerative Nesting (AGNES)
  • Divisive Analysis (DIANA)
  • BIRCH

52
Top-Down vs. Bottom-Up
  • Top-down or divisive approaches split the whole
    data set into smaller pieces
  • Bottom-up or agglomerative approaches combine
    individual elements

53
Agglomerative Nesting
  • Combine clusters until one cluster is obtained
  • Initially each cluster contains one object
  • At each step, merge the two most similar
    clusters (R sketch below)
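
The AGNES result slides that follow were presumably produced with agnes() from R's cluster package; a minimal sketch with an assumed data set (method = "average" is the default group-average linkage):

    library(cluster)
    ag <- agnes(iris[, 1:4], method = "average")
    plot(ag)                        # banner and dendrogram
    cutree(as.hclust(ag), k = 3)    # cut the tree into 3 clusters
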

54
Cluster Dissimilarities
[Figure: dissimilarity diss(i,j) between an object i in cluster R and an object j in cluster Q]
55
AGNES Results
56
AGNES Results
57
AGNES Modifications
  • The dissimilarity between clusters can be defined
    differently
  • Maximum dissimilarity between two objects
    (complete linkage)
  • Minimum dissimilarity between two objects
    (single linkage)
  • Centroid method (interval-scaled attributes)
  • Ward's method (interval-scaled attributes; based on
    the error sum of squares of a cluster)

58
Divisive Analysis (DIANA)
  • Calculate the diameter of each cluster Q
  • Select the cluster Q with the largest diameter
  • Split it into A and B
  • Select the object i in A that maximizes its average
    dissimilarity to the rest of A minus its average
    dissimilarity to B
  • Move i from A to B if that max value is > 0
    (R sketch below)
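
A matching R sketch using diana() from the cluster package (the data set is an assumption):

    library(cluster)
    di <- diana(iris[, 1:4])
    di$dc                           # divisive coefficient
    plot(di)                        # banner and dendrogram
    cutree(as.hclust(di), k = 3)
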

59
DIANA Results
60
DIANA Results
61
BIRCH
  • Balanced Iterative Reducing and Clustering Using
    Hierarchies
  • Mixes hierarchical clustering with other
    techniques
  • Useful for large data sets, because entire data
    is not kept in memory
  • Identifies and removes outliers (points seen as coming
    from a differing data distribution) during clustering
  • Presentation assumes continuous data

62
Two central concepts
  • A cluster feature (CF) is a triple summarizing
    information about a cluster: CF = (N, LS, SS),
    where N is the number of points in the cluster,
    LS is the linear sum of the data points, and SS is
    the square sum of the data points

63
Two central concepts
  • Contain enough information to calculate a variety
    of distance measures
  • Addition of CFs accurately represents CF for
    merged clusters
  • Memory and time efficient to maintain
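
A small R sketch of the CF triple and the additivity property mentioned above; cf() and merge_cf() are hypothetical helpers for illustration, not BIRCH code:

    cf <- function(pts) list(N  = nrow(pts),
                             LS = colSums(pts),   # linear sum of the points
                             SS = sum(pts^2))     # square sum of the points
    merge_cf <- function(a, b) list(N = a$N + b$N, LS = a$LS + b$LS, SS = a$SS + b$SS)

    p1 <- matrix(c(1, 2,  2, 3), ncol = 2, byrow = TRUE)
    p2 <- matrix(c(8, 8,  9, 7), ncol = 2, byrow = TRUE)
    identical(merge_cf(cf(p1), cf(p2)), cf(rbind(p1, p2)))   # TRUE: CFs simply add
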

64
Two Central Concepts
  • A CF tree is a height balanced tree with two
    parameters, branching factor B, and threshold T.

[Figure: a CF tree - a root node with entries CF1, CF2, CF3, ..., CFn; level-1 nodes with entries CF11, CF12, CF13, ..., CF1n; leaf entries pointing to the clusters]
65
Phase 1 Build CF tree
  • The CF tree is kept in memory and created
    dynamically
  • Identify the appropriate leaf: recursively descend the
    tree, following the closest child node
  • Modify the leaf: when the leaf is reached, add the new
    data item x to it; if the leaf then contains more than L
    entries, split the leaf (entries must also satisfy T)

66
Phase 1 Build CF tree
  • Steps continued
  • Modify the path to the leaf: update the CF of the parent;
    if the leaf was split, add a new entry to the parent; if
    the parent then violates B, split the parent node; update
    the parent nodes recursively
  • Merging refinement: find the non-leaf node Nj where
    splitting stopped, find the two closest entries in Nj, and
    if they are not the pair created by the split, merge them

67
Phase 1 Comments
  • The parameters B and L are a function of the page
    size P
  • Splits are caused by P, not by data distribution
  • Hence, refinement step
  • Increasing T makes a smaller tree, but can hide
    outliers
  • Change T if memory runs out. (Phase 2)

68
Phase 3 Global Clustering
  • Apply some global clustering technique to the
    leaf clusters in the CF tree
  • Fast, because everything is in memory
  • Accurate, because outliers removed, data
    represented at a level allowable by memory
  • Less order dependent, because leaves have data
    locality

69
Phase 4 Cluster Refinement
  • Use centroids of the clusters found in Phase 3
  • Identify centroid C closest to data point
  • Place data point in cluster represented by C

70
Probabilistic Methods
  • These will be delayed until we cover
    probabilistic methods in general
  • Mixture Models
  • COBWEB

71
Applications of Clustering
  • Gene function identification
  • Document clustering
  • Modeling economic data

72
Gene Function Identification
  • Genome is the blueprint defining an organism
    (DNA)
  • Genes are inherited portions of the genome
    related to biological functions
  • Proteins, non-coding RNA
  • Given the collection of biological information,
    try to predict or identify the function of genes
    with unknown function

73
Gene Expression
  • A gene is expressed if it is actively being
    transcribed (copied into RNA)
  • Rate of expression is related to the rate of
    transcription
  • Microarray experiments

74
Gene Expression Data
[Figure: matrix of gene expression measurements, with clones along one axis and experiments along the other]
75
Clustering Gene Expression Data
  • Identify genes with similar expression profiles
  • Use clustering
  • Identify function of known genes in a cluster
  • Assign that function to genes of unknown function
    in the same cluster

76
Clustering Gene Expression Data
Yeast Genome
77
Document Classification
  • Represent documents as vectors in a vector space
  • Latent Semantic Indexing
  • Cluster documents in this representation
  • Describe/summarize/evaluate the clusters
  • Label clusters with meaningful descriptions

78
Document transformation
  • Convert each document into table form
  • Attributes are important words
  • Value is the number of times the word appears
    (sketch below)
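
A small sketch of that transformation in R, done by hand on two made-up documents:

    docs  <- c("cluster the data", "cluster the cluster output")
    words <- strsplit(docs, " ")
    vocab <- sort(unique(unlist(words)))
    counts <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
    counts   # rows = documents, columns = words, values = occurrence counts
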

79
Document classification
  • Could select a word as the classification label
  • Identify it after clustering
  • Look at the medoid or centroid and examine its
    characteristics
  • Look at the number of times certain words appear
  • Look at which words appear together
  • Look at words that don't appear at all

80
Document Classification
  • Once clusters are identified, label each document
    with its cluster label
  • Use a classification technique to identify
    cluster relationships
  • Decision trees, for example
  • Other kinds of rule induction

81
Economic Modelling
  • Nettleton et al. (2000)
  • Objective is to identify how to make the Port of
    Barcelona the principal port of entry for
    merchandise.
  • Statistics, clustering, outlier analysis,
    categorization of continuous values

82
Data
  • Vessel specific
  • Date, type, origin, destination, metric tons
    loaded/unloaded, amount invoiced, quality
    measure, etc.
  • Economic indicators
  • Consumer price index, date (monthly), Industrial
    production index, etc.

83
Data Transformation
  • Data aggregated into 4 relational tables
  • Total monthly visits, etc.
  • Joined based on date
  • Separated into training (1997-1998) and test sets
    (1999)

84
Data Exploration
  • Used clustering with the Condorcet criterion
  • IBM Intelligent Miner
  • Identified relevant features
  • Production import volume > 400,000 MT for product 1
  • Area import volume > 250,000 MT for area 10
  • Then used rule induction to characterize clusters

85
Cluster Analysis Summarization
  • Unsupervised technique
  • Similarity-based
  • Can modify definition of similarity to meet needs
  • Usually combined with some other descriptive
    technique
  • Decision trees, rule induction, etc.