Cluster Analysis - PowerPoint PPT Presentation
1
Cluster Analysis
  • Craig A. Struble
  • Department of Mathematics, Statistics, and
    Computer Science
  • Marquette University

2
Overview
  • Background
  • Example K-Means
  • Dissimilarity
  • Tangent: K-Nearest Neighbor
  • Partitioning Methods
  • Hierarchical Methods
  • Probability Based Methods
  • Interpreting Clusters

3
Goals
  • Explore different clustering techniques
  • Understand distance measures
  • Define new distance measures
  • K-nearest neighbor classification and data
    imputation
  • Interpret clustering results
  • Applications of clustering

4
Background
  • Clustering is the process of identifying natural
    groupings in the data
  • Unsupervised learning technique
  • No predefined class labels
  • Classic text is Finding Groups in Data by Kaufman
    and Rousseeuw, 1990
  • Several implementations
  • R and Weka

5
Clustering
6
Visualizations
7
Example K-Means
Algorithm KMeans
Input:  k    - number of clusters
        t    - number of iterations
        data - the data
Output: C    - a set of k clusters

cent = arbitrarily select k objects as initial centers
for i = 1 to t do
    for each d in data do
        assign label x to d such that dist(d, cent[x]) is minimized
    for x = 1 to k do
        cent[x] = mean value of all data with label x
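
As a concrete illustration of the loop above (not on the original slide), a minimal sketch in R using the built-in kmeans(); the data points, k = 2, and t = 10 are made-up assumptions:

    # Made-up 2-D data; each row is one object d
    data <- matrix(c(1, 1,  2, 1,  10, 4,  5, 4,  9, 5,  6, 5),
                   ncol = 2, byrow = TRUE)
    km <- kmeans(data, centers = 2, iter.max = 10)   # k = 2 clusters, t = 10 iterations
    km$cluster    # the label x assigned to each object
    km$centers    # the final centers (cent in the pseudocode)
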
8
K-Means Example
  • Use Euclidean distance, dist(x,y) = sqrt( (x1-y1)^2 + ... + (xl-yl)^2 ),
    where l is the number of dimensions
  • Select (10,4) and (5,4) as the initial points

9
K-Means Clustering
[Plots of the resulting clusterings, one with 3 clusters and one with 2 clusters]
10
Cluster Centers
11
K-Means Summary
  • Very simple algorithm
  • Only works on data for which means can be
    calculated
  • Continuous data
  • O(knt) time complexity
  • Finds circular (spherical) cluster shapes only
  • Outliers can have a very negative impact

12
Outliers
13
Calculating Dissimilarity
  • Many clustering algorithms use a dissimilarity
    matrix as input
  • The matrix is instance-by-instance: entry (i,j) holds diss(i,j)

14
Continuous (Interval-scaled)
  • Euclidean distance: diss(i,j) = sqrt( sum over the attributes f of (xif - xjf)^2 )
  • Manhattan (or city block) distance: diss(i,j) = sum over f of |xif - xjf|
  • Minkowski distance: diss(i,j) = ( sum over f of |xif - xjf|^q )^(1/q)
    (a small R sketch follows)
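
A minimal R sketch of the three measures via dist(); the two vectors are made up for illustration:

    x <- rbind(c(1, 2, 3), c(4, 6, 8))
    dist(x, method = "euclidean")          # sqrt(sum((xi - xj)^2))
    dist(x, method = "manhattan")          # sum(|xi - xj|)
    dist(x, method = "minkowski", p = 3)   # (sum(|xi - xj|^3))^(1/3)
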

15
Boolean Attributes
  • Contingency table of counts over the boolean attributes:

                            Object j
                            1 (true)   0 (false)
    Object i   1 (true)        q           r
               0 (false)       s           t

  • Symmetric case (trues and falses carry equal
    information): diss(i,j) = (r + s) / (q + r + s + t)
16
Boolean Attributes
  • Asymmetric: true and false differ in importance
  • Example: a test for HIV positive
  • The number of matches on false, t, is not important
  • Jaccard coefficient: sim(i,j) = q / (q + r + s), giving
    diss(i,j) = (r + s) / (q + r + s)
    (a small R sketch follows)
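
A small R sketch of the asymmetric case; dist(..., method = "binary") ignores the false-false matches (the t cell) and returns one minus the Jaccard coefficient. The two test profiles are made up:

    i <- c(1, 1, 0, 0, 0, 1)
    j <- c(1, 0, 0, 0, 0, 1)
    dist(rbind(i, j), method = "binary")   # (r + s) / (q + r + s)
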

17
Boolean Example
18
Nominal Attributes
  • Count the number of matches m (attributes on which
    i and j take the same value)
  • The total number of attributes is p
  • diss(i,j) = (p - m) / p

19
Nominal Example
20
Ordinal Attributes
  • Use the ordering to rank the attribute values
  • e.g. none/1, some/2, full/3
  • Replace the rank rif with a standardized value in [0,1]:
    zif = (rif - 1) / (Mf - 1), where Mf is the number of ranks
  • Calculate dissimilarity on these values as for interval-scaled attributes

21
Ratio Attributes
  • Recall: a meaningful zero point
  • Generally only positive values, on a non-linear
    scale
  • Often follow exponential laws in time
  • e.g. decay of radiation intensity: x = A * e^(B*t)
    for some constants A and B

22
Options for Ratio Attributes
  • Treat them as interval values
  • Perform a log transformation
  • Treat as an ordinal value, transforming the
    values into rankings.

23
Example
  • Suppose we had calculated the radiation
    intensities of two samples after 10 seconds:
    x1 = 8.0, x2 = 10.0
  • Option 1 (Euclidean distance):
    diss(x1,x2) = 2.0
  • Option 2 (log2 transformation):
    y1 = log2 8.0 = 3, y2 = log2 10.0 = 3.32
    diss(x1,x2) = 0.32

24
Example (cont.)
  • Option 3 (Ordinal transformation):
  • Assume that 8 is the 4th highest, 10 is the 8th
    highest, and there are 10 total values
  • y1 = (4-1)/(10-1) = 0.33
  • y2 = (8-1)/(10-1) = 0.78
  • diss(x1,x2) = 0.45

25
Combining Multiple Types
  • Let xi be the ith instance
  • Let xif be the fth attribute of xi
  • Let deltaij(f) = 0 if xif or xjf is missing, and 1
    otherwise
  • Let dij(f) be the distance measure for feature f
    of xi and xj
  • Combined dissimilarity:
    diss(i,j) = sum over f of deltaij(f) * dij(f) / sum over f of deltaij(f)

26
Combining Multiple Types
  • If f is binary or nominal:
    dij(f) = 0 if xif = xjf, and 1 otherwise
  • If f is interval-based:
    dij(f) = |xif - xjf| / (maxh xhf - minh xhf),
    where h runs over the instances not missing f
  • If f is ordinal or ratio-scaled, calculate ranks and
    the standardized value zif, then treat as interval-scaled
    (an R sketch using daisy() follows)
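
In R, daisy() from the cluster package computes this kind of combined (Gower) dissimilarity directly from a data frame with mixed column types; the toy data frame below is an assumption for illustration:

    library(cluster)
    df <- data.frame(
      age    = c(23, 45, 31),                               # interval-scaled
      smoker = factor(c("yes", "no", "yes")),               # nominal/binary
      fill   = ordered(c("none", "full", "some"),
                       levels = c("none", "some", "full"))  # ordinal
    )
    daisy(df, metric = "gower")   # instance-by-instance dissimilarity matrix
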

27
Example
28
Standardization
  • When working with interval-scaled variables only,
    it's often necessary to standardize the values
  • Avoids bias towards one attribute
  • Calculate the mean mf of attribute f
  • Calculate the mean absolute deviation
    sf = (1/n) * sum over i of |xif - mf|
  • Calculate the z score zif = (xif - mf) / sf for the
    attribute for each instance (sketch below)
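
A small R sketch of these three steps; R's built-in scale() divides by the standard deviation, so the mean-absolute-deviation version from the slide is written out by hand (the attribute values are made up):

    x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # one interval-scaled attribute f
    mf <- mean(x)                     # mean of f
    sf <- mean(abs(x - mf))           # mean absolute deviation of f
    z  <- (x - mf) / sf               # z scores used in place of the raw values
    z
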

29
Tangent: K-Nearest Neighbor (KNN)
  • Simple classification technique
  • Let y be a new instance to classify
  • Find the K nearest instances (with labels) to y
  • Classify y with the label appearing most
    frequently
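
A minimal sketch with knn() from R's class package; the training points, their labels, and K = 3 are all made-up assumptions:

    library(class)
    train  <- matrix(c(1, 1,  1, 2,  8, 8,  9, 8), ncol = 2, byrow = TRUE)
    labels <- factor(c("a", "a", "b", "b"))
    y      <- matrix(c(2, 1), ncol = 2)   # the new instance to classify
    knn(train, y, cl = labels, k = 3)     # majority label of the 3 nearest neighbors
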

30
Data imputation with KNN
  • Apply the KNN technique described previously to the
    instance with the missing value
  • Take the mean (median, mode, etc.) of the K
    nearest neighbors
  • Use that value to fill in the missing value
    (sketch below)
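
A hand-rolled sketch of the idea; knn_impute() is a hypothetical helper (not from the slides) that fills one missing cell with the mean of its K nearest neighbors, measuring distance on the attributes that are present:

    knn_impute <- function(data, row, col, k = 3) {
      others <- data[-row, , drop = FALSE]              # instances with known values
      d <- apply(others[, -col, drop = FALSE], 1,
                 function(r) sqrt(sum((r - data[row, -col])^2)))   # Euclidean on known attrs
      nn <- order(d)[1:k]                               # the K nearest neighbors
      mean(others[nn, col])                             # imputed value for the missing cell
    }
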

31
Example (Data Imputation)
32
Assessing Data Imputation
  • Use the data imputation technique of choice on values
    that are actually known (held out on purpose)
  • Calculate the mean squared error
  • Use a statistical test to identify a confidence
    interval for the error
  • Need to identify the distribution of the error (e.g.
    normal, etc.)
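
A tiny sketch of the error calculation, assuming some known values were hidden and then imputed (the numbers are made up):

    truth   <- c(3.0, 7.5, 5.2)    # values hidden before imputation
    imputed <- c(2.8, 7.9, 5.0)    # values produced by the imputation of choice
    mean((truth - imputed)^2)      # mean squared error
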

33
Partitioning Methods
  • K-Means (already done)
  • K-Medoids (PAM)
  • CLARA
  • Fuzzy Clustering

34
Clustering
35
K-Medoids
  • K-Means is restricted to numeric attributes
  • Have to calculate the average object
  • A medoid is a representative object in a cluster.
  • The algorithm searches for K medoids in the set
    of objects.

36
K-Medoids
37
PAM Algorithm
  • PAM consists of two phases
  • BUILD - constructs an initial clustering
  • SWAP - refines the clustering
  • Goal: minimize the sum of dissimilarities to the K
    representative objects
  • Mathematically equivalent to minimizing the average
    dissimilarity

38
PAM Algorithm (BUILD Phase)
Algorithm PAM (BUILD Phase)
// Select k representative objects that appear to minimize dissimilarities
selected = {}                           // empty set
for x = 1 to k do
    maxgain = 0
    for each i in data - selected do
        gain = 0
        for each j in data - selected do
            // See if j is closer to i than to some previously selected object
            let Dj = min(diss(j,k) for each k in selected)
            let Cji = max(Dj - diss(j,i), 0)
            gain = gain + Cji           // total up improvements from i
        if gain > maxgain then
            maxgain = gain
            best = i
    selected = selected + {best}        // best representative object chosen

39
PAM Algorithm (SWAP Phase)
Algorithm PAM (SWAP Phase)
// Improve the partitioning; selected comes from the BUILD phase
repeat
    minswap = 0
    for each i in selected do
        for each h in data - selected do
            swap = 0
            for each j in data - selected, j != h do
                let Dj = min(diss(j,k) for each k in selected)
                if Dj < diss(j,i) and Dj < diss(j,h) then
                    // j is closer to some other selected object
                    swap = swap + 0                      // do nothing
                else if diss(j,i) = Dj then              // j is closest to i
                    let Ej = min(diss(j,k) for each k in selected - {i})
                    if diss(j,h) < Ej then
                        swap = swap + diss(j,h) - Dj     // j would follow h
                    else
                        swap = swap + Ej - Dj            // j would move to its 2nd-closest medoid
                else                                     // j is closer to h than to any medoid
                    swap = swap + diss(j,h) - Dj
            if swap < minswap then
                minswap = swap
                besti = i
                besth = h
    if minswap < 0 then
        selected = selected - {besti} + {besth}          // perform the best swap of i and h
until minswap >= 0
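
The next two slides show PAM output from R's cluster package; a call along these lines produces that kind of output (the data set used here is an assumption):

    library(cluster)
    pm <- pam(iris[, 1:4], k = 3)   # BUILD + SWAP
    pm$medoids                      # the K representative objects
    pm$clustering                   # cluster label for each instance
    plot(pm)                        # clusplot and silhouette plot
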

40
PAM Output (from R)
41
PAM Output (from R)
42
Silhouettes
  • These plots give an intuitive sense of how good
    the clustering is
  • Let diss(i,C) be the average dissimilarity
    between i and each element of cluster C
  • Let A be the cluster instance i is in
  • a(i) = diss(i,A)
  • Let B != A be the cluster such that diss(i,B) is
    minimized
  • b(i) = diss(i,B)
  • The silhouette number s(i) is
    s(i) = (b(i) - a(i)) / max(a(i), b(i))

43
Silhouettes
  • Let s(k) be the average silhouette width for k
    clusters
  • The silhouette coefficient of a data set is
    SC = max over k of s(k)
  • The k that maximizes this value is generally the
    best number of clusters to use (sketch below)
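
In R, silhouette() from the cluster package computes s(i) for every instance; averaging over i gives s(k). A sketch, with the data set assumed:

    library(cluster)
    d   <- dist(iris[, 1:4])
    pm  <- pam(d, k = 3)
    sil <- silhouette(pm)
    mean(sil[, "sil_width"])   # average silhouette width s(k) for k = 3
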

44
CLARA
  • Clustering LARge Applications
  • Modification of PAM
  • Draw a sample from the large data set
  • Sample size is typically 40 + 2k
  • Use PAM to find the medoids of the sample
  • Use those medoids to assign clusters to the entire data set
  • Can take multiple samples and choose the best
    resulting clustering (sketch below)
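
A sketch of the corresponding R call; clara() draws the samples, runs PAM on each, and keeps the best clustering (the data set is an assumption):

    library(cluster)
    k  <- 3
    cl <- clara(iris[, 1:4], k, samples = 5, sampsize = 40 + 2 * k)
    cl$medoids
    cl$clustering
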

45
Fuzzy Clustering (FANNY)
  • The previous partitioning methods are hard
  • Data can be in one and only one cluster
  • Instead of saying in or out, give a percentage of
    membership
  • This is the basis for fuzzy logic

46
Fuzzy Clustering (FANNY)
  • The algorithm is a bit too complex to cover
  • Idea: minimize the objective function
    sum over clusters v of
      [ sum over pairs i,j of uiv^2 * ujv^2 * diss(i,j) ] / [ 2 * sum over j of ujv^2 ]
  • where uiv is the unknown membership of object i
    in cluster v (R sketch below)
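
A sketch of the R call; fanny() returns the membership matrix directly (the data set and k are assumptions):

    library(cluster)
    fz <- fanny(iris[, 1:4], k = 2)
    head(fz$membership)   # uiv: membership of object i in cluster v
    fz$clustering         # hardened (nearest crisp) assignment, for comparison
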

47
FANNY Results
48
FANNY Results
49
Huh?
50
Whoa!
51
Hierarchical Methods
  • Top-down vs. bottom-up
  • Agglomerative Nesting (AGNES)
  • Divisive Analysis (DIANA)
  • BIRCH

52
Top-Down vs. Bottom-Up
  • Top-down or divisive approaches split the whole
    data set into smaller pieces
  • Bottom-up or agglomerative approaches combine
    individual elements

53
Agglomerative Nesting
  • Combine clusters until one cluster is obtained
  • Initially each cluster contains one object
  • At each step, merge the two most similar
    clusters (R sketch below)
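
The AGNES result slides that follow were presumably produced with agnes() from R's cluster package; a minimal sketch with an assumed data set (method = "average" is the default group-average linkage):

    library(cluster)
    ag <- agnes(iris[, 1:4], method = "average")
    plot(ag)                        # banner and dendrogram
    cutree(as.hclust(ag), k = 3)    # cut the tree into 3 clusters
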

54
Cluster Dissimilarities
[Figure: dissimilarity diss(i,j) between an object i in cluster R and an object j in cluster Q]
55
AGNES Results
56
AGNES Results
57
AGNES Modifications
  • The dissimilarity between clusters can be defined
    differently
  • Maximum dissimilarity between two objects
    (complete linkage)
  • Minimum dissimilarity between two objects
    (single linkage)
  • Centroid method (interval-scaled attributes)
  • Ward's method (interval-scaled attributes; based on
    the error sum of squares of a cluster)

58
Divisive Analysis (DIANA)
  • Calculate the diameter of each cluster Q
  • Select the cluster Q with the largest diameter
  • Split it into A and B
  • Select the object i in A that maximizes its average
    dissimilarity to the rest of A minus its average
    dissimilarity to B
  • Move i from A to B if that max value is > 0
    (R sketch below)
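
A matching R sketch using diana() from the cluster package (the data set is an assumption):

    library(cluster)
    di <- diana(iris[, 1:4])
    di$dc                           # divisive coefficient
    plot(di)                        # banner and dendrogram
    cutree(as.hclust(di), k = 3)
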

59
DIANA Results
60
DIANA Results
61
BIRCH
  • Balanced Iterative Reducing and Clustering Using
    Hierarchies
  • Mixes hierarchical clustering with other
    techniques
  • Useful for large data sets, because entire data
    is not kept in memory
  • Identifies and removes outliers (points seen as coming
    from a differing data distribution) during clustering
  • Presentation assumes continuous data

62
Two central concepts
  • A cluster feature (CF) is a triple summarizing
    information about a cluster: CF = (N, LS, SS),
    where N is the number of points in the cluster,
    LS is the linear sum of the data points, and SS is
    the square sum of the data points

63
Two central concepts
  • Contain enough information to calculate a variety
    of distance measures
  • Addition of CFs accurately represents CF for
    merged clusters
  • Memory and time efficient to maintain
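
A small R sketch of the CF triple and the additivity property mentioned above; cf() and merge_cf() are hypothetical helpers for illustration, not BIRCH code:

    cf <- function(pts) list(N  = nrow(pts),
                             LS = colSums(pts),   # linear sum of the points
                             SS = sum(pts^2))     # square sum of the points
    merge_cf <- function(a, b) list(N = a$N + b$N, LS = a$LS + b$LS, SS = a$SS + b$SS)

    p1 <- matrix(c(1, 2,  2, 3), ncol = 2, byrow = TRUE)
    p2 <- matrix(c(8, 8,  9, 7), ncol = 2, byrow = TRUE)
    identical(merge_cf(cf(p1), cf(p2)), cf(rbind(p1, p2)))   # TRUE: CFs simply add
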

64
Two Central Concepts
  • A CF tree is a height balanced tree with two
    parameters, branching factor B, and threshold T.

[Figure: a CF tree - a root node with entries CF1, CF2, CF3, ..., CFn; level-1 nodes with entries CF11, CF12, CF13, ..., CF1n; leaf entries pointing to the clusters]
65
Phase 1 Build CF tree
  • The CF tree is kept in memory and created
    dynamically
  • Identify the appropriate leaf: recursively descend the
    tree, following the closest child node
  • Modify the leaf: when the leaf is reached, add the new
    data item x to it; if the leaf then contains more than L
    entries, split the leaf (entries must also satisfy T)

66
Phase 1 Build CF tree
  • Steps continued
  • Modify the path to the leaf: update the CF of the parent;
    if the leaf was split, add a new entry to the parent; if
    the parent then violates B, split the parent node; update
    the parent nodes recursively
  • Merging refinement: find the non-leaf node Nj where
    splitting stopped, find the two closest entries in Nj, and
    if they are not the pair created by the split, merge them

67
Phase 1 Comments
  • The parameters B and L are a function of the page
    size P
  • Splits are caused by P, not by data distribution
  • Hence, refinement step
  • Increasing T makes a smaller tree, but can hide
    outliers
  • Change T if memory runs out. (Phase 2)

68
Phase 3 Global Clustering
  • Apply some global clustering technique to the
    leaf clusters in the CF tree
  • Fast, because everything is in memory
  • Accurate, because outliers removed, data
    represented at a level allowable by memory
  • Less order dependent, because leaves have data
    locality

69
Phase 4 Cluster Refinement
  • Use centroids of the clusters found in Phase 3
  • Identify centroid C closest to data point
  • Place data point in cluster represented by C

70
Probabilistic Methods
  • These will be delayed until we cover
    probabilistic methods in general
  • Mixture Models
  • COBWEB

71
Applications of Clustering
  • Gene function identification
  • Document clustering
  • Modeling economic data

72
Gene Function Identification
  • Genome is the blueprint defining an organism
    (DNA)
  • Genes are inherited portions of the genome
    related to biological functions
  • Proteins, non-coding RNA
  • Given the collection of biological information,
    try to predict or identify the function of genes
    with unknown function

73
Gene Expression
  • A gene is expressed if it is actively being
    transcribed (copied into RNA)
  • Rate of expression is related to the rate of
    transcription
  • Microarray experiments

74
Gene Expression Data
[Figure: matrix of gene expression measurements, with clones along one axis and experiments along the other]
75
Clustering Gene Expression Data
  • Identify genes with similar expression profiles
  • Use clustering
  • Identify function of known genes in a cluster
  • Assign that function to genes of unknown function
    in the same cluster

76
Clustering Gene Expression Data
Yeast Genome
77
Document Classification
  • Represent documents as vectors in a vector space
  • Latent Semantic Indexing
  • Cluster documents in this representation
  • Describe/summarize/evaluate the clusters
  • Label clusters with meaningful descriptions

78
Document transformation
  • Convert each document into table form
  • Attributes are important words
  • Value is the number of times the word appears
    (sketch below)
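
A small sketch of that transformation in R, done by hand on two made-up documents:

    docs  <- c("cluster the data", "cluster the cluster output")
    words <- strsplit(docs, " ")
    vocab <- sort(unique(unlist(words)))
    counts <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
    counts   # rows = documents, columns = words, values = occurrence counts
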

79
Document classification
  • Could select a word as the classification label
  • Identify it after clustering
  • Look at the medoid or centroid and examine its
    characteristics
  • Look at the number of times certain words appear
  • Look at which words appear together
  • Look at words that don't appear at all

80
Document Classification
  • Once clusters are identified, label each document
    with its cluster label
  • Use a classification technique to identify
    cluster relationships
  • Decision trees, for example
  • Other kinds of rule induction

81
Economic Modelling
  • Nettleton et al. (2000)
  • Objective is to identify how to make the Port of
    Barcelona the principal port of entry for
    merchandise.
  • Statistics, clustering, outlier analysis,
    categorization of continuous values

82
Data
  • Vessel specific
  • Date, type, origin, destination, metric tons
    loaded/unloaded, amount invoiced, quality
    measure, etc.
  • Economic indicators
  • Consumer price index, date (monthly), Industrial
    production index, etc.

83
Data Transformation
  • Data aggregated into 4 relational tables
  • Total monthly visits, etc.
  • Joined based on date
  • Separated into training (1997-1998) and test sets
    (1999)

84
Data Exploration
  • Used clustering with the Condorcet criterion
  • IBM Intelligent Miner
  • Identified relevant features
  • Production import volume > 400,000 MT for product 1
  • Area import volume > 250,000 MT for area 10
  • Then used rule induction to characterize clusters

85
Cluster Analysis Summarization
  • Unsupervised technique
  • Similarity-based
  • Can modify definition of similarity to meet needs
  • Usually combined with some other descriptive
    technique
  • Decision trees, rule induction, etc.