Title: Cluster Analysis
1 Cluster Analysis
Chapter 7
2 The Course
[Course architecture diagram showing: data sources (DS), a staging database (DP), the data warehouse (DW) with OLAP and the star schema, and data mining (DM) with association, classification, and clustering.
Legend: DS = data source, DW = data warehouse, DM = data mining, DP = staging database]
3 Chapter Outline
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning
- Hierarchical
- Density-Based
- Grid-Based (skip)
- Model-Based (skip)
- Outlier Analysis
4 What is Clustering?
- Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing
- Organizing data into classes such that there is
- high intra-class similarity
- low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification)
- More informally, finding natural groupings among objects
5 What is Cluster Analysis?
- Cluster: a collection of data objects (records)
- (Intraclass similarity) Objects are similar to objects in the same cluster
- (Interclass dissimilarity) Objects are dissimilar to objects in other clusters
- Cluster analysis
- A statistical method for grouping a set of data objects into clusters
- A good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity
- Clustering is unsupervised classification
- More informally, finding natural groupings among objects
6 What is Cluster Analysis?
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Scatter-plot illustration over the attributes Salary, Age, and Job Title]
7 Typical Clustering Applications
- As a stand-alone tool to
- get insight into the data distribution
- find the characteristics of each cluster
- assign a new example to a cluster
- As a preprocessing step for other algorithms
- e.g. numerosity reduction, using cluster centers to represent the data in each cluster (see the example on the next slide)
- As a building block for many data mining solutions
8 Clustering Example: Fitting the Troops
- Fitting the troops: re-design of uniforms for soldiers
- Goal: reduce the number of uniform sizes kept in inventory while still providing a good fit
- Researchers from Cornell University used clustering to design a new set of sizes
- Traditional clothing size system: an ordered set of graduated sizes where all dimensions increase together
- The new system: sizes that fit body types
- E.g. one size for short-legged, small-waisted women with wide and long torsos, average arms, broad shoulders, and skinny necks
9 Other Examples of Clustering Applications
- Marketing
- help discover distinct groups of customers, then use this knowledge to develop targeted marketing programs
- Biology
- derive plant and animal taxonomies
- find genes with similar function
- Land use
- identify areas of similar land use in an earth observation database
- Insurance
- identify groups of motor insurance policy holders with a high average claim cost
- City-planning
- identify groups of houses according to their house type, value, and geographical location
10 Requirements of Clustering
- Scalability
- Ability to deal with various types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge
- Can deal with noise and outliers
- Insensitive to the order of input records
- Can handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
11 What is Cluster Analysis?
- What is good clustering?
- High within-class similarity and low between-class similarity
- The ability to discover some or all of the hidden patterns
[Illustration: clustered points with one point marked as an outlier]
12 How do we measure similarity or dissimilarity?
[Illustration: the distance between the names Peter and Piotr comes out as 0.23, 3, or 342.7 depending on the measure used]
13 Data Representation
- Data matrix
- n objects with p attributes
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Dissimilarity matrix
- d(i,j): the dissimilarity between objects i and j
14 Data Representation
- Properties
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)
15 Types of Data in Cluster Analysis
- Interval-Scaled Attributes: continuous measurements on a roughly linear scale. Examples: weight, temperature, income, etc.
- Binary Attributes
- Nominal Attributes: categorical values where order has no meaning. Examples: color, gender
- Ordinal Attributes: categorical values where order has meaning. Example: rank
- Ratio-Scaled Attributes: continuous measurements on a nonlinear scale, approximately exponential
- Mixed Attributes: a combination of the above data types
16 Dissimilarity of Interval-Scaled Values
- Step 1: Standardize the data
- To ensure that all attributes have equal weight
- To bring different scales onto a uniform, single scale
- Not always needed! Sometimes we want unequal weights for some attributes
- Step 2: Compute the dissimilarity between records
- Use the Euclidean, Manhattan, or Minkowski distance
17 Distance Metrics
- Minkowski: d(i,j) = (Σ_f |x_if - x_jf|^q)^(1/q)
- Manhattan (q = 1): d(i,j) = Σ_f |x_if - x_jf|
- Euclidean (q = 2): d(i,j) = sqrt(Σ_f (x_if - x_jf)²)
- Weighted: d(i,j) = (Σ_f w_f |x_if - x_jf|^q)^(1/q)
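To make the formulas above concrete, here is a minimal sketch in Python/NumPy; the helper name and the toy vectors are illustrative, not part of the slides.

import numpy as np

def minkowski(x, y, q=2, w=None):
    # Minkowski distance of order q: q=1 gives Manhattan, q=2 gives Euclidean.
    # An optional weight vector w gives the weighted variant.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return (w * np.abs(x - y) ** q).sum() ** (1.0 / q)

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, q=1))               # Manhattan: 7.0
print(minkowski(a, b, q=2))               # Euclidean: 5.0
print(minkowski(a, b, q=2, w=[1, 4, 1]))  # weighted Euclidean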
18 Dissimilarity between Binary Variables
- Method: a contingency table for binary data, counting for objects i and j the attributes where both are 1 (a), only i is 1 (b), only j is 1 (c), and both are 0 (d)
- If the binary variable is symmetric: d(i,j) = (b + c) / (a + b + c + d)
- If the binary variable is asymmetric: d(i,j) = (b + c) / (a + b + c)
19 Dissimilarity between Binary Variables: Example
- gender is a symmetric binary attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value N be set to 0
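A small sketch of computing both coefficients from 0/1 vectors: the simple matching distance for symmetric attributes and the Jaccard distance for asymmetric ones. The two patient records below are made-up stand-ins for the slide's example table.

import numpy as np

def binary_dissimilarity(x, y, symmetric=True):
    # Contingency counts: a = both 1, b = only x is 1, c = only y is 1, d = both 0.
    x, y = np.asarray(x), np.asarray(y)
    a = int(((x == 1) & (y == 1)).sum())
    b = int(((x == 1) & (y == 0)).sum())
    c = int(((x == 0) & (y == 1)).sum())
    d = int(((x == 0) & (y == 0)).sum())
    if symmetric:                        # simple matching distance
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)         # Jaccard distance: ignores 0/0 matches

rec_i = [1, 0, 1, 0, 0, 0]   # Y/P coded as 1, N coded as 0 (values assumed)
rec_j = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(rec_i, rec_j, symmetric=False))  # 1/3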
20 Dissimilarity Between Nominal Attributes
- A generalization of the binary attribute in that it can take more than 2 states, e.g. red, yellow, blue, green
- Method 1: simple matching
- d(i,j) = (p - m) / p, where m = the number of attributes that are the same for both records and p = the total number of attributes
- Method 2: rewrite the database and create a new binary attribute for each of the M states
- For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
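A brief sketch of Method 1 (simple matching) on nominal records; the attribute values are made up for illustration.

def nominal_dissimilarity(rec_i, rec_j):
    # d(i, j) = (p - m) / p, with m = number of matching attributes, p = total attributes
    p = len(rec_i)
    m = sum(1 for a, b in zip(rec_i, rec_j) if a == b)
    return (p - m) / p

print(nominal_dissimilarity(["red", "M", "tall"], ["blue", "M", "tall"]))  # 1/3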
21 Dissimilarity Between Ordinal Attributes
- An ordinal attribute can be discrete or continuous
- Order is important (e.g. rank)
- Can be treated like interval-scaled attributes:
- replace x_if by its rank r_if in {1, ..., M_f}
- map the range of each attribute onto [0, 1] by replacing the value of the i-th object in the f-th attribute by z_if = (r_if - 1) / (M_f - 1)
- compute the dissimilarity using methods for interval-scaled attributes
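A minimal sketch of the rank-normalization step described above; the helper name and example ranks are illustrative only.

def normalize_ranks(ranks):
    # Map ranks 1..M onto [0, 1] via z = (r - 1) / (M - 1).
    M = max(ranks)
    return [(r - 1) / (M - 1) for r in ranks]

# e.g. bronze = 1, silver = 2, gold = 3 for four objects
print(normalize_ranks([1, 3, 2, 3]))   # [0.0, 1.0, 0.5, 1.0]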
22 Dissimilarity Between Ratio-Scaled Attributes
- Ratio-scaled attribute: a positive measurement on a nonlinear scale, approximately exponential, such as A·e^(Bt) or A·e^(-Bt)
- Methods
- treat them like interval-scaled attributes: not a good choice, because the scale may be distorted
- apply a logarithmic transformation: y_if = log(x_if)
- treat them as continuous ordinal data and treat their ranks as interval-scaled.
23 Dissimilarity Between Attributes of Mixed Types
- A database may contain all six types of attributes
- symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
- Use a weighted formula to combine their effects:
  d(i,j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)
- if f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
- if f is interval-based: use the normalized distance
- if f is ordinal or ratio-scaled:
- compute the ranks r_if and z_if = (r_if - 1) / (M_f - 1)
- and treat z_if as interval-scaled
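The weighted combination can be sketched as follows (a Gower-style mix in which δ_ij^(f) = 1 only when both values are present); the attribute typing, parameter names, and data are assumptions for illustration.

def mixed_dissimilarity(x, y, types, ranges, max_ranks):
    # d(i,j) = sum_f delta_f * d_f / sum_f delta_f over the non-missing attributes
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        if x[f] is None or y[f] is None:          # delta_f = 0 when a value is missing
            continue
        if t in ("binary", "nominal"):
            d = 0.0 if x[f] == y[f] else 1.0
        elif t == "interval":
            d = abs(x[f] - y[f]) / ranges[f]      # normalized distance
        else:                                     # ordinal (or rank of a ratio attribute)
            zx = (x[f] - 1) / (max_ranks[f] - 1)
            zy = (y[f] - 1) / (max_ranks[f] - 1)
            d = abs(zx - zy)
        num, den = num + d, den + 1.0
    return num / den

# one nominal, one interval attribute (range 50), one ordinal attribute with 3 ranks
x, y = ["red", 30.0, 2], ["blue", 45.0, 3]
print(mixed_dissimilarity(x, y, ["nominal", "interval", "ordinal"], {1: 50.0}, {2: 3}))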
24 Major Clustering Approaches
- Partitioning approach
- Partitions the objects and then evaluates the partitions by some criterion, e.g. minimizing the sum of squared errors. Examples: k-means, k-medoids
- Hierarchical approach
- Creates a hierarchical clustering of the set of data (or objects) using some criterion. Examples: DIANA, AGNES, BIRCH, ROCK, and CHAMELEON
- Density-based approach
- Clusters objects based on connectivity and density functions. Examples: DBSCAN, OPTICS
- Grid-Based (skip)
- Model-Based (skip)
25 Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods
- k-means: each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
26 K-Means
27 The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps (sketched in code after the next slide):
- 1. Partition the objects into k non-empty subsets (or pick k initial centroids)
- 2. Compute the centroid (mean point) of each cluster of the current partition
- 3. Assign each object to the cluster with the nearest centroid
- 4. Go back to Step 2; stop when the assignments no longer change
28 Stopping/Convergence Criterion
- no (or minimal) re-assignment of data points to different clusters,
- no (or minimal) change of the centroids, or
- minimal decrease in the sum of squared error (SSE):

  SSE = Σ_{j=1..k} Σ_{x in Cj} dist(x, mj)²        (1)

where Cj is the j-th cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
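A compact illustration of the four steps and of criterion (1) in Python/NumPy; the random initialization and the absence of empty-cluster handling are simplifications, not part of the slides.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1: pick k initial centroids
    for _ in range(n_iter):
        # step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # step 4: assignments stable
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()                  # criterion (1)
    return labels, centroids, sse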
29 3-Means Example, Step 1
30 3-Means Example, Step 2
31 3-Means Example, Step 3
32 3-Means Example, Step 4
33 3-Means Example, Step 5
34 3-Means Example 2, Step 6
35 2-Means Example
- For simplicity, one-dimensional objects and k = 2.
- Objects: 1, 2, 5, 6, 7
- K-means:
- Randomly select 5 and 6 as the initial centroids
- => two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3, mean(C2) = 6.5
- => {1, 2} and {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
- => no change.
- Aggregate dissimilarity (SSE): 0.5² + 0.5² + 1² + 1² = 2.5
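For illustration, the same toy run with scikit-learn (assuming it is available), starting from the same initial centroids 5 and 6:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [5.0], [6.0], [7.0]])
init = np.array([[5.0], [6.0]])
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 0 1 1 1]  -> clusters {1, 2} and {5, 6, 7}
print(km.cluster_centers_)  # [[1.5], [6.0]]
print(km.inertia_)          # 2.5, the aggregate dissimilarity (SSE)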
36 Comments on the K-Means Method
- Strengths of k-means
- Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally k, t << n.
- Often terminates at a local optimum.
- Weaknesses of k-means
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers.
- Not suitable for discovering clusters with non-convex shapes.
37 Variations of the K-Means Method
- A few variants of k-means differ in
- Selection of the initial k means.
- Dissimilarity calculations.
- Strategies to calculate cluster means.
- Handling categorical data: k-modes (Huang, 1998)
- Replacing means of clusters with modes.
- Using new dissimilarity measures to deal with categorical objects.
- Using a frequency-based method to update modes of clusters.
- A mixture of categorical and numerical data: the k-prototype method.
38 K-Medoid
39 Medoid: Definition
- A medoid is an actual point in the dataset that is centrally located and is therefore representative of the cluster.
40 k-Medoid Methods
- There are three best-known k-medoid methods:
- PAM (Partitioning Around Medoids)
- CLARA (Clustering LARge Applications)
- CLARANS (Clustering Large Applications based upon RANdomized Search)
41 PAM
- Arbitrarily choose k objects as the initial medoids
- Until no change, do
- (Re)assign each object to the cluster of its nearest medoid
- Randomly select a non-medoid object o_random and compute the total cost E of swapping a medoid o with o_random
- If E < 0, swap o with o_random to form the new set of k medoids
42 Swapping Cost
- Measures whether o_random is better than the current medoid o
- Uses the squared-error criterion
- Compute ΔE = E(o_random) - E(o), the change in total error after the swap
- A negative ΔE means the swap brings a benefit (see the sketch below)
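A simplified sketch of the PAM swap loop, assuming a precomputed matrix D of pairwise (squared) dissimilarities; the helper names and the exhaustive swap search are illustrative, not the exact procedure on the slides.

import numpy as np

def total_cost(D, medoids):
    # Sum, over all objects, of the dissimilarity to the nearest medoid.
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))
    best = total_cost(D, medoids)
    improved = True
    while improved:                         # until no change
        improved = False
        for m_idx in range(k):
            for o in range(n):              # try swapping medoid m_idx with non-medoid o
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m_idx] = o
                cost = total_cost(D, candidate)
                if cost < best:             # delta E < 0: keep the swap
                    medoids, best, improved = candidate, cost, True
    labels = D[:, medoids].argmin(axis=1)   # assign each object to its nearest medoid
    return medoids, labels, best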
43 PAM Example
- k = 2
- Arbitrarily choose k objects as the initial medoids
- Assign each remaining object to its nearest medoid
- Do loop, until no change:
- Randomly select a non-medoid object O_random
- Compute the total cost of swapping
- Swap O with O_random if the quality is improved.
44 Pros and Cons of PAM
- PAM is more robust than k-means in the presence of noise and outliers
- Medoids are less influenced by outliers
- PAM is efficient for small data sets but does not scale well to large data sets
- O(k(n - k)²) per iteration
- Sampling-based method: CLARA
45 CLARA (Clustering LARge Applications)
- Draw multiple samples of the data set, apply PAM on each sample, and return the best clustering
- Performs better than PAM on larger data sets
- Efficiency depends on the sample size
- A good clustering of the samples may not be a good clustering of the whole data set
46 CLARANS (Clustering Large Applications based upon RANdomized Search)
- The problem space: a graph of clusterings
- A vertex is a choice of k medoids from the n objects, so there are C(n, k) vertices in total
- PAM searches the whole graph
- CLARA searches some random sub-graphs
- CLARANS climbs mountains (hill-climbing)
- Randomly sample a set and select k medoids
- Consider neighbors of the current medoids as candidates for new medoids
- Use the sample set to verify
- Repeat multiple times to avoid bad samples
47 Algorithm CLARANS
- 1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
- 2. Set current to an arbitrary node in G(n, k).
- 3. Set j to 1.
- 4. Consider a random neighbor S of current, and based on Equation (5) calculate the cost differential of the two nodes.
- 5. If S has a lower cost, set current to S, and go to Step (3).
- 6. Otherwise, increment j by 1. If j ≤ maxneighbor, ...
48
- Requires arbitrary objects and a distance function
- Medoid m_C: the representative object in a cluster C
- Measure for the compactness of a cluster C:
  TD(C) = Σ_{p in C} dist(p, m_C)
- Measure for the compactness of a clustering:
  TD = Σ_{i=1..k} TD(C_i)
49
- CLARA (Kaufmann and Rousseeuw, 1990)
- Additional parameter: numlocal
- Draws numlocal samples of the data set
- Applies PAM on each sample
- Returns the best of these sets of medoids as output
- CLARANS (Ng and Han, 1994)
- Two additional parameters: maxneighbor and numlocal
- At most maxneighbor pairs (medoid M, non-medoid N) are evaluated in the algorithm.
- The first pair (M, N) for which TD(N↔M) is smaller than TD(current) is swapped (instead of the pair with the minimal value of TD(N↔M))
- Finding the local minimum with this procedure is repeated numlocal times.
- Efficiency: runtime(CLARANS) < runtime(CLARA) < runtime(PAM)
50 CLARANS(objects DB, Integer k, distance function dist, Integer numlocal, Integer maxneighbor)
  for r from 1 to numlocal do
    randomly select k objects as medoids; i := 0
    while i < maxneighbor do
      randomly select a pair (medoid M, non-medoid N)
      compute changeOfTD := TD(N↔M) - TD
      if changeOfTD < 0 then
        substitute M by N; TD := TD(N↔M); i := 0
      else
        i := i + 1
    if TD < TD_best then
      TD_best := TD; store the current medoids
  return the stored medoids
51 Hierarchical Methods
52 Hierarchical Clustering
- Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
53 AGNES (Agglomerative Nesting)
- Agglomerative, Bottom-up approach
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster
54 DIANA (Divisive Analysis)
- Top-down approach
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
55 A Dendrogram
- Shows how the clusters are merged hierarchically
- Decomposes the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
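For illustration, an agglomerative (AGNES-style) run with SciPy, assuming it is available; cutting the dendrogram at a chosen level yields the flat clusters. The toy points are made up.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 9.0]])
Z = linkage(X, method="average")                  # repeatedly merge the least-dissimilar clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)                                     # e.g. [1 1 2 2 3]
# dendrogram(Z)  # draws the merge tree when a plotting backend is available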
56 Recent Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods
- do not scale well: time complexity of at least O(n²), where n is the total number of objects
- can never undo what was done previously
- Integration of hierarchical with distance-based clustering
- BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
- ROCK: clusters categorical data by neighbor and link analysis
- CHAMELEON: hierarchical clustering using dynamic modeling
57 BIRCH
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
- Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
58 Clustering Feature Vector in BIRCH
- For the five points (3,4), (2,6), (4,5), (4,7), (3,8):
  CF = (N, LS, SS) = (5, (16, 30), (54, 190))
  where N is the number of points, LS the linear sum and SS the square sum of the points
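A tiny sketch (with an illustrative helper name) that computes the CF triple for the five points above and reproduces (5, (16, 30), (54, 190)).

import numpy as np

def clustering_feature(points):
    # CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum.
    P = np.asarray(points, dtype=float)
    return len(P), tuple(P.sum(axis=0)), tuple((P ** 2).sum(axis=0))

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(points))   # (5, (16.0, 30.0), (54.0, 190.0))
# CF vectors are additive (the CF of a merged cluster is the sum of the parts),
# which is what lets BIRCH maintain them incrementally.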
59 CF-Tree in BIRCH
- Clustering feature
- a summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the subcluster from the statistical point of view
- registers the crucial measurements for computing clusters and utilizes storage efficiently
- A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
- A non-leaf node in the tree has descendants or children
- The non-leaf nodes store the sums of the CFs of their children
- A CF-tree has two parameters
- Branching factor: specifies the maximum number of children
- Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
60 The CF-Tree Structure
[Diagram: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root and the non-leaf nodes hold CF entries (CF1, CF2, CF3, ..., CF5), each with a pointer to a child node; the leaf nodes hold CF entries (CF1, CF2, ..., CF6) and are chained together by prev/next pointers.]
61 BIRCH
- Strengths
- Scales linearly: finds a good clustering with a single scan
- Improves the quality with a few additional scans
- Weaknesses
- Handles only numeric data
- No natural clustering, due to the specification of the branching factor
- Clusters are of spherical shape
62 Density-Based Clustering
63 Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as a termination condition
- Several interesting studies
- DBSCAN
- OPTICS (if there is time)
64 DBSCAN
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm.
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- A point is a core point if it has more than a specified number of points (MinPts) within a specified radius (Eps)
- These are the points in the interior of a cluster
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A noise point is any point that is neither a core point nor a border point.
65 DBSCAN: Core, Border, and Noise Points
66 DBSCAN: The Algorithm
- Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed.
- Note: a point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each point in the chain lies within Eps of the previous one and p1, ..., pn-1 are all core points.
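An illustrative run of the algorithm with scikit-learn's DBSCAN (assuming it is available); labels_ marks noise points as -1 and core_sample_indices_ identifies the core points, so border points are the remaining clustered points. The synthetic data are made up.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2.0, high=7.0, size=(10, 2))
X = np.vstack([blob1, blob2, noise])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
labels = db.labels_                              # -1 means noise
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True
border = (labels != -1) & ~core
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("core:", int(core.sum()), "border:", int(border.sum()), "noise:", int((labels == -1).sum()))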
67 DBSCAN: Core, Border and Noise Points
[Figure: the original points and their point types (core, border, noise) for Eps = 10, MinPts = 4]
68 When DBSCAN Works Well
[Figure: the original points and the clusters found]
- Resistant to noise
- Can handle clusters of different shapes and sizes
69 When DBSCAN Does NOT Work Well
[Figure: the original points clustered with MinPts = 4, Eps = 9.75 and with MinPts = 4, Eps = 9.92]
- Varying densities
- High-dimensional data
70 End