Cluster Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Cluster Analysis


1
Cluster Analysis
Chapter 7
2
- The Course
[Course overview diagram showing data sources (DS) feeding the staging database (DP) and the data warehouse (DW), OLAP with a star schema, and the data mining (DM) topics: association, classification, and clustering.]
Legend: DS = data source, DW = data warehouse, DM = data mining, DP = staging database
3
Chapter Outline
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning
  • Hierarchical
  • Density-Based
  • Grid-Based (skip)
  • Model-Based (skip)
  • Outlier Analysis

4
What is Clustering?
  • Also called unsupervised learning; sometimes
    called classification by statisticians, sorting
    by psychologists, and segmentation by people in
    marketing
  • Organizing data into classes such that there is
  • high intra-class similarity
  • low inter-class similarity
  • Finding the class labels and the number of
    classes directly from the data (in contrast to
    classification).
  • More informally, finding natural groupings among
    objects

5
What is Cluster Analysis?
  • Cluster: a collection of data objects (records)
  • (Intraclass similarity) - Objects are similar to
    objects in the same cluster
  • (Interclass dissimilarity) - Objects are
    dissimilar to objects in other clusters
  • Cluster analysis
  • Statistical method for grouping a set of data
    objects into clusters
  • A good clustering method produces high quality
    clusters with high intraclass similarity and low
    interclass similarity
  • Clustering is unsupervised classification
  • More informally, finding natural groupings among
    objects

6
What is Cluster Analysis?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

[Figure: data objects plotted by salary, age, and job title, grouped into clusters.]
7
Typical Clustering Applications
  • As a stand-alone tool to
  • get insight into data distribution
  • find the characteristics of each cluster
  • assign new examples to clusters
  • As a preprocessing step for other algorithms
  • e.g. numerosity reduction using cluster centers
    to represent data in clusters. (See the example
    in the next slide.)
  • It is a building block for many data mining
    solutions

8
Clustering Example: Fitting the Troops
  • Fitting the troops: re-design of uniforms for
    soldiers
  • Goal: reduce the number of uniform sizes to be
    kept in inventory while still providing good fit
  • Researchers from Cornell University used
    clustering and designed a new set of sizes
  • Traditional clothing size system: an ordered set
    of graduated sizes where all dimensions increase
    together
  • The new system: sizes that fit body types
  • E.g. one size for short-legged, small-waisted
    women with wide and long torsos, average arms,
    broad shoulders, and skinny necks

9
Other Examples of Clustering Applications
  • Marketing
  • help discover distinct groups of customers, and
    then use this knowledge to develop targeted
    marketing programs
  • Biology
  • derive plant and animal taxonomies
  • find genes with similar function
  • Land use
  • identify areas of similar land use in an earth
    observation database
  • Insurance
  • identify groups of motor insurance policy holders
    with a high average claim cost
  • City-planning
  • identify groups of houses according to their
    house type, value, and geographical location

10
Requirements of Clustering
  • Scalability
  • Ability to deal with various types of attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge
  • Can deal with noise and outliers
  • Insensitive to the order of input records
  • Can handle high dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

11
What is Cluster Analysis?
  • What is a good clustering?
  • High within-class similarity and low
    between-class similarity
  • The ability to discover some or all of the
    hidden patterns

[Figure: sample clusters with one point marked as an outlier.]
12
How do we measure similarity or dissimilarity?
[Figure: the names "Peter" and "Piotr" with three candidate distance values, 0.23, 3, and 342.7, illustrating that the numeric dissimilarity depends on the measure chosen.]
13
Data Representation
  • Data matrix
  • N objects with p attributes
  • Distances are normally used to measure the
    similarity or dissimilarity between two data
    objects
  • Dissimilarity matrix
  • d(i,j) = dissimilarity between i and j
    (the matrix forms are written out below)
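
The matrices themselves appear only as images in the original slides; in the usual textbook notation (a reconstruction, not taken from the transcript) they are:

\text{Data matrix ($n$ objects, $p$ attributes):}\quad
\begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix}

\text{Dissimilarity matrix ($n \times n$, lower triangular):}\quad
\begin{pmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{pmatrix}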

14
Data Representation
  • Properties
  • d(i,j) ≥ 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) ≤ d(i,k) + d(k,j)

15
Types of Data in Cluster Analysis
  • Interval-Scaled Attributes: continuous
    measurements on a roughly linear scale. Examples:
    weight, temperature, income, etc.
  • Binary Attributes
  • Nominal Attributes: categorical values where
    order has no meaning. Examples: color, gender
  • Ordinal Attributes: categorical values where
    order has meaning. Example: rank
  • Ratio-Scaled Attributes: continuous measurements
    on a nonlinear scale, approximately exponential
  • Mixed Attributes: a combination of the above
    data types

16
Dissimilarity of Interval-Scaled Values
  • Step 1: Standardize the data (see the sketch
    below)
  • To ensure the attributes all have equal weight
  • To map different scales onto a uniform,
    single scale
  • Not always needed! Sometimes we require unequal
    weights for an attribute
  • Step 2: Compute dissimilarity between records
  • Use Euclidean, Manhattan or Minkowski distance
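
A minimal sketch of the standardization step in Python (an illustration, not from the slides), using the mean absolute deviation, the variant usually preferred for clustering because it dampens the effect of outliers:

import numpy as np

def standardize(X):
    """Standardize each column (attribute) of X using the mean
    absolute deviation instead of the standard deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # per-attribute mean
    s = np.abs(X - m).mean(axis=0)      # mean absolute deviation
    s[s == 0] = 1.0                     # avoid division by zero for constant columns
    return (X - m) / s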

17
Distance Metrics
  • Minkowski
  • Manhattan
  • Euclidean
  • Weighted (these forms are written out below)
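
The formulas behind these names are only images in the original slides; in standard notation (a reconstruction) they are, for objects i and j with p attributes:

d_{\text{Minkowski}}(i,j) = \Big( \sum_{f=1}^{p} |x_{if} - x_{jf}|^{q} \Big)^{1/q}

d_{\text{Manhattan}}(i,j) = \sum_{f=1}^{p} |x_{if} - x_{jf}| \qquad (q = 1)

d_{\text{Euclidean}}(i,j) = \sqrt{ \sum_{f=1}^{p} (x_{if} - x_{jf})^{2} } \qquad (q = 2)

d_{\text{Weighted}}(i,j) = \sqrt{ \sum_{f=1}^{p} w_{f}\,(x_{if} - x_{jf})^{2} }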

18
Dissimilarity between Binary Variables
  • Method: build a contingency table for binary data
  • If the binary variable is symmetric
  • If the binary variable is asymmetric
    (both coefficients are written out below)

[2x2 contingency table of object i against object j]
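
With q = number of attributes equal to 1 for both objects, r = 1 for i and 0 for j, s = 0 for i and 1 for j, and t = 0 for both, the usual coefficients (reconstructed, since only the table header survives in the transcript) are:

d_{\text{symmetric}}(i,j) = \frac{r + s}{q + r + s + t}
\qquad
d_{\text{asymmetric}}(i,j) = \frac{r + s}{q + r + s}

The asymmetric form ignores the negative matches t, i.e. it is a Jaccard-style dissimilarity.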
19
Dissimilarity between Binary Variables Example
  • gender is a symmetric binary attribute
  • the remaining attributes are asymmetric binary
  • let the values Y and P be set to 1, and the value
    N be set to 0 (a small computational sketch
    follows below)
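
The data table for this example is not in the transcript, so the vectors below are hypothetical, but they show how the asymmetric (Jaccard-style) dissimilarity from the previous slide would be computed:

import numpy as np

def asymmetric_binary_dissimilarity(a, b):
    """Jaccard-style dissimilarity: negative matches (0/0) are ignored."""
    a, b = np.asarray(a), np.asarray(b)
    mismatches = np.sum(a != b)                 # r + s
    pos_matches = np.sum((a == 1) & (b == 1))   # q
    return mismatches / (pos_matches + mismatches)

# Hypothetical records with Y/P encoded as 1 and N as 0
patient_1 = [1, 0, 1, 0, 0, 0]
patient_2 = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary_dissimilarity(patient_1, patient_2))  # 0.333...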

20
Dissimilarity Between Nominal Attributes
  • A generalization of the binary attribute in that
    it can take more than 2 states, e.g., red,
    yellow, blue, green
  • Method 1: simple matching (the formula is written
    out below)
  • m = # of attributes that match for both
    records, p = total # of attributes
  • Method 2: rewrite the database and create a new
    binary attribute for each of the m states
  • For an object with color yellow, the yellow
    attribute is set to 1, while the remaining
    attributes are set to 0.
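
Written out (a reconstruction of the formula that appears only as an image on the slide), simple matching gives

d(i,j) = \frac{p - m}{p}

where m is the number of matching attributes and p the total number of attributes.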

21
Dissimilarity Between Ordinal Attributes
  • An ordinal attribute can be discrete or
    continuous
  • Order is important (e.g. rank)
  • Can be treated like interval-scaled attributes
  • replace x_if by its rank r_if
  • map the range of each attribute onto [0, 1] by
    replacing the i-th object in the f-th attribute
    with the normalized rank given below
  • compute the dissimilarity using methods for
    interval-scaled attributes
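
The normalization referred to above (reconstructed from the standard treatment) maps the rank r_{if} ∈ {1, ..., M_f} of object i on attribute f to

z_{if} = \frac{r_{if} - 1}{M_{f} - 1}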

22
Dissimilarity Between Ratio-Scaled Attributes
  • Ratio-scaled attribute: a positive measurement on
    a nonlinear scale, approximately exponential,
    such as Ae^(Bt) or Ae^(-Bt)
  • Methods
  • treat them like interval-scaled attributes - not
    a good choice because the scale may be distorted
  • apply a logarithmic transformation,
    y_if = log(x_if)
  • treat them as continuous ordinal data and treat
    their ranks as interval-scaled

23
Dissimilarity Between Attributes of Mixed Types
  • A database may contain all the six types of
    attributes
  • symmetric binary, asymmetric binary, nominal,
    ordinal, interval and ratio.
  • Use a weighted formula to combine their effects.
  • f is binary or nominal: d_ij(f) = 0 if x_if = x_jf,
    otherwise d_ij(f) = 1
  • f is interval-based: use the normalized distance
  • f is ordinal or ratio-scaled
  • compute the rank r_if and
    z_if = (r_if - 1) / (M_f - 1)
  • and treat z_if as interval-scaled (the combined
    weighted formula is written out below)
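
The weighted combination itself is only an image on the slide; the standard form (a reconstruction) is

d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

where the indicator \delta_{ij}^{(f)} is 0 if x_{if} or x_{jf} is missing (or if f is asymmetric binary and x_{if} = x_{jf} = 0), and 1 otherwise.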

24
Major Clustering Approaches
  • Partitioning approach
  • Partitions the objects and then evaluates the
    partitions by some criterion, e.g. minimizing
    the sum of squared errors. Examples: k-means,
    k-medoids
  • Hierarchical approach
  • Creates a hierarchical decomposition of the set
    of data (or objects) using some criterion.
    Examples: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
  • Density-based approach
  • Clusters objects based on connectivity and
    density functions. Examples: DBSCAN, OPTICS
  • Grid-Based (skip)
  • Model-Based (skip)

25
Partitioning Algorithms Basic Concept
  • Partitioning method: construct a partition of a
    database D of n objects into a set of k clusters
    that minimizes the sum of squared distances
  • Given k, find a partition of k clusters that
    optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all
    partitions
  • Heuristic methods
  • k-means: each cluster is represented by the
    center of the cluster
  • k-medoids or PAM (Partitioning Around Medoids):
    each cluster is represented by one of the objects
    in the cluster

26
K-Means
27
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    four steps:
  • Partition the objects into k nonempty subsets
  • Compute the seed points as the centroids (mean
    points) of the current clusters
  • Assign each object to the cluster with the
    nearest seed point
  • Go back to step 2; stop when the assignment no
    longer changes

28
Stopping/convergence criterion
  • no (or minimum) re-assignments of data points to
    different clusters,
  • no (or minimum) change of centroids, or
  • minimum decrease in the sum of squared error
    (SSE),

Equation (1): SSE = Σ_{j=1}^{k} Σ_{x ∈ C_j} dist(x, m_j)²,
where C_j is the j-th cluster, m_j is the centroid of
cluster C_j (the mean vector of all the data points in
C_j), and dist(x, m_j) is the distance between data
point x and centroid m_j.
(A runnable sketch of the algorithm follows below.)
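
A minimal, self-contained k-means sketch in Python (an illustration of the four steps and the SSE criterion above, not the textbook's own code; the function and parameter names are mine):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then
    recompute centroids, until the centroids stop changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick k distinct objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 4: stop when no centroid moves (one of the criteria above).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = (dists[np.arange(len(X)), labels] ** 2).sum()  # Equation (1)
    return labels, centroids, sse

# e.g. the 1-D example on the later "2-means Example" slide:
# kmeans(np.array([[1.], [2.], [5.], [6.], [7.]]), k=2)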
29
3-means Example Step 1
30
3-means Example Step 2
31
3-means Example Step 3
32
3-means Example Step 4
33
3-means Example Step 5
34
3-means Example 2, Step 6
35
2-means Example
  • For simplicity, 1-dimensional objects and k = 2.
  • Objects: 1, 2, 5, 6, 7
  • K-means:
  • Randomly select 5 and 6 as initial centroids
  • => Two clusters {1,2,5} and {6,7}; mean(C1) = 8/3,
    mean(C2) = 6.5
  • => {1,2} and {5,6,7}; mean(C1) = 1.5, mean(C2) = 6
  • => no change.
  • Aggregate dissimilarity (SSE) = 0.5² + 0.5² + 1²
    + 1² = 2.5

36
Comments on the K-Means Method
  • Strengths of k-means
  • Relatively efficient: O(tkn), where n is the #
    of objects, k is the # of clusters, and t is the
    # of iterations. Normally, k, t << n.
  • Often terminates at a local optimum.
  • Weaknesses of k-means
  • Applicable only when the mean is defined - what
    about categorical data?
  • Need to specify k, the number of clusters, in
    advance.
  • Unable to handle noisy data and outliers.
  • Not suitable for discovering clusters with
    non-convex shapes.



37
Variations of the K-Means Method
  • A few variants of k-means differ in
  • Selection of the initial k means.
  • Dissimilarity calculations.
  • Strategies to calculate cluster means.
  • Handling categorical data: k-modes (Huang, 1998)
  • Replacing means of clusters with modes.
  • Using new dissimilarity measures to deal with
    categorical objects.
  • Using a frequency-based method to update modes of
    clusters.
  • For a mixture of categorical and numerical data:
    the k-prototype method.

38
K-Medoid
39
Medoid - definition
  • A medoid is an actual point in the dataset that
    is centrally located and is therefore
    representative of the cluster.

40
k-medoid methods
  • The three best-known k-medoid methods are
  • PAM (Partitioning Around Medoids)
  • CLARA (Clustering LARge Applications)
  • CLARANS

41
PAM
  • Arbitrarily choose k objects as the initial
    medoids
  • Until no change, do
  • (Re)assign each object to the cluster of its
    nearest medoid
  • Randomly select a non-medoid object o', and
    compute the total cost E of swapping a medoid o
    with o'
  • If E < 0, swap o with o' to form the new set
    of k medoids

42
Swapping Cost
  • Measures whether o' is better than o as a medoid
  • Use the squared-error criterion
  • Compute E(o') - E(o)
  • A negative value means the swap brings benefit

43
PAM Example
Arbitrarily choose k objects as initial medoids
Assign each remaining object to the nearest medoid
K = 2
Randomly select a non-medoid object, O_random
Do loop until no change
Compute the total cost of swapping
Swap O and O_random if the quality is improved
(a compact PAM sketch follows below)
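
A compact PAM-style sketch in Python (an illustration of the loop above, with my own names; the real PAM examines all medoid/non-medoid swaps rather than one random swap per medoid, and here the cost E is the plain sum of distances to the nearest medoid):

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """k-medoids: keep a swap (medoid o, non-medoid o') whenever it
    lowers the total cost E; stop when no tried swap helps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances

    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()  # E: each object's distance to its nearest medoid

    medoids = list(rng.choice(n, size=k, replace=False))
    E = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            # randomly pick a non-medoid o' and try swapping it with medoid i
            o = rng.choice([p for p in range(n) if p not in medoids])
            candidate = medoids[:i] + [o] + medoids[i + 1:]
            E_new = total_cost(candidate)
            if E_new - E < 0:            # negative swapping cost: accept the swap
                medoids, E, improved = candidate, E_new, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, E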
44
Pros and Cons of PAM
  • PAM is more robust than k-means in the presence
    of noise and outliers
  • Medoids are less influenced by outliers
  • PAM is efficient for small data sets but does
    not scale well to large data sets
  • O(k(n-k)²) per iteration
  • Sampling-based method: CLARA

45
CLARA (Clustering LARge Applications)
  • Draw multiple samples of the data set, apply PAM
    on each sample, and return the best clustering
  • Performs better than PAM on larger data sets
  • Efficiency depends on the sample size
  • A good clustering of the samples may not be a
    good clustering of the whole data set

46
CLARANS (Clustering Large Applications based upon
RANdomized Search)
  • The problem space: a graph over clusterings
  • A vertex is a choice of k medoids from the n
    objects, so there are C(n, k) vertices in total
  • PAM searches the whole graph
  • CLARA searches some random sub-graphs
  • CLARANS climbs mountains (hill climbing over the
    graph)
  • Randomly sample a set and select k medoids
  • Consider neighbors of the medoids as candidates
    for new medoids
  • Use the sample set to verify
  • Repeat multiple times to avoid bad samples

47
  • Algorithm CLARANS
  • 1. Input parameters numlocal and maxneighbor.
    Initialize i to 1, and mincost to a large number.
  • 2. Set current to an arbitrary node in G_{n,k}.
  • 3. Set j to 1.
  • 4. Consider a random neighbor S of current, and
    based on Equation (5) calculate the cost
    differential of the two nodes.
  • 5. If S has a lower cost, set current to S, and go
    to Step (3).
  • 6. Otherwise, increment j by 1. If j <= maxneighbor,
    go to Step (4).
  • 7. Otherwise, compare the cost of current with
    mincost; if it is lower, set mincost to that cost
    and record current as the best node.
  • 8. Increment i by 1. If i > numlocal, output the
    best node and halt; otherwise go to Step (2).

48
  • Requires only the objects and a distance
    function
  • Medoid m_C: a representative object in a cluster C
  • Measure for the compactness of a cluster C:
    TD(C) = Σ_{p ∈ C} dist(p, m_C)
  • Measure for the compactness of a clustering:
    TD = Σ_{i=1}^{k} TD(C_i)

49
  • CLARA (Kaufmann and Rousseeuw, 1990)
  • Additional parameter: numlocal
  • Draws numlocal samples of the data set
  • Applies PAM on each sample
  • Returns the best of these sets of medoids as
    output
  • CLARANS (Ng and Han, 1994)
  • Two additional parameters: maxneighbor and
    numlocal
  • At most maxneighbor pairs (medoid M,
    non-medoid N) are evaluated in the algorithm
  • The first pair (M, N) for which TD_{N<->M} is
    smaller than TD_current is swapped (instead of
    the pair with the minimal value of TD_{N<->M})
  • Finding the local minimum with this procedure is
    repeated numlocal times
  • Efficiency: runtime(CLARANS) < runtime(CLARA) <
    runtime(PAM)

50
  CLARANS(objects DB, Integer k, Real dist,
          Integer numlocal, Integer maxneighbor)
    TD_best := infinity
    for r from 1 to numlocal do
      randomly select k objects as medoids; compute TD; i := 0
      while i < maxneighbor do
        randomly select a pair (medoid M, non-medoid N)
        compute changeOfTD := TD_{N<->M} - TD
        if changeOfTD < 0 then
          substitute M by N; TD := TD_{N<->M}; i := 0
        else i := i + 1
      if TD < TD_best then
        TD_best := TD; store the current medoids
    return the stored medoids

51
Hierarchical Methods
52
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

53
AGNES (Agglomerative Nesting)
  • Agglomerative, Bottom-up approach
  • Merge nodes that have the least dissimilarity
  • Go on in a non-descending fashion
  • Eventually all nodes belong to the same cluster

54
DIANA (Divisive Analysis)
  • Top-down approach
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

55
A Dendrogram
  • Shows how the clusters are merged hierarchically
  • Decomposes data objects into several levels of
    nested partitioning (a tree of clusters), called a
    dendrogram
  • A clustering of the data objects is obtained by
    cutting the dendrogram at the desired level;
    each connected component then forms a cluster
    (see the sketch below)
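
A small agglomerative (AGNES-style) example using SciPy; the library calls and the synthetic data are illustrative assumptions, not part of the slides:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),    # two loose groups of 2-D points
               rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="average")             # merge the least-dissimilar clusters first
dendrogram(Z)                                # the tree of nested merges
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the dendrogram into 2 clusters
print(labels)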

56
Recent Hierarchical Clustering Methods
  • Major weaknesses of agglomerative clustering
    methods
  • do not scale well: time complexity of at least
    O(n²), where n is the number of total objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering
  • BIRCH: uses a CF-tree and incrementally adjusts
    the quality of sub-clusters
  • ROCK: clusters categorical data by neighbor and
    link analysis
  • CHAMELEON: hierarchical clustering using dynamic
    modeling

57
BIRCH
  • BIRCH: Balanced Iterative Reducing and Clustering
    using Hierarchies
  • Incrementally constructs a CF (Clustering Feature)
    tree, a hierarchical data structure for
    multiphase clustering
  • Phase 1: scan the DB to build an initial in-memory
    CF-tree (a multi-level compression of the data
    that tries to preserve the inherent clustering
    structure of the data)
  • Phase 2: use an arbitrary clustering algorithm to
    cluster the leaf nodes of the CF-tree

58
Clustering Feature Vector in BIRCH
CF = (N, LS, SS) = (5, (16,30), (54,190))
for the points (3,4), (2,6), (4,5), (4,7), (3,8),
where N is the number of points, LS the linear sum,
and SS the per-dimension square sum
(verified in the sketch below)
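
A short check of the CF on this slide (my own sketch): N is the number of points, LS the per-dimension linear sum, and SS the per-dimension sum of squares.

import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
N = len(points)
LS = points.sum(axis=0)            # (16, 30)
SS = (points ** 2).sum(axis=0)     # (54, 190)
print(N, tuple(LS), tuple(SS))     # 5 (16.0, 30.0) (54.0, 190.0)

# CF vectors are additive: merging two sub-clusters just adds their
# (N, LS, SS) entries, which is what lets BIRCH maintain the CF-tree
# incrementally with a single scan of the data.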
59
CF-Tree in BIRCH
  • Clustering feature
  • a summary of the statistics for a given
    subcluster: the 0th, 1st and 2nd moments of the
    subcluster from a statistical point of view
  • registers the crucial measurements for computing
    clusters and utilizes storage efficiently
  • A CF-tree is a height-balanced tree that stores
    the clustering features for a hierarchical
    clustering
  • A nonleaf node in the tree has descendants or
    children
  • The nonleaf nodes store sums of the CFs of their
    children
  • A CF-tree has two parameters
  • Branching factor: specifies the maximum number of
    children
  • Threshold: the maximum diameter of sub-clusters
    stored at the leaf nodes

60
The CF Tree Structure
[CF-tree structure diagram: a root with branching
factor B = 7 and leaf capacity L = 6; non-leaf nodes
hold CF entries (CF1, CF2, CF3, ..., CF5), each with a
child pointer (child1, child2, child3, ..., child5);
leaf nodes hold CF entries (e.g. CF1, CF2, ..., CF6)
and are chained by prev/next pointers.]
61
BIRCH
  • Strengths
  • Scales linearly: finds a good clustering with a
    single scan
  • Improves the quality with a few additional scans
  • Weaknesses
  • handles only numeric data
  • the clustering may not be natural, due to the
    specified branching factor
  • clusters are assumed to be of spherical shape

62
Density Based Clustering
63
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies
  • DBSCAN
  • OPTICS (If there is time)

64
DBSCAN
  • DBSCAN (Density-Based Spatial Clustering of
    Applications with Noise) is a density-based
    algorithm.
  • Relies on a density-based notion of cluster: a
    cluster is defined as a maximal set of
    density-connected points
  • A point is a core point if it has more than a
    specified number of points (MinPts) within a
    specified radius (Eps)
  • These are points in the interior of a
    cluster
  • A border point has fewer than MinPts within Eps,
    but is in the neighborhood of a core point
  • A noise point is any point that is not a core
    point or a border point
    (a usage sketch follows below)
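
A minimal usage sketch with scikit-learn's DBSCAN (the library call and the synthetic data are assumptions for illustration; the slides describe the algorithm itself):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),     # two dense blobs ...
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-2, 6, (10, 2))])    # ... plus scattered noise

db = DBSCAN(eps=0.5, min_samples=4).fit(X)      # Eps and MinPts
print(db.labels_)                    # cluster id per point; -1 marks noise
print(db.core_sample_indices_[:10])  # indices of some core points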

65
DBSCAN Core, Border, and Noise Points
66
DBSCAN The Algorithm
  • Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p
    w.r.t. Eps and MinPts.
  • If p is a core point, a cluster is formed.
  • If p is a border point, no points are
    density-reachable from p, and DBSCAN visits the
    next point of the database.
  • Continue the process until all of the points have
    been processed.
  • Note: a point p is density-reachable from a point
    q w.r.t. Eps and MinPts if there is a chain of
    points p1, ..., pn with p1 = q and pn = p such
    that each p_(i+1) is within Eps of p_i and
    p1, ..., p_(n-1) are all core points.

67
DBSCAN Core, Border and Noise Points
Original Points
Point types: core, border, and noise
Eps = 10, MinPts = 4
68
When DBSCAN Works Well
Original Points
  • Resistant to Noise
  • Can handle clusters of different shapes and sizes

69
When DBSCAN Does NOT Work Well
MinPts = 4, Eps = 9.75
Original Points
MinPts = 4, Eps = 9.92
  • Varying densities
  • High-dimensional data

70
End