Chap. 7 Cluster Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Chap. 7 Cluster Analysis

Slides: 34
Provided by: jiaw185

Transcript and Presenter's Notes

1
Chap. 7 Cluster Analysis
  • Data Mining

2
What is Cluster Analysis?
  • Cluster
  • A collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Grouping a set of data objects into clusters
  • Clustering is unsupervised classification: there are no
    predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

3
Requirements of Clustering
  • Scalability
  • Ability to deal with different types of
    attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • High dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

4
Data Structures for Clustering
  • Data matrix
  • n objects × p attributes
  • Dissimilarity matrix
  • n objects × n objects
  • Represents the difference (distance) between objects

5
Different Types of Data
  • The dissimilarity d(i, j) between objects is computed
    differently for different types of data
  • Interval-scaled variables
  • Height, weight, length, temperature
  • Binary variables
  • Student?, married?
  • Nominal variables
  • Color, country
  • Ordinal variables
  • Professional rank
  • Ratio variables
  • Decay of radioactive element
  • Variables of mixed types

6
Interval-valued variables
  • Standardize data
  • Calculate the mean absolute deviation
  • Calculate the standardized measurement (z-score)
  • Example
  • (Age = 25, height = 178, weight = 65) → (-1.50, 0.80,
    -0.75)
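
The standardization formulas did not survive in this transcript. A minimal sketch, assuming the usual definitions (mean absolute deviation s = (1/n) Σ |x - m|, standardized measurement z = (x - m) / s); the sample ages are hypothetical and are not the data behind the slide's example:

```python
def standardize(values):
    """Return standardized measurements using the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: average distance of each value from the mean.
    mad = sum(abs(x - mean) for x in values) / n
    if mad == 0:
        # All values are identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(x - mean) / mad for x in values]


# Hypothetical ages: mean is 40 and the mean absolute deviation is 10.
ages = [25, 35, 45, 55]
z_scores = standardize(ages)
```

Dividing by the mean absolute deviation rather than the standard deviation dampens the influence of outliers, because deviations are not squared.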

7
Interval-valued variables
  • Compute distances
  • i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp)
    are two p-dimensional data objects
  • Minkowski distance
  • d(i, j) = (|x_i1 - x_j1|^q + ... + |x_ip - x_jp|^q)^(1/q)
  • q = 1: Manhattan distance
  • q = 2: Euclidean distance
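
A minimal sketch of the Minkowski distance and its two special cases. The function name and the sample points are illustrative choices, not part of the slides:

```python
def minkowski(a, b, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)


obj_i = (0.0, 0.0)
obj_j = (3.0, 4.0)
manhattan = minkowski(obj_i, obj_j, 1)  # q = 1: |3| + |4|
euclidean = minkowski(obj_i, obj_j, 2)  # q = 2: sqrt(9 + 16)
```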

8
Binary Variables
  • Simple matching coefficient
  • For the symmetric binary variable
  • Jaccard coefficient
  • For the asymmetric binary variable

9
Binary Variables
  • Example
  • Distance is computed based only on the asymmetric
    variables
  • Y or P → 1, N → 0
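
The example table is not preserved in this transcript. A minimal sketch of both coefficients in their dissimilarity form, assuming the usual contingency-table definitions: mismatches divided by all variables for the symmetric case, and mismatches divided by all variables except the 0/0 matches for the asymmetric (Jaccard) case. The two records below are hypothetical:

```python
def encode(record):
    """Map Y or P to 1 and N to 0."""
    return [1 if value in ("Y", "P") else 0 for value in record]


def binary_dissimilarity(a, b, asymmetric=False):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity."""
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_zero = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    # Asymmetric variables ignore 0/0 matches: a shared absence carries
    # no information about similarity.
    denominator = both_one + mismatches + (0 if asymmetric else both_zero)
    if denominator == 0:
        return 0.0
    return mismatches / denominator


# Hypothetical records with six binary attributes each.
first = encode(["Y", "N", "P", "N", "N", "N"])
second = encode(["Y", "N", "P", "N", "P", "N"])
d_symmetric = binary_dissimilarity(first, second)
d_asymmetric = binary_dissimilarity(first, second, asymmetric=True)
```

With one mismatch, two 1/1 matches, and three 0/0 matches, the symmetric value is 1/6 and the asymmetric value is 1/3.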

10
Nominal Variables
  • A generalization of binary variable
  • More than 2 states
  • Ex.: red, yellow, blue, green
  • Simple matching
  • d(i, j) = (p - m) / p, where m = # of matches and
    p = total # of variables
  • Encode into binary variables
  • Create a new binary variable for each of the M
    nominal states
  • Ex.: (color = blue) → (red = 0, yellow = 0, blue = 1,
    green = 0)
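
Both options can be sketched in a few lines. The helper names are illustrative, and the simple-matching form (p - m) / p is assumed to be the formula the slide intended:

```python
def nominal_dissimilarity(a, b):
    """Simple matching: fraction of variables on which two objects differ."""
    p = len(a)                                   # total number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)   # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Create one binary variable per nominal state."""
    return {state: int(state == value) for state in states}


colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
d = nominal_dissimilarity(("blue", "USA"), ("blue", "Canada"))
```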

11
Ordinal Variables
  • Can be treated like interval-scaled
  • Replace values by their rank
  • Ex.: Gold → 1, Silver → 2, Bronze → 3 (M = 3)
  • Map the range of each variable onto [0, 1]
  • Ex.: Gold → 0.0, Silver → 0.5, Bronze → 1.0
  • Compute the distance using methods for
    interval-scaled variables
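
A minimal sketch of the rank mapping, assuming the usual normalization z = (r - 1) / (M - 1) for rank r among M ordered states; it reproduces the medal example above:

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Replace an ordinal value by its rank, then map the rank onto [0, 1]."""
    num_states = len(ordered_states)
    rank = ordered_states.index(value) + 1  # ranks run from 1 to M
    if num_states == 1:
        return 0.0
    return (rank - 1) / (num_states - 1)


medals = ["Gold", "Silver", "Bronze"]
scaled = [ordinal_to_unit_interval(m, medals) for m in medals]
```

After this mapping, any interval-scaled distance (for example Manhattan or Euclidean) can be applied to the scaled values.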

12
Ratio-Scaled Variables
  • A measurement on a nonlinear scale (exponential)
  • Ae^(Bt) or Ae^(-Bt)
  • Logarithmic transformation
  • y_if = log(x_if)
  • Compute the distance as interval-scaled variables
  • Change to ordinal data
  • Treat their rank as interval-scaled
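
A minimal sketch of the logarithmic transformation. The constants A = 2 and B = 0.5 are hypothetical; the point is that values growing as Ae^(Bt) become evenly spaced after taking logs, so interval-scaled distances behave sensibly:

```python
import math


def log_transform(values):
    """Apply y = log(x) to strictly positive ratio-scaled values."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]


# Hypothetical measurements following A * e^(B * t) with A = 2, B = 0.5.
raw = [2.0 * math.exp(0.5 * t) for t in range(4)]
transformed = log_transform(raw)
# Consecutive differences of the transformed values all equal B.
gaps = [transformed[k + 1] - transformed[k] for k in range(3)]
```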

13
Variables of Mixed Types
  • A database may contain various types of variables
  • Use a weighted formula to combine their effects
  • d(i, j) = Σ_f δ_ij(f) d_ij(f) / Σ_f δ_ij(f)
  • δ_ij(f) = 0 if x_if = x_jf = 0 and f is
    asymmetric binary; otherwise δ_ij(f) = 1
  • If f is binary or nominal
  • d_ij(f) = 0 if x_if = x_jf; d_ij(f) = 1 otherwise
  • If f is interval-based
  • use the normalized distance
  • If f is ordinal or ratio-scaled
  • Compute ranks and normalize, then treat as
    interval-scaled
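
A minimal sketch of the weighted combination, assuming the standard form d(i, j) = Σ_f δ_ij(f) d_ij(f) / Σ_f δ_ij(f) and a range-normalized distance for interval variables. The `kinds` labels, the `ranges` argument, and the sample objects are hypothetical; ordinal and ratio-scaled variables are assumed to have been converted to [0, 1] beforehand and passed as interval variables:

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Combine per-variable dissimilarities for objects of mixed types.

    kinds[f] is one of "interval", "nominal", "sym_binary", "asym_binary".
    ranges[f] = (min, max) is required for interval variables only.
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        x, y = a[f], b[f]
        # Indicator is 0 when an asymmetric binary variable is 0 in both.
        if kind == "asym_binary" and x == 0 and y == 0:
            continue
        if kind == "interval":
            low, high = ranges[f]
            d = abs(x - y) / (high - low) if high > low else 0.0
        else:
            d = 0.0 if x == y else 1.0
        numerator += d
        denominator += 1.0
    return numerator / denominator if denominator else 0.0


kinds = ["interval", "nominal", "sym_binary", "asym_binary"]
ranges = {0: (20, 60)}
obj_a = (30, "red", 1, 0)
obj_b = (50, "blue", 1, 0)
d_mixed = mixed_dissimilarity(obj_a, obj_b, kinds, ranges)
```

Here the interval variable contributes 20/40 = 0.5, the nominal variable 1, the symmetric binary variable 0, and the asymmetric binary variable is skipped, giving 1.5 / 3 = 0.5.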

14
Categorization of Major Clustering Methods
  • Partitioning
  • Construct various partitions and then evaluate
    them
  • Hierarchical
  • Create a hierarchical decomposition of the set of
    data
  • Density-based
  • Grow a cluster if density exceeds threshold
  • Grid-based
  • Quantize object space into cells
  • Model-based
  • A model is hypothesized for each of the clusters,
    and the best fit of the data to that model is found

15
Partitioning Methods
  • Construct a partition of a database D of n
    objects into a set of k clusters
  • K-Means
  • Given k, partition objects into k nonempty
    subsets
  • Step 1: Choose k objects as initial cluster centers
  • Step 2: Assign each object to the cluster with the
    nearest center
  • Step 3: Update cluster centers as the mean point of the
    cluster
  • Step 4: Go back to Step 2; stop when there is no change
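
The steps above can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes squared Euclidean distance and, to stay deterministic, takes the first k objects as the initial centers (the slide does not say how they are chosen). The sample points are hypothetical:

```python
def kmeans(points, k, max_iter=100):
    """Plain k-means: assign to nearest center, recompute means, repeat."""
    # Assumption: use the first k objects as the initial cluster centers.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the procedure has converged
        labels = new_labels
        # Update each center as the mean point of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0),
          (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
labels, centers = kmeans(points, k=2)
```

Each pass costs O(kn) distance computations, which is where the O(tkn) total for t iterations comes from.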

16
K-Means

17
Discussion on the K-Means
  • Advantages
  • Scalable and efficient
  • O(tkn), where n = # of objects, k = # of clusters, and
    t = # of iterations; k, t << n
  • Disadvantages
  • Applicable only when mean is defined
  • Need to specify k, the number of clusters, in
    advance
  • Unable to handle noisy data and outliers
  • Not suitable to discover clusters with non-convex
    shapes

18
Hierarchical Methods
  • Grouping data into a tree
  • Does not require the number of clusters k as an
    input

19
Agglomerative Clustering
  • AGNES
  • Merge nodes that have the least dissimilarity
  • Eventually all nodes belong to the same cluster
  • Major weakness of agglomerative clustering
  • Does not scale well: time complexity of at least
    O(n^2)
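
A minimal sketch of bottom-up merging, assuming single-link dissimilarity (the distance between the closest pair of members) and Euclidean distance. The function name, the stopping parameter, and the sample points are illustrative. This naive version rescans all cluster pairs on every merge, which makes the scaling problem easy to see:

```python
def agglomerate(points, num_clusters=1):
    """Repeatedly merge the two clusters with the least dissimilarity."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = [[i] for i in range(len(points))]  # each object starts alone
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


points = [(0.0,), (1.0,), (5.0,), (6.5,)]
clusters, merges = agglomerate(points, num_clusters=2)
```

Running with `num_clusters=1` continues until all objects belong to a single cluster, and `merges` then records the full tree.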

20
Divisive Clustering
  • DIANA
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

21
Density-Based Methods
  • Group objects in dense regions
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Density parameters
  • Radius ε: distance used to determine the neighborhood
  • MinPts: minimum number of points in the neighborhood

22
Definitions
  • Core object
  • ε-neighborhood contains at least MinPts objects
  • Directly density-reachable
  • p is directly density-reachable from q if
  • q is a core object, and p is in the ε-neighborhood of
    q

[Figure: p is directly density-reachable from core object q; ε = 10, MinPts = 5]

23
Definitions
  • Density-reachable
  • p is density-reachable from q if
  • there is a chain of objects p1 (= q), p2, ..., pn (= p)
  • such that
  • p_(i+1) is directly density-reachable from p_i
  • Density-connected
  • p is density-connected to q if
  • there is an object o
  • such that
  • p and q are density-reachable from o

[Figure: left, p is density-reachable from q through p1; right, p and q are density-connected through o]

24
DBSCAN
  • A cluster - a maximal set of density-connected
    points
  • Discovers clusters of arbitrary shape in
    databases with noise
  • Arbitrarily select a point p
  • Retrieve all points in the ε-neighborhood of p
  • If p is a core object, a cluster is formed
  • From each core object p, iteratively collect
    directly density-reachable objects
    (may merge clusters)
  • Continue the process until no new points can be
    added
  • Problem with DBSCAN
  • Selecting parameters ε and MinPts
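
A minimal sketch of the procedure above, under some stated assumptions: Euclidean distance, a neighborhood that includes the point itself, a brute-force neighbor search, and points visited in index order rather than arbitrarily. The sample data and parameter values are hypothetical:

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    NOISE = -1

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def neighborhood(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighborhood(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id  # i is a core object, so a cluster is formed
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighborhood(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also core: keep expanding
        cluster_id += 1
    return labels


points = [(0.0,), (0.5,), (1.0,), (1.5,), (10.0,),
          (20.0,), (20.4,), (20.8,)]
labels = dbscan(points, eps=0.6, min_pts=3)
```

The isolated point at 10.0 stays labeled as noise, while the end points of each dense run are first marked as noise and then absorbed as border points.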

25
DBSCAN
[Figure: core, border, and outlier points; MinPts = 5]

26
Grid-Based Method
  • Using multi-resolution grid data structure
  • Space is quantized into finite number of cells
  • Fast processing time, independent of the number of
    objects

27
Grid-Based Method
  • STING
  • The spatial area is divided into rectangular
    cells
  • There are several levels - each cell at a high
    level is partitioned into smaller cells in the
    next lower level
  • Statistical info of each cell is calculated and
    stored beforehand and is used to answer queries
  • Advantages
  • Easy to parallelize, incremental update
  • O(k), where k is the number of grid cells at the
    lowest level
  • Disadvantages
  • All the cluster boundaries are either horizontal
    or vertical, and no diagonal boundary is detected

28
CLIQUE
  • Grid-based and density-based
  • Partition m-dimensional data space into
    non-overlapping rectangular units
  • A unit is dense if the fraction of total data
    points contained in the unit exceeds the input
    model parameter
  • A cluster is a maximal set of connected dense
    units within a subspace
  • Identify the clusters using the Apriori principle
  • Intersection of 2-D dense regions → candidate 3-D dense
    region

29
[Figure: CLIQUE example; dense units on a grid of age (20 to 60) vs. salary (in units of 10,000, 0 to 7); τ = 3]

30
Model-Based Methods
  • Attempt to fit the data to some mathematical
    model
  • Conceptual clustering
  • Produces a classification scheme
  • Finds characteristic description for each concept
    (class)
  • COBWEB
  • Neural network
  • Represent each cluster as an exemplar (prototype)
  • Competitive learning, SOM

31
COBWEB
  • Simple method of incremental conceptual learning
  • Objects - represented as attribute-value pairs
  • Output - classification tree. Each node refers to
    a concept
  • Category utility: sum of
  • Intraclass similarity (P(A = V | C))
  • High → more likely objects in C share V
  • Interclass dissimilarity (P(C | A = V))
  • High → less likely objects not in C have V
  • Method
  • For a new object, classify it
  • Compute category utility
  • Split, merge, or make new category so that the
    utility increases

32
COBWEB

33
Neural Networks
  • Competitive Learning, SOM (Self-organizing maps)
  • Clustering is performed by having several units
    (neurons) competing for the current object
  • The unit whose weight vector is closest to the
    object wins
  • The winner and its neighbors adjust their weights
  • Resemble processing that can occur in the brain
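
A minimal sketch of one competitive-learning step, using the update rule ΔW = c (X - W) from the slide's figure. It is a simplification: only the winning unit is adjusted, whereas a SOM would also move the winner's neighbors. The learning rate c = 0.5, the initial weights, and the input are hypothetical:

```python
def competitive_step(x, weights, c=0.5):
    """Find the unit closest to x and move its weight vector toward x."""
    # The unit whose weight vector is closest to the object wins.
    winner = min(
        range(len(weights)),
        key=lambda u: sum((a - w) ** 2 for a, w in zip(x, weights[u])),
    )
    # Update rule: W <- W + c * (X - W), applied to the winner only.
    weights[winner] = [w + c * (a - w) for a, w in zip(x, weights[winner])]
    return winner


weights = [[0.0, 0.0], [1.0, 1.0]]
winner = competitive_step([0.8, 0.6], weights)
```

The second unit wins here (squared distance 0.2 versus 1.0) and moves halfway toward the input, to roughly (0.9, 0.8).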

[Figure: input X = (x1, x2) is compared with weight vectors W1 and W2; the closer unit wins and updates its weights by ΔW1 = c (X - W1)]

34
Outlier Analysis
  • What are outliers?
  • The set of objects that do not comply with the
    general behavior or model of the data
  • Ex.: Age = -999, credit card usage per day = 25
  • Problem
  • Given n data points, find top k objects that are
    considerably dissimilar/exceptional/inconsistent
    with others
  • Approaches
  • Statistical-based: find objects that are very
    large/small given a distribution model
  • Distance-based: find objects that do not have
    enough neighbors
  • Deviation-based: find objects that deviate from the
    description of a group
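
A minimal sketch of the distance-based approach, assuming Euclidean distance and a brute-force scan. The `radius` and `min_neighbors` parameters and the sample data are hypothetical; the slides do not fix a particular threshold:

```python
def distance_based_outliers(points, radius, min_neighbors):
    """Return indices of points with too few neighbors within radius."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    outliers = []
    for i, p in enumerate(points):
        # Count the other points that lie within the given radius of p.
        count = sum(1 for j, q in enumerate(points)
                    if j != i and dist(p, q) <= radius)
        if count < min_neighbors:
            outliers.append(i)
    return outliers


values = [(1.0,), (1.2,), (0.9,), (1.1,), (25.0,)]
flagged = distance_based_outliers(values, radius=0.5, min_neighbors=2)
```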