Chap. 7 Cluster Analysis

Slides: 34

Provided by: jiaw185

1
Chap. 7 Cluster Analysis

Data Mining

2
What is Cluster Analysis?

Cluster
A collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no
predefined classes
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms

3
Requirements of Clustering

Scalability
Ability to deal with different types of
attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability

4
Data Structures for Clustering

Data matrix
n-object x p-attributes
Dissimilarity matrix
n-object x n-object
Represents difference
(distance) between objects

5
Different Types of Data

The dissimilarity d(i, j) between objects is
computed differently for different types of data
Interval-scaled variables
Height, weight, length, temperature
Binary variables
Student?, married?
Nominal variables
Color, country
Ordinal variables
Professional rank
Ratio variables
Decay of radioactive element
Variables of mixed types

6
Interval-valued variables

Standardize data
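Calculate the mean absolute deviation
s_f = (1/n) (|x_1f - m_f| + ... + |x_nf - m_f|), where m_f is the mean of variable f
Calculate the standardized measurement (z-score)
z_if = (x_if - m_f) / s_f
Example
(Age = 25, height = 178, weight = 65) → (-1.50, 0.80,
-0.75)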
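
The standardization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `standardize` and the sample ages are assumptions, and the data set behind the slide's example is not given, so its numbers are not reproduced here.

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: average distance of each value from the mean.
    # Assumes the values are not all identical (otherwise this is zero).
    mad = sum(abs(x - mean) for x in values) / n
    # Standardized measurement (z-score) of each value.
    return [(x - mean) / mad for x in values]


ages = [25, 30, 35, 40, 45]  # hypothetical sample, not from the slides
z_scores = standardize(ages)
```

Dividing by the mean absolute deviation rather than the standard deviation dampens the influence of outliers, because deviations are not squared.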

7
Interval-valued variables

Compute distances
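i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp)
are two p-dimensional data objects
Minkowski distance
d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
q = 1: Manhattan distance
q = 2: Euclidean distance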
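
As a small sketch of the distance family on this slide, the function below computes the Minkowski distance for any order q. The name `minkowski` and the sample points are illustrative assumptions.

```python
def minkowski(a, b, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)


i = (0.0, 0.0)  # hypothetical objects
j = (3.0, 4.0)
manhattan = minkowski(i, j, 1)  # q = 1
euclidean = minkowski(i, j, 2)  # q = 2
```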

8
Binary Variables

Simple matching coefficient
For the symmetric binary variable
Jaccard coefficient
For the asymmetric binary variable
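
The formulas for these two coefficients did not survive in the transcript. The sketch below uses the common textbook contingency counts, which are an assumption here: q (both objects are 1), r (only the first is 1), s (only the second is 1), and t (both are 0). Both functions return a dissimilarity, so 0 means identical.

```python
def binary_counts(a, b):
    """Count the four agreement/disagreement cases of two 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def simple_matching_dissimilarity(a, b):
    """Symmetric binary variables: 0-0 matches count as agreement."""
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def jaccard_dissimilarity(a, b):
    """Asymmetric binary variables: 0-0 matches are ignored."""
    q, r, s, _ = binary_counts(a, b)
    if q + r + s == 0:
        return 0.0  # no positive value in either object
    return (r + s) / (q + r + s)
```

For asymmetric variables (for example, a rare positive test result), two objects both being 0 says little about their similarity, which is why the Jaccard form drops t from the denominator.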

9
Binary Variables

Example
Distance is computed based only on the asymmetric
variables
Y or P → 1, N → 0

10
Nominal Variables

A generalization of binary variable
More than 2 states
Ex: red, yellow, blue, green
Simple matching
d(i, j) = (p - m) / p
m: # of matches, p: total # of variables
Encode into binary variables
Creating a new binary variable for each of the M
nominal states
Ex: (color = blue) → (red = 0, yellow = 0, blue =
1, green = 0)
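
Both options for nominal variables can be sketched briefly. The function names and the two-variable sample objects are assumptions for illustration; the color states come from the slide.

```python
def nominal_dissimilarity(a, b):
    """Simple matching: share of variables on which two objects differ."""
    p = len(a)                                  # total number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Encode one nominal value as a binary variable per state."""
    return {state: int(state == value) for state in states}


colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
```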

11
Ordinal Variables

Can be treated like interval-scaled
Replace values by their rank
Ex: Gold → 1, Silver → 2, Bronze → 3 (M = 3)
Map the range of each variable onto [0, 1]: z_if = (r_if - 1) / (M_f - 1)
Ex: Gold → 0.0, Silver → 0.5, Bronze → 1.0
Compute the distance using methods for
interval-scaled variables
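
A minimal sketch of the rank-and-normalize step, assuming the ordered list of states is known in advance. The function name `normalize_ranks` is hypothetical; the medal example follows the slide.

```python
def normalize_ranks(values, ordered_states):
    """Replace ordinal values by ranks 1..M, then map the ranks onto [0, 1]."""
    m = len(ordered_states)  # assumes at least two states
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


medals = ["Gold", "Silver", "Bronze"]
scaled = normalize_ranks(["Gold", "Silver", "Bronze"], medals)
```

The scaled values can then be passed to any interval-scaled distance, such as Manhattan or Euclidean.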

12
Ratio-Scaled Variables

A measurement on a nonlinear scale (exponential)
Ae^(Bt) or Ae^(-Bt)
Logarithmic transformation
y_if = log(x_if)
Compute the distance as interval-scaled variables
Change to ordinal data
Treat their rank as interval-scaled
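
To illustrate why the logarithmic transformation helps, the sketch below applies it to hypothetical measurements that grow exponentially. The constants A = 2 and B = 0.5 are assumptions chosen for the example.

```python
import math


def log_transform(values):
    """Apply y = log(x) to positive ratio-scaled measurements."""
    return [math.log(x) for x in values]


# Hypothetical measurements following A * e^(B * t) with A = 2, B = 0.5.
measurements = [2 * math.exp(0.5 * t) for t in range(4)]
transformed = log_transform(measurements)
# After the transform, consecutive values differ by the constant B,
# so the variable behaves like an interval-scaled one.
```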

13
Variables of Mixed Types

A database may contain various types of variables
Use a weighted formula to combine their effects
d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f), summed over all p variables f
δ_ij^(f) = 0 if x_if = x_jf = 0 and f is
asymmetric binary; otherwise δ_ij^(f) = 1
If f is binary or nominal
d_ij^(f) = 0 if x_if = x_jf; d_ij^(f) = 1 otherwise
If f is interval-based
use the normalized distance
If f is ordinal or ratio-scaled
Compute ranks and normalize, then treat as
interval-scaled
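
The combination rule for mixed types can be sketched as follows. Several details are assumptions: the kind labels, the `ranges` argument, and the choice of |x - y| divided by the variable's range as the normalized distance. Ordinal and ratio-scaled variables are assumed to have been converted to normalized ranks beforehand and are passed as "interval" with range 1.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Weighted average of per-variable dissimilarities.

    kinds[f] is "binary", "asymmetric_binary", "nominal" or "interval".
    ranges[f] is max - min of variable f (used only for "interval").
    """
    total = 0.0
    weight = 0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        # Indicator is 0 when both values are 0 on an asymmetric binary
        # variable, so that variable is left out of the average.
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue
        if kind == "interval":
            d = abs(x - y) / value_range
        else:
            d = 0.0 if x == y else 1.0
        total += d
        weight += 1
    return total / weight if weight else 0.0


kinds = ("nominal", "binary", "asymmetric_binary", "interval")
ranges = (None, None, None, 40.0)
d = mixed_dissimilarity(("red", 1, 0, 30.0), ("blue", 1, 0, 40.0), kinds, ranges)
```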

14
Categorization of Major Clustering Methods

Partitioning
Construct various partitions and then evaluate
them
Hierarchical
Create a hierarchical decomposition of the set of
data
Density-based
Grow a cluster if density exceeds threshold
Grid-based
Quantize object space into cells
Model-based
A model is hypothesized for each cluster, and the
best fit of the data to that model is found

15
Partitioning Methods

Construct a partition of a database D of n
objects into a set of k clusters
K-Means
Given k, partition the objects into k nonempty
subsets:
1. Choose k objects as initial cluster centers
2. Assign each object to the cluster with the
nearest center
3. Update each cluster center to the mean point of
the cluster
4. Go back to Step 2; stop when there is no change
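
The K-Means loop described on this slide can be sketched in plain Python. Taking the first k objects as initial centers and using squared Euclidean distance are assumptions made to keep the example deterministic; the `max_iter` cap is a safety limit that is not part of the slide.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    # Choose k objects as initial cluster centers (here: the first k).
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no change: stop
        assignment = new_assignment
        # Update each center to the mean point of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 10.0)]  # hypothetical data
centers, assignment = k_means(points, 2)
```

Each pass costs one distance computation per object per center, which is where the O(tkn) running time comes from.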

16
K-Means
17
Discussion on the K-Means

Advantages
Scalable and efficient
O(tkn), where n = # objects, k = # clusters, t = #
iterations; normally k, t << n.
Disadvantages
Applicable only when mean is defined
Need to specify k, the number of clusters, in
advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex
shapes

18
Hierarchical Methods

Grouping data into a tree
Does not require the number of clusters k as an
input

19
Agglomerative Clustering

AGNES
Merge nodes that have the least dissimilarity
Eventually all nodes belong to the same cluster
Major weakness of agglomerative clustering
Does not scale well: time complexity of at least
O(n^2)
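
A naive sketch of agglomerative merging is shown below. The slide only says to merge the nodes with the least dissimilarity; measuring cluster dissimilarity by the closest pair of members (single link) is an assumption, as are the function name and the `target_clusters` stopping parameter.

```python
import math


def agnes(points, target_clusters=1):
    """Repeatedly merge the two clusters with the least dissimilarity."""
    clusters = [[i] for i in range(len(points))]  # each object starts alone
    merges = []
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]  # hypothetical data
clusters, merges = agnes(points, target_clusters=2)
```

The nested search over all cluster pairs in every round makes the quadratic (or worse) cost visible.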

20
Divisive Clustering

DIANA
Inverse order of AGNES
Eventually each node forms a cluster on its own

21
Density-Based Methods

Group objects in dense region
Major features
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Density parameters
Radius ε: distance that determines the neighborhood
MinPts: minimum number of points in the neighborhood

22
Definitions

Core object
ε-neighborhood contains at least MinPts objects
Directly density-reachable
p is directly density-reachable from q if
q is a core object, and p is in the ε-neighborhood
of q

[Figure: objects p and q, with ε = 10 and MinPts = 5]
23
Definitions

Density-reachable
p is density-reachable from q if
there is a chain of objects p_1 (= q), p_2, ..., p_n (= p)
such that
p_(i+1) is directly density-reachable from p_i
Density-connected
p is density-connected to q if
there is an object o
such that
p and q are density-reachable from o

[Figure: p density-reachable from q via p1; p and q density-connected through o]
24
DBSCAN

A cluster - a maximal set of density-connected
points
Discovers clusters of arbitrary shape in
databases with noise
Arbitrarily select a point p
Retrieve all points in the ε-neighborhood of p
If p is a core object, a cluster is formed
From each core object p, iteratively collect
directly density-reachable objects
(may merge clusters)
Continue the process until no new points can be
added
Problem with DBSCAN
Selecting parameters ε and MinPts
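
The DBSCAN procedure on this slide can be sketched with a brute-force neighborhood search. Two assumptions are made: a point counts as a member of its own neighborhood, and noise is labeled -1. The names `region_query` and `dbscan` are illustrative.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (itself included)."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster  # i is a core object: a cluster is formed
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]  # hypothetical data
labels = dbscan(points, eps=1.5, min_pts=3)
```

Only core objects are expanded, so border points join a cluster without pulling in their own neighbors.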

25
DBSCAN
[Figure: core, border, and outlier points, with MinPts = 5]
26
Grid-Based Method

Using multi-resolution grid data structure
Space is quantized into finite number of cells
Fast processing time, independent of the number of
objects

27
Grid-Based Method

STING
The spatial area is divided into rectangular
cells
There are several levels - each cell at a high
level is partitioned into smaller cells in the
next lower level
Statistical info of each cell is calculated and
stored beforehand and is used to answer queries
Advantages
Easy to parallelize, incremental update
O(k), where k is the number of grid cells at the
lowest level
Disadvantages
All the cluster boundaries are either horizontal
or vertical, and no diagonal boundary is detected

28
CLIQUE

Both grid-based and density-based
Partition m-dimensional data space into
non-overlapping rectangular units
A unit is dense if the fraction of total data
points contained in the unit exceeds the input
model parameter
A cluster is a maximal set of connected dense
units within a subspace
Identify the clusters using the Apriori principle
Intersection of 2-D dense regions → candidate 3-D
dense regions
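
The Apriori idea behind CLIQUE can be illustrated one level lower, going from 1-D dense units to 2-D candidates. This is a partial sketch under stated assumptions: equal-width intervals per dimension, hypothetical function names and data, and no step that joins connected dense units into clusters.

```python
from collections import Counter
from itertools import combinations


def cell_of(value, width):
    """Index of the grid interval that contains value."""
    return int(value // width)


def dense_units_1d(points, widths, threshold):
    """Units (dimension, cell) holding more than `threshold` of all points."""
    n = len(points)
    dense = set()
    for dim, width in enumerate(widths):
        counts = Counter(cell_of(p[dim], width) for p in points)
        dense.update((dim, cell) for cell, c in counts.items() if c / n > threshold)
    return dense


def candidate_units_2d(dense_1d):
    """Apriori: a 2-D unit can be dense only if both 1-D projections are."""
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2) if u[0] != v[0]}


def dense_units_2d(points, widths, threshold, candidates):
    """Keep the candidates that are actually dense in the 2-D subspace."""
    n = len(points)
    dense = set()
    for (d1, c1), (d2, c2) in candidates:
        inside = sum(
            1 for p in points
            if cell_of(p[d1], widths[d1]) == c1 and cell_of(p[d2], widths[d2]) == c2
        )
        if inside / n > threshold:
            dense.add(((d1, c1), (d2, c2)))
    return dense


# Hypothetical (age, salary in 10,000s) records.
points = [(25, 3.0), (26, 3.5), (27, 3.2), (55, 6.0)]
widths = (10, 1)
dense_1d = dense_units_1d(points, widths, threshold=0.5)
dense_2d = dense_units_2d(points, widths, 0.5, candidate_units_2d(dense_1d))
```

Generating candidates only from lower-dimensional dense units avoids counting points in every cell of the full grid.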

29
[Figure: grid of salary (in units of 10,000; ticks 1 to 7) against age (ticks 20 to 60), with threshold parameter = 3]
30
Model-Based Methods

Attempt to fit the data to some mathematical
model
Conceptual clustering
Produces a classification scheme
Finds characteristic description for each concept
(class)
COBWEB
Neural network
Represent each cluster as an exemplar (prototype)
Competitive learning, SOM

31
COBWEB

Simple method of incremental conceptual learning
Objects - represented as attribute-value pairs
Output - classification tree. Each node refers to
a concept
Category utility: sum of
Intraclass similarity, P(A = V | C)
High → more likely that objects in C share V
Interclass dissimilarity, P(C | A = V)
High → less likely that objects not in C have V
Method
For a new object, classify it
Compute category utility
Split, merge, or make new category so that the
utility increases

32
COBWEB
33
Neural Networks

Competitive Learning, SOM (Self-organizing maps)
Clustering is performed by having several units
(neurons) competing for the current object
The unit whose weight vector is closest to the
object wins
The winner and its neighbors adjust their weights
Resemble processing that can occur in the brain

[Figure: input X = (x1, x2) and two units with weight vectors W1 and W2; the unit closest to X wins and updates its weights: ΔW1 = c (X - W1)]
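
One competitive-learning step can be sketched as follows: the unit whose weight vector is closest to the object wins, and its weights move toward the object by a fraction c. Updating only the winner is a simplification made here; a self-organizing map would also adjust the winner's neighbors. The function name and sample values are assumptions.

```python
def competitive_step(x, weights, c):
    """Find the winning unit for object x and move its weights toward x."""
    # The unit with the smallest squared distance to x wins.
    winner = min(
        range(len(weights)),
        key=lambda u: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[u])),
    )
    # Update rule: W <- W + c * (X - W), applied to the winner only.
    weights[winner] = [wi + c * (xi - wi) for xi, wi in zip(x, weights[winner])]
    return winner


weights = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical weight vectors W1, W2
winner = competitive_step([0.8, 1.0], weights, c=0.5)
```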
34
Outlier Analysis

What are outliers?
The set of objects that do not comply with the
general behavior or model of the data
Ex: Age = -999, credit card usage per day = 25
Problem
Given n data points, find top k objects that are
considerably dissimilar/exceptional/inconsistent
with others
Approaches
Statistical-based: find objects that are very
large/small given a distribution model
Distance-based: find objects that do not have
enough neighbors
Deviation-based: find objects that deviate from
the description of a group
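
The distance-based approach can be sketched with a brute-force neighbor count. The parameters `radius` and `min_neighbors`, the function name, and the sample data are assumptions for illustration; the slides do not fix a particular definition of "enough neighbors".

```python
import math


def distance_based_outliers(points, radius, min_neighbors):
    """Indices of points with fewer than min_neighbors other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers


points = [(0, 0), (0, 1), (1, 0), (1, 1), (50, 50)]  # hypothetical data
outliers = distance_based_outliers(points, radius=2.0, min_neighbors=2)
```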
